ReinWiFi: A Reinforcement-Learning-Based Framework for the Application-Layer QoS Optimization of WiFi Networks

Qianren Li, Bojie Lv, Yuncong Hong, and Rui Wang
Southern University of Science and Technology
Abstract

In this paper, a reinforcement-learning-based scheduling framework is proposed and implemented to optimize the application-layer quality-of-service (QoS) of a practical wireless local area network (WLAN) suffering from unknown interference. Particularly, application-layer tasks of file delivery and delay-sensitive communication, e.g., screen projection, in a WLAN with enhanced distributed channel access (EDCA) mechanism, are jointly scheduled by adjusting the contention window sizes and application-layer throughput limitation, such that their QoS, including the throughput of file delivery and the round trip time of the delay-sensitive communication, can be optimized. Due to the unknown interference and vendor-dependent implementation of the network interface card, the relation between the scheduling policy and the system QoS is unknown. Hence, a reinforcement learning method is proposed, in which a novel Q-network is trained to map from the historical scheduling parameters and QoS observations to the current scheduling action. It is demonstrated on a testbed that the proposed framework can achieve a significantly better QoS than the conventional EDCA mechanism.

I Introduction

Reinforcement learning (RL) for radio resource management has been attracting tremendous attention since it is a promising technique to tackle unknown system statistics and solve the prohibitive policy optimization problem with tolerable complexity and good performance. Moreover, the RL technique also has great potential to optimize a wireless system even without accurate or complete observation of the system state, which might happen in practical implementations.

A significant body of work has optimized the throughput, delay, or age-of-information (AoI) of wireless networks via RL. Most of these works assumed full knowledge of the system state in algorithm design, which is applicable to systems where the global system state can be collected at a centralized controller. On the other hand, RL has also been utilized to optimize the performance of wireless systems with distributed transmission scheduling, e.g., wireless fidelity (WiFi) systems. For instance, an adaptive channel contention mechanism was proposed for WiFi systems in [1], where a local RL agent was deployed at each user equipment (UE). The local agents adjusted the minimum contention window (MCW) size according to the global statistics of successful channel contention such that the transmission fairness among the agents could be ensured. Instead of global statistics, a distributed RL algorithm with the assistance of federated learning was proposed in [2] to adapt the channel contention according to the local channel state, such that the local throughput was optimized. Moreover, a deep multi-agent RL technique based on the QMIX algorithm [3] was proposed in [4] to improve network throughput while maintaining user fairness. In that work, the channel contention decision was made according to the history of the last transmission duration. To resolve the collision issue of distributed channel access, deep RL algorithms were proposed in [5] to determine the timing of doubling the contention window based on the estimated collision probability. In addition to adaptive channel contention, a double deep Q-network (DDQN) [6] based rate adaptation algorithm was proposed in [7] to improve network throughput, where the agent learned the optimal transmission rate based on the modulation and coding scheme (MCS) and frame loss rate.

Most of the above literature assumed knowledge of the physical (PHY) layer and media access control (MAC) layer system states. However, it might be challenging to obtain such knowledge in the scheduler design of a practical WiFi network. Moreover, the absence of knowledge on co-channel interference and the vendor-dependent implementation of WiFi adapters would also raise challenges in the optimization of scheduling policies.

In this paper, we shed some light on RL-based scheduling design for practical WiFi systems suffering from unknown co-channel interference. Particularly, a framework, namely ReinWiFi, is proposed for the scheduling of delay-sensitive communication tasks and file delivery tasks in the application layer of a WiFi network. In ReinWiFi, a controller periodically collects the past scheduling parameters and average quality-of-service (QoS) observations of all the application-layer tasks, and determines the rate limitation and contention window size for all the transmitters, such that the total throughput of file delivery tasks is maximized and the latency requirements of delay-sensitive tasks are ensured. Experiments show that the proposed framework can adapt to variations in task number, interfering traffic, and link quality, and significantly outperforms the conventional EDCA mechanism.

II System Model

II-A Deployment Scenario

The proposed ReinWiFi system is deployed in a WiFi network with multiple connected access points (APs) and UEs working on the same channel. Denote the number of the devices, including the APs and UEs, in the WiFi network as $U$, the set of these devices as $\mathcal{U}=\{u_{i}|i=0,1,\ldots,U-1\}$, and the communication link from the $i$-th device to the $j$-th one as the $(i,j)$-th link ($\forall u_{i},u_{j}\in\mathcal{U}$). The communication links can be from UE to AP, from AP to UE, or between UEs (i.e., WiFi Direct). We define $\mathcal{L}$ as the set of all communication links in the system and $\mathcal{L}_{i}$ as the set of communication links from device $u_{i}$. As a remark, one UE could simultaneously maintain the communication links to the AP and other UEs, where the transmission of the infrastructure and WiFi Direct modes is separated in the time domain.

The data traffic generated by the applications of UEs in $\mathcal{U}$ is referred to as communication tasks in this paper. For example, an application projecting the screen of a mobile phone to a laptop via WiFi Direct, e.g., Miracast [8], will raise a delay-sensitive task, where an application-layer packet (i.e., video frame) is generated and delivered periodically (the typical period is $16$ ms). Moreover, file sharing between two devices will raise a file delivery task. For convenience of elaboration, we define $\mathcal{T}_{i,j}^{f}$ and $\mathcal{T}_{i,j}^{r}$ as the universal sets of file delivery tasks and delay-sensitive tasks on the $(i,j)$-th link, respectively. A task is in the inactive state if there is no packet arrival or buffered file at the transmitter.

Because of the transmission latency constraint, the delay-sensitive tasks should be scheduled with higher priority than the file delivery ones. Hence, all the transmitters access the channel via the enhanced distributed channel access (EDCA) mechanism defined in IEEE 802.11e. Particularly, four access category (AC) queues, namely voice (VO), video (VI), best effort (BE), and background (BK), are adopted at all the transmitters. The transmission priorities of the four AC queues are differentiated by the values of arbitration inter-frame spacing (AIFS) and contention window (CW) size. As in practical systems, the file delivery tasks are scheduled with the BE priority, and the delay-sensitive tasks are scheduled with the VI priority. The latter has a smaller AIFS and CW size, leading to a higher probability of success in channel contention. As a remark, due to the distributed channel contention mechanism, it is infeasible to accurately control the packet transmission order among the devices of a WiFi network with commercial network interface cards (NICs). Instead, the packet transmission in the ReinWiFi system is scheduled in a stochastic manner by adapting the CW sizes of AC queues in each device.
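To illustrate why a smaller CW raises the chance of winning channel contention, the following minimal sketch computes the exact outcome probabilities for two contending transmitters that each draw a uniform backoff counter. It is a simplification: AIFS differences, retransmissions, and CW doubling are ignored, and the CW values 7 and 15 are example values chosen for illustration rather than settings taken from the testbed.

```python
from fractions import Fraction


def contention_outcome(cw_a: int, cw_b: int):
    """Exact win/tie probabilities for two stations in one contention round.

    Each station draws a backoff counter uniformly from {0, ..., cw}; the
    smaller counter transmits first, and equal counters collide.
    Simplifying assumption: AIFS and retransmissions are ignored.
    """
    total = (cw_a + 1) * (cw_b + 1)
    a_wins = sum(1 for a in range(cw_a + 1) for b in range(cw_b + 1) if a < b)
    ties = min(cw_a, cw_b) + 1
    b_wins = total - a_wins - ties
    return Fraction(a_wins, total), Fraction(b_wins, total), Fraction(ties, total)


# Example values only: a VI-like queue (CW = 7) against a BE-like queue (CW = 15).
p_vi, p_be, p_collision = contention_outcome(7, 15)
print(p_vi, p_be, p_collision)  # 23/32 7/32 1/16
```

Under these assumptions the queue with the smaller CW wins roughly 72% of the rounds, which is the stochastic priority that the controller tunes through the CW sizes.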

There are some other WiFi networks sharing the same channel in the coverage of the considered network. The traffic in these networks would degrade the QoS of the considered network, e.g., larger delivery latency and lower throughput. Denote the set of devices in the interfering networks as $\mathcal{U}_{I}$. The communications among the devices in $\mathcal{U}_{I}$, namely interfering traffic, cannot be scheduled by the ReinWiFi system. Instead, the ReinWiFi system is designed to deduce the interference level and adjust the transmission accordingly.

II-B Task Queuing Model

For each file delivery task, all the information bits to be delivered are saved in an application-layer buffer, and a user datagram protocol (UDP) socket is established at the very beginning of transmission. The data dispatch from the buffer to the UDP socket is controlled by a dispatcher. The UDP socket encapsulates the data received from the dispatcher into UDP datagrams and forwards them to the NIC driver for WiFi transmission. As a remark, new datagrams at the NIC may not be transmitted immediately. In fact, each NIC maintains four MAC-layer AC queues associated with the four transmission priorities. Arriving datagrams are saved in the corresponding queues and transmitted following the vendor's protocol. The queuing status of the NIC is usually not accessible in the application layer. Thus, the proposed system cannot know when the NIC completes the delivery of a datagram, and therefore cannot precisely control the transmission of a UDP datagram or an application-layer packet. As a result, the scheduling of the proposed system is designed based on the average observable performance in the application layer.

Specifically, the transmission time is organized into a sequence of scheduling periods, each with a duration of $T_{s}$ seconds. $T_{s}$ is sufficiently large to accommodate a number of MAC protocol data unit transmissions. Due to the invisibility of the NIC status, the QoS of a file delivery task is measured by its application-layer throughput in one scheduling period. Particularly, for the $m$-th file delivery task of the $(i,j)$-th link, its QoS in the $t$-th scheduling period, $r_{i,j}^{m}(t)$, is defined as the number of information bits transferred from the task buffer to the associated UDP socket. The dispatcher is designed to adaptively limit the throughput of the file delivery task such that delay-sensitive tasks could have a larger chance to access the channel. Hence, letting $b_{i,j}^{m}(t)$ be the throughput limitation of the $m$-th file delivery task of the $(i,j)$-th link in the $t$-th scheduling period, the dispatcher ensures that

$$r_{i,j}^{m}(t)\leq b_{i,j}^{m}(t). \tag{1}$$
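The constraint in (1) can be sketched as a per-period budget enforced in the application layer. The class below is a minimal illustration under stated assumptions: the name `Dispatcher`, its methods, and the bit counts are hypothetical and are not taken from the ReinWiFi implementation.

```python
class Dispatcher:
    """Caps the bits moved from a task buffer to its socket per scheduling period."""

    def __init__(self, limit_bits: int):
        self.limit_bits = limit_bits   # b(t): throughput limitation for this period
        self.sent_bits = 0             # r(t): bits dispatched so far in this period

    def start_period(self, limit_bits: int) -> None:
        """Apply the controller's new limit and reset the per-period counter."""
        self.limit_bits = limit_bits
        self.sent_bits = 0

    def dispatch(self, requested_bits: int) -> int:
        """Return how many of the requested bits may be forwarded now."""
        allowed = max(0, min(requested_bits, self.limit_bits - self.sent_bits))
        self.sent_bits += allowed
        return allowed


d = Dispatcher(limit_bits=10_000)
granted = [d.dispatch(4_000) for _ in range(4)]
print(granted, d.sent_bits)  # [4000, 4000, 2000, 0] 10000
```

Once the budget of the period is spent, further requests are granted zero bits until `start_period` is called with the limit chosen for the next scheduling period.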

For each delay-sensitive task, a task queue and UDP socket are established at the very beginning. The application-layer packets arrive at the task queue periodically with a fixed average data rate. The first packet in the queue is forwarded to the UDP socket for WiFi transmission as long as the socket is idle. Due to the lack of MAC-layer status, the measurement of the transmission latency of a packet could hardly be accurate. Hence, we use the round-trip time (RTT) as the QoS measurement of delay-sensitive communication tasks. Particularly, for each delay-sensitive task, an acknowledgment will be sent back from the receiver to the transmitter when an application-layer packet is completely received. Hence, the transmitter can calculate the RTTs of all packet transmissions. For the $m$-th delay-sensitive communication task of the $(i,j)$-th link ($\forall(i,j)\in\mathcal{L},m\in\mathcal{T}_{i,j}^{r}$), its QoS in the $t$-th scheduling period, $d_{i,j}^{m}(t)$, is defined as the average RTT of the packets transmitted in this scheduling period.

II-C Scheduling Model

Denote the CW sizes of the VI and BE priorities of the $i$-th device in the $t$-th scheduling period as $w_{i}^{\mathtt{VI}}(t)$ and $w_{i}^{\mathtt{BE}}(t)$, respectively. We shall focus on the joint scheduling of these channel contention parameters as well as the dispatchers' throughput limitations $\{b_{i,j}^{m}(t)|\forall(i,j)\in\mathcal{L},m\in\mathcal{T}_{i,j}^{f}\}$ in each scheduling period.

Particularly, each transmitter collects the QoS observations of its tasks at the end of each scheduling period and delivers them to a centralized controller, which can be implemented in an AP or other device. Not all the tasks in the universal task sets are in the active state. The average RTTs and throughputs of the inactive delay-sensitive and file delivery tasks are represented by a sufficiently large value and $0$, respectively. Hence, the aggregation of QoS observations received at the controller at the end of the $t$-th scheduling period can be represented as

$$\begin{split}\mathcal{O}_{t}\triangleq&\left\{r_{i,j}^{m}(t)\,|\,\forall(i,j)\in\mathcal{L},m\in\mathcal{T}_{i,j}^{f}\right\}\\&\cup\left\{d_{i,j}^{m}(t)\,|\,\forall(i,j)\in\mathcal{L},m\in\mathcal{T}_{i,j}^{r}\right\}.\end{split} \tag{2}$$

Due to the time-varying traffic of the interfering devices, the scheduling parameters, including the file throughput limitations and CW sizes, are adapted at the centralized controller in each scheduling period according to the system's scheduling parameters and QoS observations in the past $N$ scheduling periods. Specifically, the aggregation of scheduling parameters in the $t$-th scheduling period is represented as

$$\begin{split}\mathcal{A}_{t}\triangleq&\left\{b_{i,j}^{m}(t)\,|\,\forall(i,j)\in\mathcal{L},m\in\mathcal{T}_{i,j}^{f}\right\}\\&\cup\left\{w_{i}^{\mathtt{VI}}(t),w_{i}^{\mathtt{BE}}(t)\,|\,i=0,1,\ldots,U-1\right\}.\end{split} \tag{3}$$

Thus, at the very beginning of the $t$-th scheduling period, $\mathcal{A}_{t}$ ($\forall t$) is determined based on past scheduling parameters and QoS observations $\{(\mathcal{O}_{t-N},\mathcal{A}_{t-N}),(\mathcal{O}_{t-N+1},\mathcal{A}_{t-N+1}),\ldots,(\mathcal{O}_{t-1},\mathcal{A}_{t-1})\}$.
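The sliding window of the past $N$ observation and action pairs can be kept with a bounded queue. The sketch below is only an illustration: the class name `HistoryState` and the dictionary layout of observations and actions are assumptions, not part of the paper.

```python
from collections import deque


class HistoryState:
    """Keeps the last N (observation, action) pairs as the controller's input."""

    def __init__(self, n_periods: int):
        self.window = deque(maxlen=n_periods)

    def record(self, observation: dict, action: dict) -> None:
        """Append (O_t, A_t) at the end of period t; the oldest pair drops out."""
        self.window.append((observation, action))

    def ready(self) -> bool:
        """True once N pairs have been collected."""
        return len(self.window) == self.window.maxlen

    def as_state(self) -> tuple:
        """Return ((O_{t-N}, A_{t-N}), ..., (O_{t-1}, A_{t-1})), oldest first."""
        return tuple(self.window)


hist = HistoryState(n_periods=3)
for t in range(5):
    hist.record({"t": t}, {"t": t})
print([obs["t"] for obs, _ in hist.as_state()])  # [2, 3, 4]
```

A `deque` with `maxlen` discards the oldest pair automatically, so the controller always sees exactly the most recent $N$ periods once `ready()` returns true.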

III Problem Formulation

The proposed ReinWiFi system should successively make scheduling decisions for each scheduling period. Hence, it is formulated as a Markov decision process (MDP) in the following.

Definition 1 (System State)

In the $t$-th scheduling period ($\forall t$), the system state is defined as the aggregation of the QoS observations and scheduling parameters of the past $N$ scheduling periods. Thus, $\mathcal{S}_{t}\triangleq\left\{(\mathcal{O}_{t-N},\mathcal{A}_{t-N}),\ldots,(\mathcal{O}_{t-1},\mathcal{A}_{t-1})\right\}$.

Definition 2 (Scheduling Action and Policy)

Denote $\mathcal{A}_{t}$ defined in (3) as the scheduling action in the $t$-th scheduling period, and $\mathcal{A}_{t}^{i}\triangleq\left\{b_{i,j}^{m}(t)|\forall(i,j)\in\mathcal{L}_{i},m\in\mathcal{T}_{i,j}^{f}\right\}\cup\left\{w_{i}^{\mathtt{VI}}(t),w_{i}^{\mathtt{BE}}(t)\right\}$ as the local scheduling action of the $i$-th device in the $t$-th scheduling period. The scheduling policy $\Omega$ is a mapping from the state space to the action space, i.e., $\Omega(\mathcal{S}_{t})=\mathcal{A}_{t}$.

Moreover, the system cost of the $t$-th scheduling period is defined as

$$\begin{split}c_{t}(\mathcal{S}_{t},\mathcal{A}_{t})\triangleq&\sum_{(i,j)\in\mathcal{L}}\sum_{m\in\mathcal{T}_{i,j}^{r}}\mathds{1}(d_{i,j}^{m}(t)>\mathrm{D}_{i,j}^{m})\\&-\omega\sum_{(i,j)\in\mathcal{L}}\sum_{m\in\mathcal{T}_{i,j}^{f}}r_{i,j}^{m}(t),\end{split} \tag{4}$$

where $\omega$ is a weight and $\mathrm{D}_{i,j}^{m}$ is the maximum tolerable RTT of the $m$-th delay-sensitive task on the $(i,j)$-th link. The indicator function $\mathds{1}(\mathcal{E})$ is $1$ if the event $\mathcal{E}$ is true, and $0$ otherwise. Then, the average overall system cost is defined as the expected discounted sum of the system costs over all the scheduling periods with discount factor $\gamma$, i.e.,

$$\bar{C}(\Omega)=\lim_{T\rightarrow\infty}\mathbb{E}\bigg[\sum_{t=1}^{T}\gamma^{t-1}c_{t}(\mathcal{S}_{t},\Omega(\mathcal{S}_{t}))\bigg]. \tag{5}$$
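As a numerical illustration of (4) and of the sum inside (5), the sketch below evaluates the per-period cost and a finite-horizon discounted sum for one sample path. All numbers, units, and function names are hypothetical; the expectation and the limit in (5) are not computed.

```python
def period_cost(rtts, rtt_limits, throughputs, omega):
    """Cost (4): number of RTT violations minus omega times total throughput."""
    violations = sum(1 for d, d_max in zip(rtts, rtt_limits) if d > d_max)
    return violations - omega * sum(throughputs)


def discounted_cost(costs, gamma):
    """Finite-horizon sample of the sum in (5); costs[0] is the cost at t = 1."""
    return sum(gamma ** k * c for k, c in enumerate(costs))


# Hypothetical numbers: RTTs in ms, throughputs in Mbit per period.
c1 = period_cost(rtts=[12.0, 30.0], rtt_limits=[16.0, 16.0],
                 throughputs=[20.0, 10.0], omega=0.01)
c2 = period_cost(rtts=[12.0, 14.0], rtt_limits=[16.0, 16.0],
                 throughputs=[15.0, 5.0], omega=0.01)
print(c1, c2, discounted_cost([c1, c2], gamma=0.9))
```

In the first period one delay-sensitive task exceeds its RTT limit, so the violation term dominates; in the second period no limit is exceeded and the cost is negative, reflecting only the throughput reward weighted by $\omega$.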

For convenience of elaboration, it is assumed that the system has run for at least $N$ scheduling periods before the first scheduling period, such that there are sufficient QoS observations in the system state. As a result, the controller design of the ReinWiFi system can be formulated as

$$\textbf{Problem 1:}\ \min_{\Omega}\ \bar{C}(\Omega). \tag{6}$$

The Bellman equation for the above MDP is given by

$$Q(\mathcal{S}_{t},\mathcal{A}_{t})=\mathbb{E}_{\mathcal{S}_{t+1}}\bigg[c_{t}(\mathcal{S}_{t},\mathcal{A}_{t})+\gamma\min_{\mathcal{A}^{\prime}}Q(\mathcal{S}_{t+1},\mathcal{A}^{\prime})\bigg], \tag{7}$$

where $Q(\mathcal{S}_{t},\mathcal{A}_{t})$ is the Q-function with system state $\mathcal{S}_{t}$ and action $\mathcal{A}_{t}$. Moreover, the optimal scheduling policy is given by

$$\Omega^{*}(\mathcal{S})=\operatorname*{arg\,min}_{\mathcal{A}}Q(\mathcal{S},\mathcal{A}). \tag{8}$$
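The paper approximates the Q-function with a Q-network. The sketch below swaps in a tabular Q-table over a tiny, discrete toy problem, only to illustrate how a sampled version of (7) and the greedy rule (8) operate on costs (hence `min` rather than `max`). The learning rate `alpha`, the state label, and the action labels are hypothetical.

```python
from collections import defaultdict


def q_update(q, state, action, cost, next_state, actions, gamma=0.9, alpha=0.5):
    """One sampled Bellman backup for a cost-minimizing Q-function."""
    target = cost + gamma * min(q[(next_state, a)] for a in actions)
    q[(state, action)] += alpha * (target - q[(state, action)])


def greedy_action(q, state, actions):
    """Rule (8): pick the action with the smallest Q-value."""
    return min(actions, key=lambda a: q[(state, a)])


actions = ["small_cw", "large_cw"]  # hypothetical discrete actions
q = defaultdict(float)              # unseen entries start at 0
q_update(q, "busy", "small_cw", cost=1.0, next_state="busy", actions=actions)
q_update(q, "busy", "large_cw", cost=-0.5, next_state="busy", actions=actions)
best = greedy_action(q, "busy", actions)
print(best)  # large_cw
```

After the two updates the table holds 0.5 for `small_cw` and -0.25 for `large_cw`, so the greedy rule selects the action with the lower expected cost.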

Given the past scheduling actions and QoS observations (i.e., the system state), it is still difficult to accurately predict the relation between the scheduling action and task QoS in the current scheduling period. This is mainly because of the unknown interfering traffic and random channel contention. As a result, it is impossible to solve the above Bellman equation without any trial on the network performance. In this paper, we shall rely on the RL method to learn this unknown relation with the assistance of a preliminary observation dataset $\mathscr{S}^{s}$.

Particularly, before the optimization, the dataset $\mathscr{S}^{s}$ is collected from $M$ scheduling periods experiencing heterogeneous interfering traffic and link quality (e.g., the distances of links in $\mathcal{L}$ change due to mobility). Each of the scheduling periods (say the $\tau$-th one) is divided into two phases. In the first phase, a fixed testing scheduling action $\mathcal{A}^{p}$ is applied, and the corresponding QoS observation $\mathcal{O}^{p}_{\tau}$ is obtained; in the second phase, a random scheduling action $\mathcal{A}^{s}_{\tau}$ drawn from a certain distribution is applied, and another QoS observation $\mathcal{O}^{s}_{\tau}$ is obtained. Hence, the dataset $\mathscr{S}^{s}$ can be expressed as $\mathscr{S}^{s}\triangleq\left\{(\mathcal{O}^{p}_{\tau},\mathcal{A}^{p},\mathcal{O}^{s}_{\tau},\mathcal{A}^{s}_{\tau})|\tau=1,2,\ldots,M\right\}$.
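The two-phase collection procedure can be sketched as a simple loop. Everything below is an illustrative assumption: `run_period` stands for a hypothetical hook that applies an action to the network for one phase and returns the QoS observation, and the toy network and action sampler exist only so that the sketch runs without hardware.

```python
import random


def collect_dataset(run_period, test_action, sample_action, num_periods, seed=0):
    """Build [(O_p, A_p, O_s, A_s), ...] using the two-phase procedure."""
    rng = random.Random(seed)
    dataset = []
    for _ in range(num_periods):
        obs_p = run_period(test_action)   # phase 1: fixed testing action
        action_s = sample_action(rng)     # phase 2: randomly drawn action
        obs_s = run_period(action_s)
        dataset.append((obs_p, test_action, obs_s, action_s))
    return dataset


# Toy stand-ins (not a WiFi model): the observation simply echoes the action.
def fake_network(action):
    return {"throughput": float(action["b"]), "rtt": float(action["cw"])}


def uniform_action(rng):
    return {"b": rng.choice([10, 20, 40]), "cw": rng.choice([7, 15, 31])}


data = collect_dataset(fake_network, {"b": 20, "cw": 15}, uniform_action,
                       num_periods=4)
print(len(data), data[0][1])  # 4 {'b': 20, 'cw': 15}
```

Each tuple pairs the observation under the common testing action with the observation under a random action from the same period, so both are exposed to similar interference and link conditions.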

IV Q-Network for Online Scheduling

In this section, a novel Q-network design is proposed to approximate the Q-function. In order to accelerate the convergence of training and improve the scheduling performance, all the possible system performance of one scheduling period is divided into $K$ regions, and the inputs of the Q-network include not only the system state but also the performance region indices of the past $N$ scheduling periods.

Hence, the utilization of the proposed Q-network in the transmission scheduling can be divided into two stages. In the first stage, namely the offline stage, the performance regions are trained via the preliminary observation dataset $\mathscr{S}^{s}$, and the Q-network is then trained via $\mathscr{S}^{s}$ in each of the performance regions. In the second stage, namely the online stage, the Q-network is applied to the transmission scheduling and fine-tuned according to the online QoS observations.

In this section, the performance region quantization is introduced first, followed by the structure of the Q-network. The hybrid offline and online training of Q-network is elaborated in Section V.

IV-A Performance Region Quantization

The QoS observations with the testing scheduling action $\mathcal{A}^{p}$ are first extracted from the preliminary observation dataset $\mathscr{S}^{s}$ as $\mathscr{S}^{p}\triangleq\left\{(\mathcal{O}^{p}_{\tau},\mathcal{A}^{p})\,|\,\tau=1,2,\ldots,M\right\}$. The $K$-means clustering method [9] is then adopted to group the QoS observations in $\mathscr{S}^{p}$ into $K$ clusters. Denote the mean and variance of the observed throughputs (for the file delivery tasks) in $\mathscr{S}^{p}$ as $\bar{r}$ and $\sigma_{r}^{2}$, respectively, and the mean and variance of the RTTs (for the delay-sensitive tasks) as $\bar{d}$ and $\sigma_{d}^{2}$, respectively. The performance region quantization can be achieved by finding the $K$ cluster centers of the QoS observations in $\mathscr{S}^{p}$ as follows:

$$\{\mu_{1}^{*},\ldots,\mu_{K}^{*}\}=\operatorname*{arg\,min}_{\mu_{1},\ldots,\mu_{K}}\ \sum_{k=1}^{K}\sum_{\tau=1}^{M}\|\phi(\mathcal{O}^{p}_{\tau})-\mu_{k}\|^{2}, \tag{9}$$

where $\phi(\mathcal{O}^{p}_{\tau})$ denotes the vectorization of the normalized QoS observations in $\mathcal{O}^{p}_{\tau}$. Particularly, $\phi(\mathcal{O}^{p}_{\tau})\triangleq\left(\mathbf{r}_{\tau}^{p},\mathbf{d}_{\tau}^{p}\right)$, where the row vector $\mathbf{r}_{\tau}^{p}$ vectorizes the normalized throughputs of all file delivery tasks in $\mathcal{O}^{p}_{\tau}$,

$$\left\{\frac{r_{i,j}^{m,p}(\tau)-\bar{r}}{\sigma_{r}}\,\bigg|\,\forall(i,j)\in\mathcal{L},\ m\in\mathcal{T}_{i,j}^{f},\ r_{i,j}^{m,p}(\tau)\in\mathcal{O}^{p}_{\tau}\right\},$$

and the row vector $\mathbf{d}_{\tau}^{p}$ vectorizes the normalized RTTs of all the delay-sensitive tasks in $\mathcal{O}^{p}_{\tau}$,

$$\left\{\frac{d_{i,j}^{m,p}(\tau)-\bar{d}}{\sigma_{d}}\,\bigg|\,\forall(i,j)\in\mathcal{L},\ m\in\mathcal{T}_{i,j}^{r},\ d_{i,j}^{m,p}(\tau)\in\mathcal{O}^{p}_{\tau}\right\}.$$

With $\{\mu_{1}^{*},\ldots,\mu_{K}^{*}\}$, the performance region index of a scheduling period can be determined according to

$$\hat{\psi}=\operatorname*{arg\,min}_{k}\ \|\phi(\widehat{\mathcal{O}})-\mu_{k}^{*}\|^{2}, \tag{10}$$

where $\widehat{\mathcal{O}}$ is the aggregation of QoS observations with the testing scheduling action $\mathcal{A}^{p}$ in the scheduling period.
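As an illustration of the quantization in (9) and (10), the following minimal Python sketch normalizes synthetic QoS observations, fits $K$ cluster centers with plain Lloyd iterations, and assigns a region index to each scheduling period. It uses the standard $K$-means rule, in which each observation is assigned to its nearest center. The function names, the synthetic data, and the array layout are assumptions made for this sketch and are not taken from the ReinWiFi implementation; indices are 0-based here, whereas the text counts regions from $1$ to $K$.

```python
import numpy as np

def normalize(throughputs, rtts):
    """Z-score throughputs and RTTs with dataset-wide statistics, then stack them."""
    r = (throughputs - throughputs.mean()) / throughputs.std()
    d = (rtts - rtts.mean()) / rtts.std()
    return np.hstack([r, d])  # one row phi(O_tau^p) per scheduling period

def region_index(phi, centers):
    """Eq. (10): index of the nearest cluster center for each row of phi."""
    dist = np.linalg.norm(phi[:, None, :] - centers[None, :, :], axis=2)
    return dist.argmin(axis=1)

def fit_centers(phi, k, iters=50, seed=0):
    """Plain Lloyd iterations: assign rows to the nearest center, then re-average."""
    rng = np.random.default_rng(seed)
    centers = phi[rng.choice(len(phi), size=k, replace=False)]
    for _ in range(iters):
        labels = region_index(phi, centers)
        for j in range(k):
            if np.any(labels == j):  # keep the old center if a cluster empties
                centers[j] = phi[labels == j].mean(axis=0)
    return centers

# Synthetic stand-in data (assumption): 2 throughputs in Mbps and 3 RTTs in ms
# per scheduling period, drawn from a "good" and a "bad" interference regime.
rng = np.random.default_rng(1)
good = np.hstack([rng.normal(200, 5, (40, 2)), rng.normal(5, 0.5, (40, 3))])
bad = np.hstack([rng.normal(80, 5, (40, 2)), rng.normal(30, 2, (40, 3))])
obs = np.vstack([good, bad])

phi = normalize(obs[:, :2], obs[:, 2:])
centers = fit_centers(phi, k=2)
labels = region_index(phi, centers)
```

In the online stage, the same `region_index` call would be applied to the observations collected under the testing action in the current scheduling period.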

Remark 1

Note that the QoS observations of the testing scheduling action $\mathcal{A}^{p}$ must be collected to determine the performance region index of a scheduling period. In the online stage, one short slot can therefore be reserved in each scheduling period to apply the testing scheduling action $\mathcal{A}^{p}$.

IV-B Q-Network Structure

The input of the proposed Q-network is the extended system state of the current scheduling period, which is defined below:

Definition 3 (Extended System State)

In the $t$-th scheduling period ($\forall t$) of either offline or online training, the extended system state is defined as $\hat{\mathcal{S}}_{t}\triangleq\left\{(\hat{\psi}_{t-N},\mathcal{O}_{t-N},\mathcal{A}_{t-N}),\ldots,(\hat{\psi}_{t-1},\mathcal{O}_{t-1},\mathcal{A}_{t-1})\right\}$, where $\hat{\psi}_{t-i}$ ($i=1,2,\ldots,N$) is the performance region index.

The first part of the Q-network is a multi-head attention layer [10], which is trained to refine the performance region indices in the extended system state. The refined extended system state is then fed into three sequential fully connected layers with $256$ nodes and the ReLU activation function.

In order to address the issue of the huge action space, we adopt the following linear approximation structure on the Q-function in the output of the Q-network:

$$Q(\hat{\mathcal{S}},\mathcal{A})\approx\sum_{i\in\mathcal{U}}Q^{i}(\hat{\mathcal{S}},\mathcal{A}^{i}), \tag{11}$$

where $Q^{i}(\hat{\mathcal{S}},\mathcal{A}^{i})$ is referred to as the local Q-function of the $i$-th device. Hence, the Q-network output consists of $U$ action clusters, one for each of the $U$ devices. Each action cluster provides the values of the corresponding local Q-function for all possible local actions. As a result, the optimized local action of the $i$-th device ($\forall i$) in the $t$-th scheduling period of either offline or online training can be obtained by minimizing the local Q-function, i.e.,

$$\mathcal{A}^{i}_{t}=\operatorname*{arg\,min}_{\mathcal{A}^{i}}Q^{i}(\hat{\mathcal{S}}_{t},\mathcal{A}^{i}). \tag{12}$$
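To see why the additive structure in (11) keeps the action selection in (12) cheap, the following small Python sketch compares the per-device minimization with a brute-force search over the joint action space. The array `local_q` is a random stand-in for the $U$ action clusters at the Q-network output, and the sizes are illustrative assumptions rather than values from the paper.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
num_devices, num_local_actions = 3, 4  # illustrative sizes (assumption)

# local_q[i, a] stands in for Q^i(S_hat, a): one "action cluster" per device.
local_q = rng.normal(size=(num_devices, num_local_actions))

# Eq. (12): each device minimizes its own local Q-function independently.
greedy = tuple(int(a) for a in local_q.argmin(axis=1))

def joint_q(joint_action):
    """Eq. (11): the joint Q-value is approximated by the sum of local Q-values."""
    return sum(local_q[i, a] for i, a in enumerate(joint_action))

# Brute force over all num_local_actions ** num_devices joint actions.
best_joint = min(product(range(num_local_actions), repeat=num_devices), key=joint_q)

print(greedy == best_joint)
```

Under this decomposition the network needs only `num_devices * num_local_actions` outputs, whereas a Q-network over joint actions would need `num_local_actions ** num_devices` outputs.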

V Hybrid Q-Learning

$$L^{I}(\boldsymbol{\theta}^{w}_{k})=\frac{1}{|\mathcal{O}^{s}_{\tau}|}\left[\alpha\sum_{(i,j)\in\mathcal{L}}\sum_{m\in\mathcal{T}_{i,j}^{f}}\left(\hat{r}_{i,j}^{m}(\mathcal{A};\boldsymbol{\theta}^{w}_{k})-r_{i,j}^{m}(\tau)\right)^{2}+\sum_{(i,j)\in\mathcal{L}}\sum_{n\in\mathcal{T}_{i,j}^{r}}\left(\hat{d}_{i,j}^{n}(\mathcal{A};\boldsymbol{\theta}^{w}_{k})-\min\left\{d_{i,j}^{n}(\tau),\beta\mathrm{D}^{n}_{i,j}\right\}\right)^{2}\right]. \tag{13}$$

$$L^{q}(\boldsymbol{\theta}^{q}_{t})=\mathbb{E}\left[\left(c_{t}\left(\mathcal{S}_{t},\mathcal{A}_{t}\right)+\gamma\sum_{i\in\mathcal{U}}\min_{{\mathcal{A}^{i}}'}Q^{i}\left(\hat{\mathcal{S}}_{t+1},{\mathcal{A}^{i}}';\boldsymbol{\theta}^{q,-}_{t}\right)-\sum_{i\in\mathcal{U}}Q^{i}\left(\hat{\mathcal{S}}_{t},\mathcal{A}^{i}_{t};\boldsymbol{\theta}^{q}_{t}\right)\right)^{2}\right]. \tag{14}$$

The Q-network is first trained in the offline stage based on the dataset $\mathscr{S}^{s}$, and then tuned in the online stage.

V-A Offline Imitation Learning and Q-Network Training

To facilitate the offline training, the performance region indices are calculated for all the scheduling periods in $\mathscr{S}^{s}$ according to (10). Denoting the performance region index of the $\tau$-th scheduling period in $\mathscr{S}^{s}$ as $\hat{\psi}_{\tau}^{s}$, the preliminary dataset $\mathscr{S}^{s}$ can be rewritten as

$$\tilde{\mathscr{S}}^{s}\triangleq\left\{(\hat{\psi}_{\tau}^{s},\mathcal{O}_{\tau}^{s},\mathcal{A}^{s}_{\tau})\,|\,\tau=1,2,\ldots,M\right\} \tag{15}$$

for notational convenience. Moreover, the dataset $\tilde{\mathscr{S}}^{s}$ can be further divided into $K$ subsets as

$$\tilde{\mathscr{S}}^{s}_{k}\triangleq\left\{(k,\mathcal{O}_{\tau}^{s},\mathcal{A}^{s}_{\tau})\,|\,\forall\hat{\psi}_{\tau}^{s}=k\right\}\subset\tilde{\mathscr{S}}^{s},\quad k=1,\ldots,K. \tag{16}$$

Since the subsets $\tilde{\mathscr{S}}^{s}_{k}$ ($k=1,2,\ldots,K$) may not be sufficiently large for training the Q-network in all the performance regions, the imitation learning method is introduced. Particularly, we first train $K$ deep neural networks (namely imitators), each of which consists of $10$ fully connected layers with $256$ nodes per layer, to imitate the relation between the scheduling actions and QoS observations in the $K$ performance regions, respectively. Denote the imitators as $f(\mathcal{A};\boldsymbol{\theta}^{w}_{k})$, $k=1,2,\ldots,K$, where $\mathcal{A}$ is the input action and $\boldsymbol{\theta}^{w}_{k}$ represents the network parameters. The output of imitator $f(\mathcal{A};\boldsymbol{\theta}^{w}_{k})$ is trained to approximate the QoS observations of the system in the $k$-th performance region with input action $\mathcal{A}$. Then, the Q-network can be trained via the $K$ imitators.

Imitator training: The $k$-th imitator ($k=1,2,\ldots,K$) is trained by $\tilde{\mathscr{S}}^{s}_{k}$. Let $\hat{r}_{i,j}^{m}(\mathcal{A};\boldsymbol{\theta}^{w}_{k})$ and $\hat{d}_{i,j}^{n}(\mathcal{A};\boldsymbol{\theta}^{w}_{k})$ be the throughput of the $m$-th file delivery task and the RTT of the $n$-th delay-sensitive task of the $(i,j)$-th link, respectively, in the output of the $k$-th imitator with input action $\mathcal{A}$. The loss function $L^{I}$ is defined in (13), where $r^{m}_{i,j}(\tau),d^{n}_{i,j}(\tau)\in\mathcal{O}_{\tau}^{s}$, $\alpha$ and $\beta$ are both weights, and the minimization limits the range of the RTTs.
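A worked instance of the imitator loss in (13) may help to show the effect of the RTT clipping. The sketch below evaluates the loss for a single sample with one file delivery task and two delay-sensitive tasks. All numeric values, including the choices of $\alpha$ and $\beta$, are made up for illustration, and treating $|\mathcal{O}^{s}_{\tau}|$ as the number of QoS observations in the sample is an assumption of this sketch.

```python
import numpy as np

def imitator_loss(r_hat, d_hat, r_obs, d_obs, d_max, alpha, beta):
    """Eq. (13) for one sample: weighted squared errors with clipped RTT targets."""
    d_target = np.minimum(d_obs, beta * d_max)  # min{d, beta * D} limits the RTT range
    n_obs = r_obs.size + d_obs.size             # |O_tau^s| (assumed: observation count)
    throughput_term = alpha * np.sum((r_hat - r_obs) ** 2)
    rtt_term = np.sum((d_hat - d_target) ** 2)
    return (throughput_term + rtt_term) / n_obs

# Illustrative values only (not measurements from the paper).
r_hat = np.array([100.0])        # predicted throughput, Mbps
r_obs = np.array([110.0])        # observed throughput, Mbps
d_hat = np.array([20.0, 30.0])   # predicted RTTs, ms
d_obs = np.array([18.0, 500.0])  # observed RTTs, ms; the second one is an outlier
d_max = np.array([16.0, 28.0])   # maximum tolerable RTTs D, ms

loss = imitator_loss(r_hat, d_hat, r_obs, d_obs, d_max, alpha=0.01, beta=2.0)
print(loss)
```

Here the outlier RTT of 500 ms is clipped to $\beta\mathrm{D}=56$ ms before the squared error is taken, so a single extreme observation does not dominate the loss.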

Offline Q-network training: Based on the imitators, the Q-network can be trained in each performance region separately. Particularly, in the $t$-th scheduling period of offline training with the $k$-th imitator ($\forall t,k$), given the scheduling action, the outputs of the imitator are treated as the QoS observations in the $k$-th performance region, which are then used to update the extended system state of the $(t+1)$-th scheduling period at the input of the Q-network. The Q-network is also updated in the above iterative procedure according to the Q-learning method [11]. The loss function $L^{q}$ is defined in (14), where $Q(\cdot,\cdot;\boldsymbol{\theta}^{q}_{t})$ represents the Q-network with parameters $\boldsymbol{\theta}^{q}_{t}$ in the $t$-th scheduling period, and $\boldsymbol{\theta}^{q,-}_{t}$ denotes the parameters of the target network as in [11].
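The temporal-difference loss in (14) can be illustrated for a single transition as follows. This is a minimal numerical sketch rather than the training code of the paper: the arrays stand in for the local Q-values produced by the online and target networks, and the cost, discount factor, and action indices are arbitrary assumptions.

```python
import numpy as np

def td_loss(cost, q_now, q_next_target, actions, gamma):
    """Single-sample version of Eq. (14).

    q_now[i, a]         : Q^i(S_hat_t, a) from the online network
    q_next_target[i, a] : Q^i(S_hat_{t+1}, a) from the target network
    actions[i]          : local action taken by device i in period t
    """
    # Target: cost plus the discounted sum of per-device minimum next Q-values.
    target = cost + gamma * q_next_target.min(axis=1).sum()
    # Prediction: sum of the local Q-values of the actions actually taken.
    predicted = q_now[np.arange(len(actions)), actions].sum()
    return (target - predicted) ** 2

# Two devices with two local actions each (illustrative numbers).
q_now = np.array([[1.0, 2.0], [3.0, 4.0]])
q_next_target = np.array([[0.5, 1.0], [2.0, 1.5]])
actions = np.array([0, 1])

loss = td_loss(cost=2.0, q_now=q_now, q_next_target=q_next_target,
               actions=actions, gamma=0.9)
print(loss)
```

With these numbers the target is $2.0+0.9\times(0.5+1.5)=3.8$ and the prediction is $1.0+4.0=5.0$, so the squared error is $1.44$. In offline training, the next-period state would be built from the imitator outputs, as described above.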

To explore the action space efficiently, an upper confidence bound (UCB) based exploration policy is introduced to determine the scheduling action in the offline training of the Q-network. Taking the $t$-th scheduling period with the $k$-th imitator as an example, we first define the UCB of the action $\mathcal{A}^{i}$ of the $i$-th device as

$$UCB_{t}(k,\mathcal{A}^{i})=Q_{t}^{i}(\hat{\mathcal{S}}_{t},\mathcal{A}^{i};\boldsymbol{\theta}^{q}_{t})+\sqrt{\frac{4\eta\ln t}{T_{t}(k,\mathcal{A}^{i})}}, \tag{17}$$

where $T_{t}(k,\mathcal{A}^{i})$ counts the number of times the action $\mathcal{A}^{i}$ has been taken up to the $t$-th scheduling period. The hyper-parameter $\eta$ balances exploration and exploitation. As a result, the scheduling action is determined as follows:

$$\mathcal{A}_{t}^{i}=\begin{cases}\operatorname*{arg\,min}_{\mathcal{A}^{i}}UCB_{t}(k,\mathcal{A}^{i})&\text{with probability }1-\epsilon_{t},\\ \mathcal{A}^{i}\sim\mathrm{Unif}(\mathscr{A}^{i})&\text{with probability }\epsilon_{t},\end{cases} \tag{18}$$

where $\mathrm{Unif}(\mathscr{A}^{i})$ is the uniform distribution over the action space $\mathscr{A}^{i}$ of the $i$-th device, and the exploration rate $\epsilon_{t}$ should satisfy $\lim_{t\to\infty}\epsilon_{t}=0$.
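The selection rule in (17) and (18) can be sketched for one device as below. The code follows the sign of the bonus term exactly as printed in (17). Initializing every visit count at $1$ to avoid a division by zero and using the schedule $\epsilon_{t}=1/t$ (which satisfies the limit condition) are assumptions of this sketch, as are the function names and the numeric values.

```python
import numpy as np

def ucb_scores(q_local, counts, t, eta):
    """Eq. (17): local Q-value plus the term sqrt(4 * eta * ln t / T_t)."""
    return q_local + np.sqrt(4.0 * eta * np.log(t) / counts)

def select_action(q_local, counts, t, eta, eps, rng):
    """Eq. (18): argmin of the UCB score w.p. 1 - eps, uniform action w.p. eps."""
    if rng.random() < eps:
        return int(rng.integers(len(q_local)))
    return int(np.argmin(ucb_scores(q_local, counts, t, eta)))

# Illustrative local Q-values and visit counts for three local actions.
q_local = np.array([1.0, 0.5, 0.9])
counts = np.array([10.0, 1.0, 10.0])  # counts start at 1 (assumption)
rng = np.random.default_rng(0)

# Deterministic branch (eps = 0) at t = 10.
chosen = select_action(q_local, counts, t=10, eta=0.5, eps=0.0, rng=rng)

# Exploration branch with the assumed schedule eps_t = 1 / t.
t = 10
explored = select_action(q_local, counts, t=t, eta=0.5, eps=1.0 / t, rng=rng)
counts[explored] += 1  # update T_t for the action that was taken
print(chosen, explored)
```

With these numbers the scores are approximately $1.68$, $2.65$, and $1.58$, so the deterministic branch selects action index 2, whereas the plain minimizer of the local Q-values is action index 1.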

V-B Online Q-Network Training

The online Q-network training with the same loss function as in (14) could be applied to further improve the performance of the proposed ReinWiFi system. Particularly, in the t𝑡titalic_t-th scheduling period of the online stage, the scheduling action of the i𝑖{i}italic_i-th device, denoted as 𝒜tisuperscriptsubscript𝒜𝑡𝑖\mathcal{A}_{t}^{i}caligraphic_A start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT, is determined by the ϵitalic-ϵ\epsilonitalic_ϵ-greedy policy as follows:

𝒜ti={argminQi(𝒮^t,𝒜i;𝜽t)with probability 1ϵt,𝒜iUnif(𝒜i)with probability ϵt,superscriptsubscript𝒜𝑡𝑖casesargminsuperscript𝑄𝑖subscript^𝒮𝑡superscript𝒜𝑖subscript𝜽𝑡with probability 1subscriptitalic-ϵ𝑡similar-tosuperscript𝒜𝑖Unifsuperscript𝒜𝑖with probability subscriptitalic-ϵ𝑡\mathcal{A}_{t}^{i}=\begin{cases}\operatorname*{arg\,min}Q^{i}(\hat{\mathcal{S% }}_{t},\mathcal{A}^{i};\boldsymbol{\theta}_{t})&\text{with probability }1-% \epsilon_{t},\\ \mathcal{A}^{i}\sim\mathrm{Unif}(\mathscr{A}^{i})&\text{with probability }% \epsilon_{t},\end{cases}caligraphic_A start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT = { start_ROW start_CELL start_OPERATOR roman_arg roman_min end_OPERATOR italic_Q start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT ( over^ start_ARG caligraphic_S end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , caligraphic_A start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT ; bold_italic_θ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) end_CELL start_CELL with probability 1 - italic_ϵ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , end_CELL end_ROW start_ROW start_CELL caligraphic_A start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT ∼ roman_Unif ( script_A start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT ) end_CELL start_CELL with probability italic_ϵ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , end_CELL end_ROW (19)

where $\epsilon_{t}$ and $\mathrm{Unif}(\mathscr{A}^{i})$ are defined in (18).
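As a minimal sketch of the $\epsilon$-greedy selection in (19), the following Python snippet picks the action with the smallest Q-value with probability $1-\epsilon_{t}$ and a uniformly sampled action otherwise. The function names, the callable used to look up Q-values, and the decay schedule $\epsilon_{t}=\epsilon_{0}/(1+ct)$ are illustrative assumptions; the text only requires that $\epsilon_{t}\to 0$.

```python
import random
from typing import Callable, Sequence, Tuple

# Hypothetical action type: (CW size, throughput limit in Mbps).
Action = Tuple[int, float]


def epsilon_schedule(t: int, eps0: float = 1.0, decay: float = 0.01) -> float:
    """Assumed schedule eps_t = eps0 / (1 + decay * t), which tends to 0 as t grows."""
    return eps0 / (1.0 + decay * t)


def epsilon_greedy(
    q_value: Callable[[Action], float],
    actions: Sequence[Action],
    eps: float,
    rng: random.Random,
) -> Action:
    """Select an action following the two cases of (19)."""
    if rng.random() < eps:
        # Exploration: sample uniformly from the action space.
        return rng.choice(actions)
    # Exploitation: (19) uses arg min, so take the smallest Q-value.
    return min(actions, key=q_value)
```

In an online loop, `epsilon_schedule(t)` would be evaluated once per scheduling period and passed as `eps`, so that exploration fades as training proceeds.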

VI Experiments

The proposed ReinWiFi system is implemented in a WiFi network with one HONOR XD30 AP and $3$ UEs, each equipped with a TP-Link TL-WDN6200 USB WiFi adapter. The source code of the implementation is available online at https://github.com/QianrenLi/ReinWiFi. Denote the AP as $u_{0}$ and the three UEs as $u_{1},u_{2},u_{3}$, respectively. The network operates on the 5 GHz WiFi band following the IEEE 802.11ac specification. The real-time controller is implemented on a laptop with an Intel Core i7-8750H CPU and the Ubuntu 20.04 operating system. An Ethernet connection with a maximum data rate of $1$ Gbps is employed to facilitate communication between the controller and the AP. Moreover, we implement a Linux module to adapt the CW sizes of the TL-WDN6200 adapters in real time from user space. Hence, the controller can collect the QoS observations from the UEs and notify them of the scheduling actions via WiFi, such that the UEs' transmission scheduling can be adjusted accordingly.

Both file delivery tasks and delay-sensitive tasks are tested in the experiment. The former tasks, with a sufficient backlog, are transmitted with the BE priority. The latter tasks, consisting of two types, are delivered with the VI priority. The data rates of the type I and type II delay-sensitive tasks are $\lambda_{1}=50$ Mbps and $\lambda_{2}=25$ Mbps, respectively. The packet arrival intervals of the two types are both $16$ ms. Moreover, the maximum tolerable RTTs are $16$ ms and $28$ ms, respectively. The universal set of communication tasks tested in the experiment includes a delay-sensitive task with arrival data rate $\lambda_{1}$ (Task 1) and a file delivery task (Task 2) on the $(u_{1},u_{0})$-th link; a delay-sensitive task with arrival data rate $\lambda_{2}$ (Task 3) on the $(u_{2},u_{0})$-th link; and a delay-sensitive task with arrival data rate $\lambda_{2}$ (Task 4) on the $(u_{3},u_{0})$-th link.
The quality of the $(u_{1},u_{0})$-th, $(u_{2},u_{0})$-th, and $(u_{3},u_{0})$-th links depends on their distances and the propagation environment, which could be changed in the experiment.
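To give a sense of scale, the amount of data per arrival follows directly from the stated rates and the $16$ ms arrival interval. The calculation below is a worked example under the assumptions that every arrival carries the same amount of data and that $1$ Mbit $=10^{6}$ bits:

$$\lambda_{1}\times 16\ \text{ms}=50\ \text{Mbps}\times 0.016\ \text{s}=0.8\ \text{Mbit}=100\ \text{kB},\qquad \lambda_{2}\times 16\ \text{ms}=25\ \text{Mbps}\times 0.016\ \text{s}=0.4\ \text{Mbit}=50\ \text{kB}.$$

Under these assumptions, a type I task must complete the round trip for each $100$ kB arrival within its $16$ ms RTT bound, that is, within one arrival interval, whereas a type II task has $28$ ms for each $50$ kB arrival.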

In the experiment, the duration of the scheduling period is $1$ second, the CW size takes values from $\{2^{k}-1\mid k=1,2,\ldots,10\}$, and the throughput limitation takes values from $\{\frac{k}{20}r_{i,j}^{m,\max}\mid k=0,1,\ldots,20\}$, where $r_{i,j}^{m,\max}=600$ Mbps. Moreover, in addition to the background interference, the interfering traffic between two interfering UEs, denoted as $u_{4}$ and $u_{5}$, is generated with a random data rate and BE priority in the same channel.
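For concreteness, the two discrete value sets described above can be enumerated as follows. This is only a sketch: the variable names are hypothetical, and pairing one CW size with one throughput limit into a joint action is an assumption made for illustration rather than a statement about how the framework assigns parameters to tasks.

```python
from itertools import product

R_MAX_MBPS = 600  # maximum throughput limitation stated in the text

# CW sizes: 2^k - 1 for k = 1, ..., 10.
cw_sizes = [2**k - 1 for k in range(1, 11)]

# Throughput limits: (k / 20) * R_MAX for k = 0, ..., 20, i.e. steps of 30 Mbps.
throughput_limits = [k * R_MAX_MBPS / 20 for k in range(21)]

# Assumed joint action space: every (CW size, throughput limit) pair.
action_space = list(product(cw_sizes, throughput_limits))
```

The CW sizes range from 1 to 1023, the throughput limits range from 0 to 600 Mbps in 30 Mbps steps, and the assumed joint space contains $10\times 21=210$ pairs.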

The preliminary observation dataset $\mathscr{S}^{s}$ is collected from the following three different traffic patterns (TPs): (1) Tasks 1 and 2 are activated; (2) Tasks 1, 2, and 3 are activated; and (3) Tasks 1, 2, 3, and 4 are activated. In all the TPs, the communication distances of the links are altered to exploit the diversity of link rates. In the collection of $\mathscr{S}^{s}$, the testing scheduling action $\mathcal{A}^{p}$, in which the CW size and the throughput limitation are $7$ and $300$ Mbps, respectively, is first applied in the first half of the scheduling period. Then, a randomized action is applied in the second half. QoS observations of both actions are collected in each scheduling period.
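The per-period collection procedure can be sketched as below. The callable `apply_and_measure`, the dictionary layout of a sample, and the uniform sampling of the randomized action are all assumptions for illustration; the text only states that the testing action is applied in the first half of the period and a randomized action in the second half.

```python
import random
from typing import Callable, Dict, Sequence, Tuple

Action = Tuple[int, float]  # hypothetical: (CW size, throughput limit in Mbps)
TEST_ACTION: Action = (7, 300.0)  # testing action stated in the text


def collect_period(
    apply_and_measure: Callable[[Action, float], Dict[str, float]],
    action_space: Sequence[Action],
    rng: random.Random,
    period_s: float = 1.0,
) -> Dict[str, object]:
    """Collect one sample: testing action first, randomized action second."""
    half = period_s / 2
    # First half of the period: apply the fixed testing action.
    test_qos = apply_and_measure(TEST_ACTION, half)
    # Second half: apply a randomized action (uniform sampling is assumed).
    random_action = rng.choice(action_space)
    random_qos = apply_and_measure(random_action, half)
    return {
        "test_action": TEST_ACTION,
        "test_qos": test_qos,
        "random_action": random_action,
        "random_qos": random_qos,
    }
```

Here `apply_and_measure(action, duration)` stands in for whatever mechanism configures the devices and returns the QoS measured over `duration` seconds.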

Based on the dataset $\mathscr{S}^{s}$, the performance of the three TPs is quantized into 3, 6, and 6 regions, respectively. Then, $15$ QoS imitators, one for each region ($3+6+6=15$), are trained according to Section V with $\alpha=1$ and $\beta=3$. Given the trained QoS imitators, the Q-network is further trained as elaborated in Section V with $\omega=1/r^{m,\max}_{i,j}$.

To demonstrate the performance gain, the proposed framework is compared with two baselines. The first baseline, namely Standard EDCA, relies on the conventional 802.11 EDCA protocol. The second baseline, namely Rate Control Only, adapts the throughput limitation of file delivery tasks via the proposed framework, with the CW sizes following the 802.11 EDCA protocol. The performance evaluation and comparison are conducted in $11$ distinct test scenarios listed in Table I, where only the first $5$ scenarios have been measured in the preliminary observation dataset $\mathscr{S}^{s}$.

| Scenario | TP | Link Rate (Mbps) |
|---|---|---|
| 1 | 1 | 563, 499, 572 |
| 2 | 2 | 563, 499, 572 |
| 3 | 3 | 563, 499, 572 |
| 4 | 2 | 563, 370, 572 |
| 5 | 3 | 563, 370, 572 |
| 6 | 3 | 563, 499, 476 |
| 7 | 3 | 563, 424, 572 |
| 8 | 2 | 563, 400, 346 |
| 9 | 3 | 563, 400, 346 |
| 10 | 2 | 459, 499, 572 |
| 11 | 3 | 459, 499, 572 |

TABLE I: Test scenarios, where the link rate refers to the maximum data rates of the $(u_{1},u_{0})$-th, $(u_{2},u_{0})$-th, and $(u_{3},u_{0})$-th links.
Figure 1: Performance comparison in scenarios $1\sim 5$.

The performance comparison of the proposed framework and the two baselines in the first $5$ test scenarios is illustrated in Fig. 1, where online training of the Q-network is applied neither in the proposed framework nor in Baseline 2 (Rate Control Only). It can be observed that the proposed Q-network, trained offline via the imitators, significantly outperforms the conventional EDCA mechanism. Moreover, the performance gain of Baseline 2 over Baseline 1 (Standard EDCA) demonstrates the necessity of the throughput limitation, which has never been investigated in the existing literature.

The performance comparison in test scenarios $6$ to $11$ is illustrated in Fig. 2. Since these test scenarios are not measured in the preliminary observation dataset $\mathscr{S}^{s}$, the performance gain of the proposed scheme over Baseline 1 demonstrates the good generalization capability of the proposed Q-network. It can also be observed that online training could further improve the scheduling performance of the Q-network, which has already been trained in the offline stage.

Figure 2: Performance comparison in scenarios $6\sim 11$.

VII Conclusion

In this paper, a reinforcement-learning-based framework, namely ReinWiFi, is proposed for the application-layer QoS optimization of WiFi networks. Due to the absence of PHY-layer and MAC-layer status, the historical scheduling parameters and QoS observations are considered as the system state in the determination of the current scheduling parameters. Because of the unknown interference and vendor-dependent implementations, a novel Q-network is proposed to track the relation between the system state, the scheduling parameters, and the overall QoS. Moreover, an imitation learning method is introduced to improve the training efficiency. It is demonstrated via the testbed that the proposed framework, with the dynamic adaptation of CW size and throughput limitation, significantly outperforms the conventional EDCA mechanism.

References

  • [1] A. Kumar, G. Verma, C. Rao, A. Swami, and S. Segarra, “Adaptive contention window design using deep $Q$-learning,” in IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP). IEEE, Jun. 2021, pp. 4950–4954.
  • [2] L. Zhang, H. Yin, Z. Zhou, S. Roy, and Y. Sun, “Enhancing WiFi multiple access performance with federated deep reinforcement learning,” in IEEE Veh. Technol. Conf. (VTC Fall). IEEE, Nov. 2020, pp. 1–6.
  • [3] T. Rashid, M. Samvelyan, C. S. De Witt, G. Farquhar, J. Foerster, and S. Whiteson, “Monotonic value function factorisation for deep multi-agent reinforcement learning,” J. Mach. Learn. Res., vol. 21, no. 1, pp. 7234–7284, Jan. 2020.
  • [4] Z. Guo, Z. Chen, P. Liu, J. Luo, X. Yang, and X. Sun, “Multi-agent reinforcement learning-based distributed channel access for next generation wireless networks,” IEEE J. Sel. Areas Commun., vol. 40, no. 5, pp. 1587–1599, May 2022.
  • [5] R. Ali, N. Shahin, Y. B. Zikria, B.-S. Kim, and S. W. Kim, “Deep reinforcement learning paradigm for performance optimization of channel observation–based MAC protocols in dense WLANs,” IEEE Access, vol. 7, pp. 3500–3511, 2018.
  • [6] H. v. Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with double $Q$-learning,” in Proc. AAAI Conf. Artif. Intell., Feb. 2016, pp. 2094–2100.
  • [7] S.-C. Chen, C.-Y. Li, and C.-H. Chiu, “An experience driven design for IEEE 802.11ac rate adaptation based on reinforcement learning,” in Proc. IEEE Int. Conf. Comput. Commun. (INFOCOM), May 2021, pp. 1–10.
  • [8] Wi-Fi Alliance, Wi-Fi Display Technical Task Group, “Wi-Fi display technical specification v1.2n,” 2011.
  • [9] J. MacQueen, “Some methods for classification and analysis of multivariate observations,” in Proc. Berkeley Symp. Math. Statist. Probab., vol. 1, Oakland, CA, USA, 1967, pp. 281–297.
  • [10] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in Proc. Int. Conf. Neural Inf. Process. Syst. (NIPS), Dec. 2017, pp. 6000–6010.
  • [11] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, Feb. 2015.