ReinWiFi: A Reinforcement-Learning-Based Framework for the Application-Layer QoS Optimization of WiFi Networks
Abstract
In this paper, a reinforcement-learning-based scheduling framework is proposed and implemented to optimize the application-layer quality-of-service (QoS) of a practical wireless local area network (WLAN) suffering from unknown interference. Particularly, application-layer tasks of file delivery and delay-sensitive communication, e.g., screen projection, in a WLAN with enhanced distributed channel access (EDCA) mechanism, are jointly scheduled by adjusting the contention window sizes and application-layer throughput limitation, such that their QoS, including the throughput of file delivery and the round trip time of the delay-sensitive communication, can be optimized. Due to the unknown interference and vendor-dependent implementation of the network interface card, the relation between the scheduling policy and the system QoS is unknown. Hence, a reinforcement learning method is proposed, in which a novel Q-network is trained to map from the historical scheduling parameters and QoS observations to the current scheduling action. It is demonstrated on a testbed that the proposed framework can achieve a significantly better QoS than the conventional EDCA mechanism.
I Introduction
Reinforcement learning (RL) for radio resource management has been attracting tremendous attention since it is a promising technique to tackle unknown system statistics and solve the prohibitive policy optimization problem with tolerable complexity and good performance. Moreover, the RL technique also has great potential to optimize a wireless system even without accurate or complete observation of the system state, which might happen in practical implementations.
A significant number of works have optimized the throughput, delay, or age-of-information (AoI) of wireless networks via RL. Most of these works assumed full knowledge of the system state in algorithm design, which is applicable to systems where the global system state can be collected at a centralized controller. On the other hand, RL was also utilized to optimize the performance of wireless systems with distributive transmission scheduling, e.g., wireless fidelity (WiFi) systems. For instance, an adaptive channel contention mechanism was proposed for WiFi systems in [1], where a local RL agent was deployed at each user equipment (UE). The local agents adjusted the minimum contention window (MCW) size according to the global statistics of successful channel contention such that the transmission fairness among the agents could be ensured. Instead of global statistics, a distributive RL algorithm with the assistance of federated learning was proposed in [2] to adapt the channel contention according to the local channel state, such that the local throughput was optimized. Moreover, a deep multi-agent RL technique based on the QMIX algorithm [3] was proposed in [4] to improve network throughput while maintaining user fairness. In that work, the channel contention decision was made according to the history of the last transmission duration. In order to resolve the collision issue of the distributive channel access, deep RL algorithms were proposed in [5] to determine the timing of doubling the contention window based on the estimated collision probability. In addition to the adaptive channel contention, a double deep Q-network (DDQN) [6] based rate adaptation algorithm was proposed in [7] to improve network throughput, where the agent learned the optimal transmission rate based on the modulation and coding scheme (MCS) and frame loss rate.
Most of the above literature assumed knowledge of the physical (PHY) layer and media access control (MAC) layer system states. However, it might be challenging to obtain such knowledge in the scheduler design of a practical WiFi network. Moreover, the absence of knowledge on co-channel interference and the vendor-dependent implementation of WiFi adapters would also raise challenges in the optimization of scheduling policies.
In this paper, we would like to shed some light on the RL-based scheduling design for practical WiFi systems suffering from unknown co-channel interference. Particularly, a framework, namely ReinWiFi, is proposed for the scheduling of delay-sensitive communication tasks and file delivery tasks in the application layer of a WiFi network. In ReinWiFi, a controller periodically collects the past scheduling parameters and average quality-of-service (QoS) observations of all the application-layer tasks, and determines the rate limitation and contention window size for all the transmitters, such that the total throughput of file delivery tasks is maximized and the latency requirements of delay-sensitive tasks are ensured. The experiments show that the proposed framework can adapt to the variation of task number, interfering traffic, and link quality, and significantly outperforms the conventional EDCA mechanism.
II System Model
II-A Deployment Scenario
The proposed ReinWiFi system is deployed in a WiFi network with multiple connected access points (APs) and UEs working on the same channel. Denote the number of the devices, including the APs and UEs, in the WiFi network as , the set of these devices as , and the communication link from the -th device to the -th one as the -th link (). The communication links can be from UE to AP, from AP to UE, or between UEs (i.e., WiFi Direct). We define as the set of all communication links in the system and as the set of communication links from the -th device. As a remark, one UE could simultaneously maintain the communication links to the AP and other UEs, where the transmission of the infrastructure and WiFi Direct modes is separated in the time domain.
The data traffic generated by the applications of UEs in is referred to as communication tasks in this paper. For example, the application projecting the screen of a mobile phone to a laptop via WiFi Direct will raise a delay-sensitive task, e.g., Miracast [8], where an application-layer packet (i.e., video frame) is generated and delivered periodically (the typical period is ms). Moreover, file sharing between two devices will raise a file delivery task. For convenience of elaboration, we define and as the universal sets of file delivery tasks and delay-sensitive tasks on the -th link, respectively. A task is in the inactive state if there is no packet arrival or buffered file at the transmitter.
Because of the transmission latency constraint, the delay-sensitive tasks should be scheduled with higher priority than the file delivery ones. Hence, all the transmitters access the channel via the enhanced distributed channel access (EDCA) mechanism defined in IEEE 802.11e. Particularly, four access category (AC) queues, namely voice (VO), video (VI), best effort (BE), and background (BK), are adopted at all the transmitters. The transmission priorities of the four AC queues are differentiated by the values of arbitration inter-frame spacing (AIFS) and contention window (CW) size. As in practical systems, the file delivery tasks are scheduled with the BE priority, and the delay-sensitive tasks are scheduled with the VI priority. The latter has smaller AIFS and CW size, leading to a higher success probability in channel contention. As a remark, due to the distributive channel contention mechanism, it is infeasible to accurately control the packet transmission order among the devices of a WiFi network with commercial network interface cards (NICs). Instead, the packet transmission in the ReinWiFi system is scheduled in a stochastic manner by adapting the CW sizes of AC queues in each device.
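The effect of the CW size on channel contention can be illustrated with a small Monte Carlo sketch. It is a deliberately simplified model and not the EDCA procedure itself: it assumes two stations that each draw a uniform backoff in [0, CW], and it ignores AIFS, collisions, retransmissions, and CW doubling. The function name `win_probability` and the CW values 7 and 31 are illustrative assumptions.

```python
import random


def win_probability(cw_a, cw_b, trials=20000, seed=0):
    """Estimate how often station A draws a strictly smaller backoff than station B.

    Simplified model (assumption): each station draws one uniform integer
    backoff in [0, CW]; AIFS, collisions, and retries are ignored.
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        backoff_a = rng.randint(0, cw_a)
        backoff_b = rng.randint(0, cw_b)
        if backoff_a < backoff_b:
            wins += 1
    return wins / trials


# Station A with a small CW against a station with a large CW.
p_small_cw = win_probability(cw_a=7, cw_b=31)
# Both stations with the same CW.
p_equal_cw = win_probability(cw_a=31, cw_b=31)
```

Under this toy model, the station with the smaller CW wins the contention in most trials, whereas equal CW sizes give each station a chance slightly below one half because ties are not counted as wins. This is the lever that the scheduler adjusts when it adapts the CW sizes of the AC queues.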
There are some other WiFi networks sharing the same channel in the coverage of the considered network. The traffic in these networks would degrade the QoS of the considered network, e.g., larger delivery latency and lower throughput. Denote the set of devices in the interfering networks as . The communications among the devices in , namely interfering traffic, cannot be scheduled by the ReinWiFi system. Instead, the ReinWiFi system is designed to deduce the interference level and adjust the transmission accordingly.
II-B Task Queuing Model
For each file delivery task, all the information bits to be delivered are saved in an application-layer buffer, and a user datagram protocol (UDP) socket is established at the very beginning of transmission. The data dispatch from the buffer to the UDP socket is controlled by a dispatcher. The UDP socket encapsulates the data received from the dispatcher into UDP datagrams and forwards them to the NIC driver for WiFi transmission. As a remark, the new datagrams at the NIC may not be transmitted immediately. In fact, each NIC maintains four MAC-layer AC queues associated with the four transmission priorities, respectively. The arriving datagrams are saved in the corresponding queues and transmitted following the vendor’s protocol. The queuing status of the NIC is usually not accessible in the application layer. Thus, the proposed system cannot know when the NIC has completely delivered a datagram, and it therefore cannot precisely control the transmission of a UDP datagram or an application-layer packet. As a result, the scheduling of the proposed system is designed based on the average observable performance in the application layer.
Specifically, the transmission time is organized into a sequence of scheduling periods, each with a duration of seconds. is sufficiently large to accommodate a number of MAC protocol data unit transmissions. Due to the invisibility of NIC status, the QoS of a file delivery task is measured by its application-layer throughput in one scheduling period. Particularly, for the -th file delivery task of the -th link, its QoS in the -th scheduling period is defined as the number of information bits transferred from the task buffer to the associated UDP socket. The dispatcher is designed to adaptively limit the throughput of the file delivery task such that delay-sensitive tasks have a larger chance to access the channel. Hence, letting be the throughput limitation of the -th file delivery task of the -th link in the -th scheduling period, the dispatcher ensures that
(1) |
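The throughput limitation enforced by the dispatcher can be sketched as a per-period bit budget. The class below is a hypothetical illustration and not the released implementation: the names `ThroughputLimitedDispatcher`, `start_period`, and `dispatch`, as well as the choice of expressing the limit in bits per second, are assumptions.

```python
class ThroughputLimitedDispatcher:
    """Hypothetical dispatcher that caps the bits handed to the UDP socket per period."""

    def __init__(self, period_s, limit_bps):
        self.period_s = period_s    # duration of one scheduling period (seconds)
        self.limit_bps = limit_bps  # throughput limitation for the current period
        self.sent_bits = 0          # bits dispatched so far in the current period

    def start_period(self, limit_bps):
        """Apply the limitation chosen by the controller for a new period."""
        self.limit_bps = limit_bps
        self.sent_bits = 0

    def dispatch(self, buffered_bits):
        """Return how many of the buffered bits may be forwarded right now."""
        budget = int(self.limit_bps * self.period_s) - self.sent_bits
        granted = max(0, min(buffered_bits, budget))
        self.sent_bits += granted
        return granted

    def throughput_bps(self):
        """Application-layer throughput observed in the current period."""
        return self.sent_bits / self.period_s


dispatcher = ThroughputLimitedDispatcher(period_s=1.0, limit_bps=1000)
first = dispatcher.dispatch(600)   # fits within the budget
second = dispatcher.dispatch(600)  # truncated to the remaining budget
third = dispatcher.dispatch(1)     # budget exhausted
```

With this bookkeeping, the observed throughput of a period can never exceed the limitation set at the start of that period, which is the constraint the dispatcher is required to satisfy.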
For each delay-sensitive task, a task queue and UDP socket are established at the very beginning. The application-layer packets arrive at the task queue periodically with a fixed average data rate. The first packet in the queue is forwarded to the UDP socket for WiFi transmission as long as the socket is idle. Due to the lack of MAC-layer status, the measurement of the transmission latency of a packet could hardly be accurate. Hence, we use the round-trip time (RTT) as the QoS measurement of delay-sensitive communication tasks. Particularly, for each delay-sensitive task, an acknowledgment will be sent back from the receiver to the transmitter when an application-layer packet is completely received. Hence, the transmitter can calculate the RTTs of all packet transmissions. For the -th delay-sensitive communication task of the -th link (), its QoS in the -th scheduling period is defined as the average RTT of the packets transmitted in this scheduling period.
II-C Scheduling Model
Denoting the CW sizes of the VI and BE priorities of the -th device in the -th scheduling period as and , respectively, we focus on the joint scheduling of these channel contention parameters as well as the dispatchers’ throughput limitation in each scheduling period.
Particularly, each transmitter collects the QoS observations of its tasks at the end of each scheduling period and delivers them to a centralized controller, which can be implemented in an AP or other device. Not all the tasks in the universal task sets are in the active state. The average RTTs and throughputs of the inactive delay-sensitive and file delivery tasks are represented by a sufficiently large value and , respectively. Hence, the aggregation of QoS observations received at the controller at the end of the -th scheduling period can be represented as
(2) |
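The aggregation of QoS observations, including the placeholders for inactive tasks, can be sketched as follows. The placeholder values are assumptions: the text only states that inactive delay-sensitive tasks are represented by a sufficiently large RTT, so `INACTIVE_RTT_MS` and `INACTIVE_THROUGHPUT_BPS` are hypothetical, as are the function name and the ordering of the output vector.

```python
INACTIVE_RTT_MS = 1.0e4        # hypothetical "sufficiently large" RTT placeholder
INACTIVE_THROUGHPUT_BPS = 0.0  # hypothetical placeholder for inactive file tasks


def aggregate_observation(file_bits, rtt_samples_ms, period_s):
    """Build one observation vector for a scheduling period.

    file_bits: dict mapping file-task id to bits dispatched, or None if inactive.
    rtt_samples_ms: dict mapping delay-sensitive task id to a list of RTT
        samples, or None (or an empty list) if inactive.
    Returns throughputs (sorted by task id) followed by average RTTs.
    """
    throughputs = []
    for task_id in sorted(file_bits):
        bits = file_bits[task_id]
        throughputs.append(
            INACTIVE_THROUGHPUT_BPS if bits is None else bits / period_s
        )
    rtts = []
    for task_id in sorted(rtt_samples_ms):
        samples = rtt_samples_ms[task_id]
        rtts.append(sum(samples) / len(samples) if samples else INACTIVE_RTT_MS)
    return throughputs + rtts


observation = aggregate_observation(
    file_bits={"f1": 2_000_000, "f2": None},
    rtt_samples_ms={"d1": [10.0, 20.0], "d2": None},
    period_s=1.0,
)
```

Keeping a fixed-length vector over the universal task sets lets the controller use the same input dimension regardless of which tasks are currently active.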
Due to the time-varying traffic of the interfering devices, the scheduling parameters, including the file throughput limitations and CW sizes, are adapted at the centralized controller in each scheduling period according to the system’s scheduling parameters and QoS observations in the past scheduling periods. Specifically, the aggregation of scheduling parameters over a period is represented as
(3) |
Thus, at the very beginning of the -th scheduling period, () is determined based on past scheduling parameters and QoS observations .
III Problem Formulation
The proposed ReinWiFi system should successively make scheduling decisions for each scheduling period. Hence, it is formulated as a Markov decision process (MDP) in the following.
Definition 1 (System State)
In the -th scheduling period (), the system state is defined as the aggregation of the QoS observations and scheduling parameters of the past scheduling periods. Thus, .
Definition 2 (Scheduling Action and Policy)
Denote defined in (3) as the scheduling action in the -th scheduling period, , as the local scheduling action of the -th device in the -th scheduling period. The scheduling policy is a mapping from state space to action space as .
Moreover, the system cost of the -th scheduling period is defined as
(4) |
where is a weight, and is the maximum tolerable RTT of the -th delay-sensitive task on the -th link. The indicator function is 1 if the event is true, and 0 otherwise. Then, the average overall system cost is defined as the discounted summation of average system costs for all the scheduling periods, i.e.,
(5) |
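Since the exact expressions of the per-period cost and the overall cost are not reproduced here, the following sketch shows only one form that is consistent with the description: the total file throughput is rewarded (negative cost), each delay-sensitive task whose average RTT exceeds its maximum tolerable RTT is penalized by a weight, and the per-period costs are summed with a discount factor. The function names, the sign convention, and the numbers are assumptions.

```python
def period_cost(throughputs, rtts, rtt_limits, weight):
    """Assumed cost of one period: negative total throughput plus weighted RTT violations."""
    violations = sum(1 for rtt, limit in zip(rtts, rtt_limits) if rtt > limit)
    return -sum(throughputs) + weight * violations


def discounted_cost(costs, gamma):
    """Discounted summation of per-period costs with discount factor gamma."""
    total = 0.0
    for t, cost in enumerate(costs):
        total += (gamma ** t) * cost
    return total


# Two file tasks and two delay-sensitive tasks; the second RTT violates its limit.
example_cost = period_cost(
    throughputs=[100.0, 50.0],
    rtts=[10.0, 30.0],
    rtt_limits=[20.0, 20.0],
    weight=500.0,
)
example_total = discounted_cost([1.0, 1.0, 1.0], gamma=0.5)
```

With a sufficiently large weight, a single RTT violation outweighs the throughput gain, which reflects the priority of the latency requirement over file throughput.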
For convenience of elaboration, it is assumed that the system has run for at least scheduling periods before the first scheduling period, such that there are sufficient QoS observations in the system state. As a result, the controller design of the ReinWiFi system can be formulated as
(6) |
The Bellman’s equations for the above MDP are given by
(7) |
where is the Q-function with system state and action . Moreover, the optimal scheduling is given by
(8) |
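For reference, the generic form of the Bellman equation and the optimal policy of a discounted cost-minimization MDP is recalled below. The notation is assumed for illustration and may differ from the symbols of this paper: $s_t$ is the system state, $a_t$ the scheduling action, $g_t$ the system cost of the $t$-th scheduling period, and $\gamma$ the discount factor.

```latex
Q(s_t, a_t) = \mathbb{E}\!\left[\, g_t + \gamma \min_{a'} Q(s_{t+1}, a') \,\middle|\, s_t, a_t \right],
\qquad
\pi^{*}(s_t) = \arg\min_{a} Q(s_t, a).
```

The minimization, rather than the maximization used in reward-based formulations, follows from the system being described by a cost.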
Given the past scheduling actions and QoS observations (i.e., the system state), it is still difficult to accurately predict the relation between the scheduling action and task QoS in the current scheduling period. This is mainly because of the unknown interfering traffic and random channel contention. As a result, it is impossible to solve the above Bellman’s equations without any trial on the network performance. In this paper, we shall rely on the RL method to track the above unknown knowledge with the assistance of a preliminary observation dataset .
Particularly, before the optimization, the dataset is collected from scheduling periods experiencing heterogeneous interfering traffic and link quality (e.g., the distances of links in change due to mobility). Each of the scheduling periods (say the -th one) is divided into two phases. In the first phase, a fixed testing scheduling action is applied, and corresponding QoS observation is obtained; in the second phase, a random scheduling action according to certain distribution is applied, and another QoS observation is obtained. Hence, the dataset can be expressed as .
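The two-phase collection of the preliminary dataset can be sketched as a simple loop. This is a hypothetical outline: `measure` stands for whatever routine applies an action on the testbed and returns the QoS observation, the uniform choice of the random action is an assumption (the text only says "certain distribution"), and the record layout is illustrative.

```python
import random


def collect_dataset(num_periods, test_action, action_space, measure, seed=0):
    """Collect (test observation, random action, random observation) records."""
    rng = random.Random(seed)
    dataset = []
    for _ in range(num_periods):
        # Phase 1: apply the fixed testing action and record its QoS observation.
        test_obs = measure(test_action)
        # Phase 2: apply a randomly drawn action (uniform draw is an assumption).
        rand_action = rng.choice(action_space)
        rand_obs = measure(rand_action)
        dataset.append((test_obs, rand_action, rand_obs))
    return dataset


def fake_measure(action):
    """Stand-in for a testbed measurement, used only to exercise the loop."""
    return {"throughput_mbps": 0.5 * action["limit_mbps"]}


actions = [{"limit_mbps": 100}, {"limit_mbps": 200}]
records = collect_dataset(4, {"limit_mbps": 50}, actions, fake_measure)
```

Pairing every random action with a testing-action observation from the same period is what later allows each record to be assigned to a performance region.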
IV Q-Network for Online Scheduling
In this section, a novel Q-network design is proposed to approximate the Q-function. In order to accelerate the convergence of training and improve the scheduling performance, all the possible system performance of one scheduling period is divided into regions, and the inputs of the Q-network include not only the system state but also the performance region indices of the past scheduling periods.
Hence, the utilization of the proposed Q-network in the transmission scheduling can be divided into two stages. In the first stage, namely the offline stage, the performance regions are trained via the preliminary observation dataset , and the Q-network is then trained via in all the performance regions respectively. In the second stage, namely the online stage, the Q-network is applied to the transmission scheduling and fine-tuned according to the online QoS observations.
In this section, the performance region quantization is introduced first, followed by the structure of the Q-network. The hybrid offline and online training of Q-network is elaborated in Section V.
IV-A Performance Region Quantization
The QoS observations with the testing scheduling action are first extracted from the preliminary observation dataset as . The K-means clustering method [9] is then adopted to group the QoS observations in into clusters. Denote the mean and variance of the observed throughputs (for the file delivery tasks) in as and , respectively, and the mean and variance of the RTTs (for the delay-sensitive tasks) as and , respectively. The performance region quantization can be achieved by finding the cluster centers of the QoS observations in as follows:
(9) |
where denotes the vectorization of the normalized QoS observations in . Particularly, , where the row vector vectorizes the normalized throughputs of all file delivery tasks in ,
and the row vector vectorizes the normalized RTTs of all the delay-sensitive tasks in ,
With , the performance region index of a scheduling period can be determined according to
(10) |
where is the aggregation of QoS observations with the testing scheduling action in the scheduling period.
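The quantization step can be illustrated with a compact sketch of Lloyd's K-means iteration followed by a nearest-center lookup. It assumes the QoS observations have already been normalized into plain vectors, and it uses a deterministic farthest-point initialization, which is a choice made here for reproducibility and not something stated in the paper. The function names are hypothetical.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def region_index(point, centers):
    """Index of the cluster center closest to the (normalized) observation."""
    return min(range(len(centers)), key=lambda k: squared_distance(point, centers[k]))


def k_means(points, k, iterations=50):
    """Lloyd's algorithm with farthest-point initialization (an assumption)."""
    centers = [list(points[0])]
    while len(centers) < k:
        farthest = max(
            points, key=lambda p: min(squared_distance(p, c) for c in centers)
        )
        centers.append(list(farthest))
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[region_index(p, centers)].append(p)
        new_centers = []
        for idx, members in enumerate(clusters):
            if members:
                dim = len(members[0])
                new_centers.append(
                    [sum(m[d] for m in members) / len(members) for d in range(dim)]
                )
            else:
                new_centers.append(centers[idx])  # keep the center of an empty cluster
        if new_centers == centers:
            break
        centers = new_centers
    return centers


# Toy normalized observations forming two well-separated groups.
points = [[0.0, 0.0], [0.2, 0.1], [0.1, 0.3], [5.0, 5.0], [5.2, 4.9], [4.9, 5.3]]
centers = k_means(points, k=2)
low_region = region_index([0.3, 0.2], centers)
high_region = region_index([4.5, 5.5], centers)
```

Once the centers are fixed in the offline stage, determining the region of a new scheduling period only requires the nearest-center lookup in `region_index`.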
Remark 1
Note that the QoS observations of the testing scheduling action should be collected to determine the performance region index of one scheduling period. In the online stage, one short slot can be reserved in each scheduling period to apply the testing scheduling action .
IV-B Q-Network Structure
The input of the proposed Q-network is the extended system state of the current scheduling period, which is defined below:
Definition 3 (Extended System State)
In the -th scheduling period () of either offline or online training, the extended system state consists of , where () is the performance region index.
The first part of the Q-network is a multi-head attention layer [10], which is trained to refine the performance region indices in the extended system state. The refined extended system state is then used as the input of the following three fully connected layers with nodes and ReLU activation function sequentially.
In order to address the issue of huge action space, we adopt the following linear approximation structure on the Q-function in the output of the Q-network:
(11) |
where is referred to as the local Q-function of the -th device. Hence, the Q-network output consists of action clusters for devices, respectively. Each action cluster provides the values of the corresponding local Q-function for all possible local actions. As a result, the optimized local action of the -th device () in the -th scheduling period of either offline or online training can be obtained by minimizing the local Q-function, i.e.,
(12) |
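The benefit of the additive structure can be checked with a small sketch: when the global Q-function is a sum of local Q-functions, minimizing each local Q-function separately yields the same joint action as an exhaustive search over the joint action space. The action labels and Q-values below are made up for illustration, and the function names are hypothetical.

```python
from itertools import product


def best_local_actions(local_q):
    """Pick, for every device, the local action with the smallest local Q-value."""
    return [min(q, key=q.get) for q in local_q]


def global_q(local_q, joint_action):
    """Global Q-value under the additive (linear) approximation."""
    return sum(q[a] for q, a in zip(local_q, joint_action))


def brute_force_best(local_q):
    """Exhaustive search over the joint action space, for comparison only."""
    joint_space = product(*[list(q) for q in local_q])
    return min(joint_space, key=lambda joint: global_q(local_q, joint))


# One dictionary of local Q-values per device (illustrative numbers).
local_q = [
    {"cw7": 1.0, "cw15": 0.2},
    {"cw7": -0.5, "cw15": 0.3},
    {"cw7": 0.0, "cw15": -1.0},
]
decomposed = best_local_actions(local_q)
exhaustive = list(brute_force_best(local_q))
```

The per-device search evaluates a number of entries that grows with the sum of the local action-space sizes, whereas the exhaustive search grows with their product, which is why the decomposition addresses the large joint action space.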
V Hybrid Q-Learning
(13) |
(14) |
The Q-network is first trained in the offline stage based on the dataset , then tuned in the online stage.
V-A Offline Imitation Learning and Q-Network Training
To facilitate the offline training, the performance indices are calculated for all the scheduling periods in according to (10). Denoting the performance index of the -th scheduling period in as , the preliminary dataset can be rewritten as
(15) |
for notation convenience. Moreover, dataset can be further divided into subsets as
(16) |
Since the subsets () may not be sufficiently large for the training of the Q-network in all the performance regions, an imitation learning method is introduced. Particularly, we first train deep neural networks (namely imitators), each of which consists of fully connected layers and nodes per layer, to imitate the relation between the scheduling actions and QoS observations in the performance regions, respectively. Denote the imitators as , where is the input action, and represents network parameters. The output of imitator is trained to approximate the QoS observations of the system in the -th performance region with input action . Then, the Q-network can be trained via the imitators.
Imitator training: The -th imitator () is trained by . Let and be the throughput and RTT of the -th file delivery task and -th delay-sensitive task of the -th link in the output of the -th imitator with input action . The loss function is defined as (13), where , and are both weights, and the minimization is to limit the range of RTTs.
Offline Q-network training: Based on the imitators, the Q-network can be trained in each performance region respectively. Particularly, in the -th scheduling period of offline training with the -th imitator (), given the scheduling action, the outputs of the imitator are treated as the QoS observations in the -th performance region, which are then used to update the extended system state of the -th scheduling period in the input of the Q-network. The Q-network is also updated in the above iterative procedure according to the Q-learning method [11]. The loss function is defined in (14), where represents the Q-network parameters in the -th scheduling period, and denotes the parameter of the target network as in [11].
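As a reminder of the Q-learning update with a target network in [11], the sketch below computes the temporal-difference target and the squared loss for a single transition in a cost-minimization setting. It is the textbook form adapted to costs and is not claimed to reproduce the loss in (14); the function names and the numbers are assumptions.

```python
def td_target(cost, next_q_target, gamma):
    """Cost plus discounted minimum of the target network's next-state Q-values."""
    return cost + gamma * min(next_q_target)


def td_loss(q_value, cost, next_q_target, gamma):
    """Squared temporal-difference error for one transition."""
    return (td_target(cost, next_q_target, gamma) - q_value) ** 2


# Q-values that a (hypothetical) target network outputs for the next state.
next_q = [2.0, 0.5, 3.0]
target = td_target(cost=1.0, next_q_target=next_q, gamma=0.9)
loss = td_loss(q_value=1.0, cost=1.0, next_q_target=next_q, gamma=0.9)
```

In the offline stage, the cost and the next state of each transition would come from the imitator outputs instead of testbed measurements.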
In order to efficiently explore the action space, an upper confidence bound (UCB) based exploration policy is introduced to determine the scheduling action in the offline training of the Q-network. Taking the -th scheduling period with the -th imitator as an example, we first define the UCB of the action of the -th device as
(17) |
where counts the number of times the action is taken up to the -th scheduling period. The hyper-parameter is used to balance the exploration and exploitation. As a result, the scheduling action is determined as follows:
(18) |
where is the uniform distribution over action space of -th device and exploration rate should satisfy the limit condition .
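A UCB-style exploration rule combined with random exploration can be sketched as follows. Because the Q-function here represents a cost, the sketch subtracts the confidence bonus from the Q-value and selects the minimizer. This sign convention, the bonus formula, and the handling of untried actions are assumptions and may differ from (17) and (18); the function names are hypothetical.

```python
import math
import random


def ucb_score(q_value, count, t, c):
    """Optimistic (lower) estimate of the cost of an action taken `count` times."""
    if count == 0:
        return float("-inf")  # untried actions are explored first (assumption)
    return q_value - c * math.sqrt(math.log(t) / count)


def select_action(q_values, counts, t, c, epsilon, rng):
    """With probability epsilon pick uniformly at random, else the UCB-optimal action."""
    actions = list(q_values)
    if rng.random() < epsilon:
        return rng.choice(actions)
    return min(actions, key=lambda a: ucb_score(q_values[a], counts[a], t, c))


rng = random.Random(0)
q_values = {"a": 1.0, "b": 1.2}
counts = {"a": 100, "b": 1}
explored = select_action(q_values, counts, t=101, c=1.0, epsilon=0.0, rng=rng)
exploited = select_action(q_values, counts, t=101, c=0.0, epsilon=0.0, rng=rng)
```

With a positive hyper-parameter `c`, the rarely tried action "b" is selected despite its larger Q-value, whereas `c = 0` reduces the rule to pure exploitation.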
V-B Online Q-network Training
The online Q-network training with the same loss function as in (14) could be applied to further improve the performance of the proposed ReinWiFi system. Particularly, in the -th scheduling period of the online stage, the scheduling action of the -th device, denoted as , is determined by the -greedy policy as follows:
(19) |
where and are defined in (18).
VI Experiments
The proposed ReinWiFi system is implemented in a WiFi network with one HONOR XD30 AP and three UEs, each equipped with a TP-Link TL-WDN6200 USB WiFi adapter, in the experiment. (Footnote: The source code of the implementation is available online at https://github.com/QianrenLi/ReinWiFi.) Denote the AP as and the three UEs as , respectively. The network is working on the G WiFi band following the IEEE 802.11ac specification. The real-time controller is implemented in a laptop with Intel Core i7-8750H CPU and Ubuntu 20.04 operating system. An Ethernet connection with a maximum data rate of Gbps is employed to facilitate communication between the controller and the AP. Moreover, we implement a Linux module to adapt the CW sizes of TL-WDN6200 adapters in real-time from user space. Hence, the controller can collect the QoS observations from UEs and notify the scheduling actions via WiFi, such that the UEs’ transmission scheduling can be adjusted accordingly.
Both file delivery tasks and delay-sensitive tasks are tested in the experiment. The former tasks with a sufficient backlog are transmitted with the BE priority. The latter tasks, consisting of two types, are delivered with the VI priority. The data rates of type I and II delay-sensitive tasks are Mbps and Mbps, respectively. The packet arrival intervals of the two types are both ms. Moreover, the maximum tolerable RTTs are ms and ms, respectively. The universal set of communication tasks tested in the experiment includes a delay-sensitive task with arrival data rate (Task 1) and a file delivery task (Task 2) on the -th link; a delay-sensitive task with arrival data rate (Task 3) on the -th link; a delay-sensitive task with arrival data rate (Task 4) on the -th link. The quality of the -th, -th, and -th links depends on their distances and the propagation environment, which could be changed in the experiment.
In the experiment, the duration of the scheduling period is 1 second, the CW size takes values from , and throughput limitation takes values from , where = 600 Mbps. Moreover, in addition to the background interference, the interfering traffic between two interfering UEs, denoted as and , is generated with a random data rate and BE priority in the same channel.
The preliminary observation dataset is collected from the following three different traffic patterns (TPs): (1) Tasks 1 and 2 are activated; (2) Tasks 1, 2, and 3 are activated; and (3) Tasks 1, 2, 3 and 4 are activated. In all the TPs, the communication distances of the links are altered to exploit the diversity of link rates. In the collection of , the testing scheduling action is first applied in the first half of the scheduling period, where the CW size and throughput limitations are and Mbps respectively. Then, a randomized action is applied in the second half. QoS observations of both actions are collected in each scheduling period.
Based on dataset , the performance of the three TPs is quantized into 3, 6, and 6 regions, respectively. Then, QoS imitators are trained according to Section V with , . Given the trained QoS imitators, the Q-network is further trained as elaborated in Section V with .
To demonstrate the performance gain, the proposed framework is compared with two baselines. The first baseline, namely Standard EDCA, relies on the conventional 802.11 EDCA protocol. The second baseline, namely Rate Control Only, adapts the throughput limitation of file delivery tasks via the proposed framework with the CW sizes following the 802.11 EDCA protocol. The performance evaluation and comparison are conducted in distinct test scenarios listed in Table I, where only the first scenarios have been measured in the preliminary observation dataset .
Scenario | TP | | Scenario | TP | |
---|---|---|---|---|---|
1 | 1 | 563, 499, 572 | 7 | 3 | 563, 424, 572 |
2 | 2 | 563, 499, 572 | 8 | 2 | 563, 400, 346 |
3 | 3 | 563, 499, 572 | 9 | 3 | 563, 400, 346 |
4 | 2 | 563, 370, 572 | 10 | 2 | 459, 499, 572 |
5 | 3 | 563, 370, 572 | 11 | 3 | 459, 499, 572 |
6 | 3 | 563, 499, 476 | | | |
The performance comparison of the proposed framework and the two baselines in the first test scenarios is illustrated in Fig. 1, where the online training of the Q-network is not applied in the proposed framework and Baseline 2. It can be observed that the proposed Q-network, trained offline via imitators, significantly outperforms the conventional EDCA mechanism. Moreover, the performance gain of Baseline 2 over Baseline 1 demonstrates the necessity of the throughput limitation, which, to the best of our knowledge, has not been investigated in the existing literature.
The performance comparison in the test scenarios to is illustrated in Fig. 2. Since these test scenarios are not measured in the preliminary observation dataset , the performance gain of the proposed scheme over Baseline 1 demonstrates the good generalization capability of the proposed Q-network. It can also be observed that the online training could further improve the scheduling performance of the Q-network, which has already been trained in the offline stage.
VII Conclusion
In this paper, a reinforcement-learning-based framework, namely ReinWiFi, is proposed for the application-layer QoS optimization of WiFi networks. Due to the absence of PHY-layer and MAC-layer status, the historical scheduling parameters and QoS observations are considered as the system state in the determination of the current scheduling parameters. Because of the unknown interference and vendor-dependent implementations, a novel Q-network is proposed to track the relation between the system state, scheduling parameter, and the overall QoS. Moreover, an imitation learning method is introduced to improve the training efficiency. It is demonstrated via the testbed that the proposed framework, with the dynamic adaptation of CW size and throughput limitation, significantly outperforms the conventional EDCA mechanism.
References
- [1] A. Kumar, G. Verma, C. Rao, A. Swami, and S. Segarra, “Adaptive contention window design using deep Q-learning,” in IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP). IEEE, Jun. 2021, pp. 4950–4954.
- [2] L. Zhang, H. Yin, Z. Zhou, S. Roy, and Y. Sun, “Enhancing WiFi multiple access performance with federated deep reinforcement learning,” in IEEE Veh. Technol. Conf. (VTC Fall). IEEE, Nov. 2020, pp. 1–6.
- [3] T. Rashid, M. Samvelyan, C. S. De Witt, G. Farquhar, J. Foerster, and S. Whiteson, “Monotonic value function factorisation for deep multi-agent reinforcement learning,” J. Mach. Learn. Res., vol. 21, no. 1, pp. 7234–7284, Jan. 2020.
- [4] Z. Guo, Z. Chen, P. Liu, J. Luo, X. Yang, and X. Sun, “Multi-agent reinforcement learning-based distributed channel access for next generation wireless networks,” IEEE J. Sel. Areas Commun., vol. 40, no. 5, pp. 1587–1599, May 2022.
- [5] R. Ali, N. Shahin, Y. B. Zikria, B.-S. Kim, and S. W. Kim, “Deep reinforcement learning paradigm for performance optimization of channel observation–based MAC protocols in dense WLANs,” IEEE Access, vol. 7, pp. 3500–3511, 2018.
- [6] H. v. Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with double Q-learning,” in Proc. AAAI Conf. Artif. Intell., Feb. 2016, pp. 2094–2100.
- [7] S.-C. Chen, C.-Y. Li, and C.-H. Chiu, “An experience driven design for IEEE 802.11ac rate adaptation based on reinforcement learning,” in Proc. IEEE Int. Conf. Comput. Commun. (INFOCOM), May 2021, pp. 1–10.
- [8] Wi-Fi Alliance, Wi-Fi Display Technical Task Group, “Wi-Fi display technical specification v1.2n,” 2011.
- [9] J. MacQueen, “Some methods for classification and analysis of multivariate observations,” in Proc. Berkeley Symp. Math. Statist. Probab., vol. 1, Oakland, CA, USA, 1967, pp. 281–297.
- [10] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in Proc. Int. Conf. Neural Inf. Process. Syst. (NIPS), Dec. 2017, pp. 6000–6010.
- [11] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, Feb. 2015.