Decision Transformer for Wireless Communications: A New Paradigm of Resource Management

Jie Zhang, Jun Li, Long Shi, Zhe Wang, Shi Jin, Wen Chen, and H. Vincent Poor

Jie Zhang, Jun Li, and Long Shi are with the School of Electronic and Optical Engineering, Nanjing University of Science and Technology, Nanjing 210094, China (e-mail: {zhangjie666, jun.li, long.shi}@njust.edu.cn). Zhe Wang is with the School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China (e-mail: [email protected]). Shi Jin is with the National Mobile Communications Research Laboratory, Southeast University, Nanjing 210096, China (e-mail: [email protected]). Wen Chen is with the Department of Electronic Engineering, Shanghai Jiao Tong University, Shanghai 200240, China (e-mail: [email protected]). H. Vincent Poor is with the Department of Electrical and Computer Engineering, Princeton University, Princeton, NJ 08544, USA (e-mail: [email protected]).

Abstract

As the next generation of mobile systems evolves, artificial intelligence (AI) is expected to deeply integrate with wireless communications for resource management in variable environments. In particular, deep reinforcement learning (DRL) is an important tool for addressing stochastic optimization issues of resource allocation. However, DRL has to start each new training process from the beginning once the state and action spaces change, causing low sample efficiency and poor generalization ability. Moreover, each DRL training process may take a large number of epochs to converge, which is unacceptable for time-sensitive scenarios. In this paper, we adopt an alternative AI technology, namely, the Decision Transformer (DT), and propose a DT-based adaptive decision architecture for wireless resource management. This architecture innovates through constructing pre-trained models in the cloud and then fine-tuning personalized models at the edges. By leveraging the power of DT models learned over extensive datasets, the proposed architecture is expected to achieve rapid convergence with many fewer training epochs and higher performance in a new context, e.g., similar tasks with different state and action spaces, compared with DRL. We then design DT frameworks for two typical communication scenarios: intelligent reflecting surface-aided communications and unmanned aerial vehicle-aided edge computing. Simulations demonstrate that the proposed DT frameworks converge $3$-$6$ times faster and achieve better performance than a classic DRL method, namely, proximal policy optimization.

Index Terms:

Decision transformer, reinforcement learning, stochastic optimization, resource management

I Introduction

The sixth generation of mobile communication systems (6G) is undergoing extensive development with anticipated advances in rapid data transmission, extensive coverage areas, and diverse service offerings [1]. However, optimizing network performance in time-varying contexts poses challenging issues, especially when facing the dynamic nature of network topologies, fluctuating service demands, unpredictable interference, and so on [2]. Traditional static optimization methods, such as convex optimization, are unable to cope with these stochastic problems. In this regard, the application of artificial intelligence (AI) has begun to show its strength in addressing these challenges. Reinforcement learning (RL), as a key technology of AI, is capable of learning good policies via interactions with the environment, thereby offering a promising solution to stochastic optimization problems [3].

In recent years, extensive work has addressed wireless resource management based on RL techniques. For example, Liu et al. [4] proposed a double Q-network RL algorithm to jointly optimize the trajectory of an unmanned aerial vehicle (UAV) and task offloading for achieving high system throughput. In [5], Zhang et al. developed a multi-agent RL framework by jointly optimizing the transmit beamforming at base stations (BSs) and the phase shifts at intelligent reflecting surfaces (IRSs). In [6], Yin et al. put forward a federated RL structure to enhance the quality of service for mobile users through optimizing the time-frequency resource allocation.

However, employing RL directly in wireless communications presents several inherent drawbacks. First, RL’s reliance on extensive interactions with the environment leads to a significant delay in achieving real-time responses and maintaining network efficiency, particularly in time-sensitive communication scenarios [7]. It often takes a large number of training epochs for RL algorithms to converge to an effective policy. Second, the sparse and delayed feedback from the environment further degrades the learning process, resulting in prolonged training durations or suboptimal performance [8].

Apart from the above two drawbacks, the most prominent weakness of RL algorithms is the difficulty of generalization. That is, a well-trained RL policy cannot be directly transferred to similar tasks, especially when the state space or action space varies. For example, in UAV resource management, when the number of UAVs changes from two to three, the state space changes. An RL agent previously well trained for the $2$-UAV task has to learn a new policy from scratch to fit the $3$-UAV task, even if the two tasks share great similarities. This limitation wastes training resources, and thus motivates the development of new learning frameworks capable of generalizing across similar tasks.

To address the challenges raised by traditional RL’s inefficiency in utilizing limited samples and adapting to new tasks, the development of the Decision Transformer (DT) represents a significant advancement [9]. Drawing inspiration from the Transformer architecture [10], DT leverages the advanced comprehension and generalization properties of Transformers for handling sophisticated decision-making problems. Unlike conventional RL methods that rely heavily on value function approximation and policy optimization, the DT incorporates a novel approach to sequential decision-making, which is particularly adept at addressing issues rooted in RL [11, 12].

To be specific, a DT exhibits the capacity to generalize from a pre-trained model and swiftly adapt to new tasks through local fine-tuning. Moreover, its few-shot learning capability during personalized fine-tuning allows for rapid strategy adjustments with a few samples, significantly reducing training time for new tasks. This property makes the DT superior to conventional RL in efficiently solving policy optimization for a set of similar tasks, particularly when there are slight variations in state or action spaces among these tasks. Such adaptability is crucial for wireless networks with stringent real-time requirements, as it aligns with low-latency demands by facilitating online adjustments without the necessity for extensive retraining or offline simulations [13]. Introducing the DT to wireless communications offers two advantages. First, the DT’s ability to learn from large-scale historical samples is beneficial for enhanced comprehension of intricate communication environments and fosters more efficient decision-making. Second, the DT’s generalization capacity can handle a set of communication tasks with rapid convergence.

In this paper, we introduce the use of the DT to the field of wireless communications. The main contributions of this article are summarized as follows:

• To the best of our knowledge, this paper is the first to develop DT-based strategies for wireless resource management. We propose a cloud-edge collaborative architecture in mobile systems, where DT models are pre-trained in the cloud based on samples collected from the edges and then fine-tuned locally at the edges to fit different tasks. This architecture is able to efficiently facilitate model training and inference across the edges and the cloud, inspiring the design of native AI in 6G systems.

• We develop two potential applications of DTs in wireless resource allocation, i.e., IRS-aided communications and UAV-aided mobile edge computing (MEC). For the IRS scenario, we propose a prompt-based method and an action embedding strategy to facilitate the pre-training and local fine-tuning of the DT model, enabling it to swiftly transfer and generalize to new tasks where the number of IRS elements varies. For the UAV scenario, we design a parameter light-weighting method to ensure low latency and real-time response for the UAV decision model. Also, the locally fine-tuned models are adaptive to tasks with different numbers of UAVs.

• Our simulations demonstrate the superiority of the proposed DT architecture over proximal policy optimization (PPO), a classic RL method. Specifically, in the IRS scenario, the DT converges $3$-$6$ times faster than PPO and improves the performance by $5.1\%$, while in the UAV scenario, the DT converges $4$-$6$ times faster than PPO.

The rest of this paper is organized as follows. In Section II, we provide some background on DTs. In Section III, we propose a novel DT architecture for wireless networks with cloud-edge collaboration. In Section IV, we discuss the design of model training and fine-tuning for IRS and UAV scenarios. In Section V, we analyze the simulation results, while Section VI summarizes the paper.

II Preliminary

To explore the potential of DTs in wireless communications, it is useful to begin with an understanding of their intrinsic behavior. Recent research shows that the use of DTs via sequence modeling is proving to be highly beneficial in complex environments. In these scenarios, the dynamic environmental states and agents’ actions evolve continuously, necessitating a robust model that can adapt and learn effectively over time. The DT reformulates the optimal policy learning process as a sequential modeling problem. We introduce three features of DT as follows.

• Sequence Modeling: A DT model is trained on a dataset composed of state, action, and return-to-go tuples collected at various time steps. The tuples of one episode, arranged in time order, form the basic unit of the learning process, termed a trajectory sequence. The return-to-go represents the cumulative discounted return from the current step to the final step of an episode. This emphasis on cumulative returns facilitates efficient policy improvement. Compared with RL, the trajectory sequence provides a more comprehensive understanding of the consequences of actions.

• Offline Training: Another distinguishing feature of the DT is its reliance on offline training. The training process utilizes the collected trajectories to train the model in a non-interactive setting. Such an autoregressive modeling approach enables the DT to be pre-trained offline and then readily extended through online fine-tuning.

• Adaptive Action Prediction: A notable characteristic of the DT is its capability of adaptive action prediction. Unlike RL models that react only to current states, a DT leverages the entire sequence of historical information with positional encodings to predict future actions. This feature is particularly vital in stochastic and dynamic environments, since the optimal action at any step may depend on a series of prior experiences and decisions.
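As a minimal sketch of the sequence-modeling view described above, the following Python snippet computes returns-to-go from per-step rewards and interleaves them with states and actions into a trajectory sequence. The function names, the discount factor `gamma`, and the token layout are illustrative assumptions and are not taken from the proposed framework.

```python
from typing import Any, List, Sequence, Tuple

Token = Tuple[str, int, Any]  # (token type, time step, value); hypothetical layout


def returns_to_go(rewards: Sequence[float], gamma: float = 1.0) -> List[float]:
    """Cumulative (optionally discounted) return from each step to the episode end."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the return already accumulated after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg


def build_trajectory(
    states: Sequence[Any],
    actions: Sequence[Any],
    rewards: Sequence[float],
    gamma: float = 1.0,
) -> List[Token]:
    """Interleave (return-to-go, state, action) tokens in time order."""
    sequence: List[Token] = []
    for t, (g, s, a) in enumerate(zip(returns_to_go(rewards, gamma), states, actions)):
        sequence.append(("rtg", t, g))
        sequence.append(("state", t, s))
        sequence.append(("action", t, a))
    return sequence
```

With rewards `[1.0, 2.0, 3.0]` and `gamma = 1.0`, the returns-to-go are `[6.0, 5.0, 3.0]`: the first token of the sequence therefore tells the model how much return remains to be collected over the whole episode.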

III System Architecture

In this section, we construct the system architecture of wireless resource management with a DT. As shown in Fig. 1, we introduce an architecture that leverages a DT in a cloud-edge synergistic manner, establishing a general foundation applicable to a wide range of scenarios in wireless communications. We particularly focus on two scenarios, i.e., IRS-aided communications and UAV-aided MEC, with detailed illustrations on the process of pre-training, assignment, and local fine-tuning of DT models.

Figure 1: Cloud-edge coordinated DT architecture for wireless resource management. The platform consists of two layers: the cloud layer and the edge layer. For each scenario, training samples are collected from $N$ cases (a set of tasks for acquiring sufficient samples) and then uploaded to the cloud. After undergoing sequential modeling and time embedding, the trajectory sequences contain feature information about state, decision, and return. These sequences are then input into the DT neural network, whose parameters are updated by back-propagation. Afterwards, this pre-trained model in the cloud is assigned to edge devices for $M$ applications. Each application represents a local fine-tuning performed to obtain a personalized model tailored for a new task.

III-A Sample Collection and Uploading

The training process of a DT model commences with sample collection at the edge. First, the state, action, and return tuples of a sample are defined according to a specific scenario, e.g., IRS-aided communications [14], or UAV-aided MEC. In an IRS scenario, the state space may include channel state information (CSI) and real-time rate requests from end users. The decision may involve adjustments in the transmission power of a BS and the phase shift configuration of the IRS. The return can be linked to the transmission rate or quality of service (QoS). In the context of a UAV-aided MEC scenario, the state may encompass the remaining energy and computing capability of UAVs as well as the workload queues of mobile devices. The decision may include path planning of UAVs, user association, task offloading, and computing power. The return can be the cumulative throughput of the computing workload.

In each scenario, a variety of tasks may be encountered. These tasks, although similar, may have slight variations in state or action spaces. For example, different IRS tasks may vary in the number of IRSs, the phase shift levels, the number of elements in each IRS device, and so on, all of which change the state and action spaces. DTs are able to handle these variations since transformers allow flexible dimensions of input and output.

III-B Model Pre-training and Assignment

The cloud serves as a platform for developing versatile and robust DT models given various communication scenarios. By performing sequential modeling based on samples from the data buffer and incorporating time embedding techniques, the cloud generates trajectory sequences that include the feature information of the state, action, and return. Upon obtaining these trajectory sequences, the DT models employ self-attention mechanisms to capture the complex temporal correlations within a sequence, while regularization layers (Add $\&$ Norm), feedforward networks, and residual connections are utilized to enhance the learning process and prevent overfitting. The inclusion of adaptive layers enables the DT models to adjust parameters according to the new tasks, improving their generalization ability.

The models then calculate the loss function by comparing expert actions (typically derived from high-quality datasets) with the predicted output actions. A loss function can be in the form of mean squared error loss or cross-entropy loss. Through this supervised learning approach, the DT models back-propagate the loss to update parameters, enabling them to iteratively learn how to generate optimal actions under various conditions, thereby facilitating the convergence process. Afterwards, the pre-trained models completed in the cloud are assigned to the edge for further fine-tuning.
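The two loss forms mentioned above can be sketched as follows. This is a minimal pure-Python illustration, assuming that continuous actions are compared with a mean squared error and that discrete actions are scored by a cross-entropy over the model's output logits; the function names and the toy inputs are hypothetical.

```python
import math
from typing import Sequence


def mse_loss(predicted: Sequence[float], expert: Sequence[float]) -> float:
    """Mean squared error between predicted and expert continuous actions."""
    if len(predicted) != len(expert):
        raise ValueError("predicted and expert actions must have the same length")
    return sum((p - e) ** 2 for p, e in zip(predicted, expert)) / len(expert)


def cross_entropy_loss(logits: Sequence[float], expert_index: int) -> float:
    """Negative log-probability of the expert's discrete action under a softmax."""
    # Subtract the maximum logit before exponentiating for numerical stability.
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[expert_index]
```

In a training loop, the gradient of the chosen loss with respect to the DT parameters would be back-propagated by an automatic differentiation library; that step is omitted here.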

III-C Model Fine-tuning and Generalization

When confronting a new application task, the edge collects samples of this task and performs an efficient few-shot fine-tuning drawn upon the pre-trained DT model assigned from the cloud. This process tailors the model to align with the specific features and requirements of the local environment. Such a fine-tuning process aims at obtaining a personalized local DT model, so that it can fulfill optimal decisions for resource management of this task.

Furthermore, this DT-based architecture can be readily applied to other future communication scenarios beyond IRS-aided systems and UAV-aided MEC networks. It also has the potential to be applied to cutting-edge scenarios such as the Internet of Things, space-air-ground integrated networks, and so on, where resource management in stochastic environments is essential. Through continuous model pre-training and fine-tuning, the DT-based cloud-edge collaborative framework ensures rapid generalization and transfer of models across different tasks, achieving swift convergence and real-time response, even with few samples in the new tasks. This advantage of DTs is critical for achieving high performance in future mobile systems.

IV Two Typical Application Scenarios

In this section, we develop the DT architectures for two typical scenarios, i.e., IRS-aided communications and UAV-aided MEC. We will give a detailed illustration on how we design the training-and-tuning process for achieving fast convergence in these two scenarios.

IV-A Scenario 1: IRS-aided Communications

In this scenario, we construct a DT architecture for IRS-aided communications. Although the IRS technology can enhance the network capacity with convenient deployment and programmability, it may cause extra dynamicity and randomness in beamforming design. RL algorithms, while capable of handling the stochastic issues, often struggle to adapt when encountering new tasks in a changed environment. Thus, the DT architecture’s robust generalization capabilities can provide an advantage by achieving optimal IRS beamforming.

As illustrated in Fig. 2, we propose a DT architecture for IRS-assisted communications, aiming to maximize the transmission rate. First, acknowledging the diversity in environmental constraints, e.g., the various user requirements and the heterogeneity in IRS element configurations in terms of quantity and precision, we integrate these constraints into the prompts for the DT model. Specifically, prompts that describe the current environmental constraints are embedded alongside the state, action, and return sequences, giving the model additional context. With this richer context, the DT can make more rational decisions that take into account the unique characteristics of the current situation, thereby enhancing its decision-making efficiency.
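One simple way to realize such prompts is to encode the constraints as extra tokens placed in front of the trajectory tokens. The sketch below assumes this prepending layout; the `IRSPrompt` fields, the token naming, and the unit of `min_rate` are hypothetical choices made for illustration, and the actual prompt design may differ.

```python
from dataclasses import dataclass
from typing import List, Tuple

Token = Tuple[str, float]  # (token name, value); hypothetical layout


@dataclass(frozen=True)
class IRSPrompt:
    num_elements: int  # number of IRS reflecting elements
    phase_bits: int    # phase-shift quantization precision, in bits
    min_rate: float    # user rate requirement (assumed unit: bit/s/Hz)

    def to_tokens(self) -> List[Token]:
        """Encode each environmental constraint as one prompt token."""
        return [
            ("prompt:num_elements", float(self.num_elements)),
            ("prompt:phase_bits", float(self.phase_bits)),
            ("prompt:min_rate", self.min_rate),
        ]


def with_prompt(prompt: IRSPrompt, trajectory: List[Token]) -> List[Token]:
    """Prepend prompt tokens so that every later token can attend to them."""
    return prompt.to_tokens() + list(trajectory)
```

For example, `with_prompt(IRSPrompt(64, 2, 1.5), trajectory)` yields a sequence whose first three tokens describe a 64-element IRS with 2-bit phase shifts, followed by the unchanged trajectory. Switching to a 128-element task then only changes the prompt values, not the sequence format.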

Regarding the action space, we confront the challenge of a hybrid action space comprising both discrete and continuous actions, such as IRS phase shifts (discrete) and BS transmission power (continuous). The direct application of DT outputs to such a mixed action space will potentially lead to suboptimal policies due to the mismatch in action representations. To resolve this issue, we propose an action embedding technique that maps discrete actions into a high-dimensional embedding space. When the model predicts a discrete action, we employ a nearest neighbor matching method in the embedding space to find the closest viable discrete action. For continuous actions, we propose an auto-regressive approach, enhancing the DT capability to generate accurate continuous actions. This innovation not only addresses the inherent challenges of a mixed action space, but also elevates the model performance. The architecture is further refined by designing a dual loss function that accommodates both action types, enabling effective model updates.
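The nearest neighbor matching step for discrete actions can be sketched as below. This is an illustration under stated assumptions: the Euclidean distance is used as the similarity measure, and the toy embedding table places four 2-bit phase-shift levels on the unit circle in two dimensions, whereas a learned table would be higher-dimensional. The names `nearest_discrete_action` and `PHASE_EMBEDDINGS` are hypothetical.

```python
import math
from typing import Dict, Sequence


def nearest_discrete_action(
    predicted: Sequence[float],
    embedding_table: Dict[int, Sequence[float]],
) -> int:
    """Return the discrete action whose embedding is closest to `predicted`."""
    return min(
        embedding_table,
        key=lambda action: math.dist(predicted, embedding_table[action]),
    )


# Toy table (assumption): phase level k corresponds to a phase shift of k * pi / 2.
PHASE_EMBEDDINGS: Dict[int, Sequence[float]] = {
    k: (math.cos(k * math.pi / 2), math.sin(k * math.pi / 2)) for k in range(4)
}
```

A model output such as `(0.9, 0.2)` is then decoded to phase level `0`, since the embedding of level `0` is `(1.0, 0.0)`. Decoding through the table guarantees that the selected action is always a feasible phase-shift level.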

The offline training of the DT model is designed to extract knowledge from extensive historical datasets, encompassing variables like signal strength, CSI, IRS configurations, and user locations. The offline training empowers the DT model to identify effective IRS beamforming policies by analyzing complex datasets. Then in the real-time application phase, the edge fine-tunes the DT model based on real-time samples and new prompt constraints. The essence of this phase underscores prompt responsiveness and adaptability, allowing for timely recalibration of decision-making. The transfer and generalization ability of the DT allows it to excel not only in its original IRS tasks but also in a new task. The DT architecture maintains its operational efficiency and effectiveness across different IRS tasks with rapid convergence, showcasing its versatility in the face of environmental diversity.

IV-B Scenario 2: UAV-aided MEC

In the UAV-aided MEC scenario, each UAV can be treated as an intelligent agent. Due to heterogeneity in computing power, size of coverage area, and battery capacity, each UAV draws upon a different set of samples for decision-making. If each decision model were independently trained from scratch, it would consume considerable computational resources for model training without guaranteeing optimal decision-making performance. Therefore, it is highly efficient for these UAVs to locally fine-tune the pre-trained DT model assigned from the cloud with a few local samples.

As shown in Fig. 3, each UAV operates within a distinct set of operational states, e.g., hovering, charging, and data offloading and computing. It is equipped to make decisions in real time according to its partial observations, e.g., geographic position, battery life, and potential physical obstructions. Moreover, each UAV can benefit from information sharing with its neighboring UAVs, facilitating distributed collaboration. For example, knowledge sharing between UAV A and UAV B allows for the refinement of their respective flight trajectories, task allocations, and historical strategies to enhance the efficiency of the two UAVs’ maneuvers. On this premise, each UAV harnesses its localized samples to customize the pre-trained DT model, enabling a personalized response to diverse mission demands.

In this scenario, to satisfy the critical latency and responsiveness requirements, we propose a parameter light-weighting method in our DT architecture. We first design a parameter sharing method across different heads within the transformer structure. Instead of each head in the multi-head attention mechanism learning a unique transformation, parameter sharing allows heads across different layers to utilize a common set of weights. This drastically reduces the total number of neural network parameters, making the model more lightweight. To ensure that this parameter reduction does not degrade the decision-making abilities, we propose a loss function based on parameter similarity, which can be adjusted according to the specific requirements of UAV tasks. This function is designed to balance the trade-off between inference speed and model performance. By reducing the model size and complexity through parameter sharing, the DT model is able to make decisions with a fast inference process.
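To give a rough sense of the savings, the sketch below counts the attention projection weights of a transformer with and without weight sharing across layers. It assumes the standard multi-head attention layout with four $d \times d$ projections (query, key, value, and output) per layer, ignores biases, and treats sharing as one common set of projections reused by every layer; the function name and the model width of 128 are hypothetical.

```python
def attention_params(d_model: int, num_layers: int, share_across_layers: bool) -> int:
    """Count query, key, value, and output projection weights (biases ignored)."""
    per_layer = 4 * d_model * d_model
    # With sharing, every layer reuses the same projections, so they are counted once.
    return per_layer if share_across_layers else per_layer * num_layers


unshared = attention_params(d_model=128, num_layers=3, share_across_layers=False)
shared = attention_params(d_model=128, num_layers=3, share_across_layers=True)
reduction = unshared / shared  # equals the number of layers under this assumption
```

Under these assumptions, a three-block model stores 65,536 attention weights instead of 196,608, i.e., a threefold reduction in the attention part of the model. The feedforward and embedding parameters are not affected by this form of sharing.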

Moreover, the conventional transformer structure employs a dense attention mechanism, where each element in the sequence needs to calculate the attention weight with all other elements. This process can be computationally intensive. Hence, we develop a sparse attention mechanism that masks future information and focuses within a pre-defined window for calculating the attention weights. This concept of sparse attention capitalizes on the insight that the immediate past information is more relevant than distant information in these tasks. Implementing this mechanism reduces the overall complexity of the attention calculation from quadratic to linear with respect to the sequence length, which significantly decreases the computational overhead. These adjustments not only reduce the model inference latency, ensuring swifter decision-making, but also save computational resources. The incorporation of the DT architecture with UAV-aided MEC marks a significant innovation, augmenting the efficiency of UAVs while simultaneously elevating the performance of the whole system.
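A windowed causal mask of this kind can be sketched in a few lines. The snippet assumes a fixed window of `window` most recent positions (including the current one); the function names are hypothetical. Counting the allowed query-key pairs shows why the cost grows linearly with the sequence length for a fixed window.

```python
from typing import List


def windowed_causal_mask(seq_len: int, window: int) -> List[List[bool]]:
    """mask[i][j] is True when query position i may attend to key position j."""
    return [
        # Allow only past or current positions (j <= i) within the window.
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


def attended_pairs(mask: List[List[bool]]) -> int:
    """Number of query-key pairs for which an attention weight is computed."""
    return sum(sum(row) for row in mask)
```

For a sequence of length 5, a window of 2 leaves 9 attended pairs, compared with 15 for a dense causal mask (window of 5). In general each query attends to at most `window` keys, so the number of pairs is bounded by `seq_len * window` rather than growing with `seq_len ** 2`.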

V Numerical Results and Analysis

In this section, we provide simulations for the two scenarios to evaluate the performance of our proposed DT architecture. The DT model is pre-trained and fine-tuned on NVIDIA RTX A6000 GPU hardware, employing the AdamW optimizer with a learning rate of $10^{-4}$ . The main architecture encompasses three transformer blocks, each designed with a causal attention network, a feedforward network, and a dropout network with a dropout rate of $0.1$ .

We first consider the scenario of an IRS-aided communication system consisting of a BS, an IRS, and a mobile user. The channel model in [5] is adopted. Each training epoch consists of $100$ time slots. Each environment exhibits channel characteristics that possess slightly distinct channel exponents. We define the CSI as the state, the IRS phase shift as the action, and the data rate as the return. The offline datasets for pre-training are collected from PPO, a conventional RL algorithm [15]. During the fine-tuning process, a handful of samples are generated by interacting with the environment of the new task employing the PPO algorithm from scratch. These few-shot samples are utilized to precisely fine-tune the DT model, namely DT-FT. Then we evaluate the performance of the DT-FT in the new environment.

In Fig. 4, we compare the average data rate performance of the proposed DT-FT with the PPO algorithm and a random selection method by considering $128$ and $64$ IRS elements as the two new tasks. The comparison shows that the DT-FT outperforms PPO in both new tasks. To be specific, the DT-FT achieves a $5.1\%$ performance improvement and converges $3$-$6$ times faster than PPO. This is primarily attributed to the DT’s trajectory learning capability, which enables it to capture the intricate inner relationships between states, actions, and returns. Moreover, the DT’s proficiency in leveraging offline training samples allows it to converge swiftly in new tasks using only a small number of samples.

We further consider the scenario of UAV-aided MEC, where UAVs fly within a $100\times 100$ $\mathrm{m}^{2}$ region to provide computing services to ground users. Our objective is to optimize the flight trajectories of the UAVs and their associations with users to maximize long-term throughput of computing workload. The simulation encompasses $10$ users, whose mobility is characterized by a Gaussian Markov model. The volumes of computational workload are uniformly distributed within the range of $[10,20]$ Mb. The state is composed of the UAVs’ positional coordinates coupled with local information sharing from users or other UAVs, such as the volume of remaining workload and previous location. The action includes the UAVs’ path planning and the selections of users to serve. The return is defined as workload throughput. The PPO algorithm is utilized for collecting training samples that are used to pre-train the DT model. Subsequently, the DT model undergoes a fine-tuning process tailored to new tasks.
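The user-side simulation inputs can be sketched as follows. The snippet assumes the commonly used Gauss-Markov velocity update $v_t = \alpha v_{t-1} + (1-\alpha)\bar{v} + \sigma\sqrt{1-\alpha^{2}}\,w_t$ with $w_t \sim \mathcal{N}(0,1)$, applied independently to each coordinate, and clips positions to the $100\times 100$ $\mathrm{m}^{2}$ region. The memory factor `alpha`, the spread `sigma`, the zero mean velocity, and the clipping rule are assumptions, since the exact parameterization is not specified here.

```python
import math
import random
from typing import List, Tuple


def gauss_markov_step(
    v_prev: float, v_mean: float, alpha: float, sigma: float, rng: random.Random
) -> float:
    """One Gauss-Markov velocity update; alpha in [0, 1] controls memory."""
    noise = sigma * math.sqrt(1.0 - alpha ** 2) * rng.gauss(0.0, 1.0)
    return alpha * v_prev + (1.0 - alpha) * v_mean + noise


def simulate_user(
    steps: int,
    rng: random.Random,
    alpha: float = 0.8,
    sigma: float = 1.0,
    side: float = 100.0,
) -> List[Tuple[float, float]]:
    """Trajectory of one ground user, clipped to the square service region."""
    x, y = rng.uniform(0.0, side), rng.uniform(0.0, side)
    vx = vy = 0.0
    path = [(x, y)]
    for _ in range(steps):
        vx = gauss_markov_step(vx, 0.0, alpha, sigma, rng)
        vy = gauss_markov_step(vy, 0.0, alpha, sigma, rng)
        x = min(max(x + vx, 0.0), side)
        y = min(max(y + vy, 0.0), side)
        path.append((x, y))
    return path


def sample_workloads(
    num_users: int, rng: random.Random, low: float = 10.0, high: float = 20.0
) -> List[float]:
    """Computational workload per user in Mb, drawn uniformly from [low, high]."""
    return [rng.uniform(low, high) for _ in range(num_users)]
```

Calling `simulate_user(100, random.Random(0))` for each of the $10$ users and `sample_workloads(10, rng)` once per episode would produce the mobility traces and workload volumes that make up part of the state in this scenario.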

As depicted in Fig. 5, we plot the throughput per episode for two new tasks with $3$ UAVs and $4$ UAVs, respectively. The results show that the DT converges $4$-$6$ times faster than PPO in the new tasks, showcasing the DT’s powerful few-shot learning ability. This advantage highlights the proficiency of the DT in distilling generalized knowledge from pre-trained tasks and in adapting to new tasks, indicating its significant potential and applicability in the UAV-aided MEC scenario.

VI Conclusions

In this paper, we have explored DT-based stochastic resource management for wireless communications, addressing the generalization challenge that is commonly encountered in conventional RL. We have proposed a novel DT architecture that leverages the strengths of the DT through cloud-edge coordination. Simulations indicate that the proposed DT architecture is superior to conventional RL in terms of convergence speed and performance, converging $3$-$6$ times faster in the IRS scenario and $4$-$6$ times faster in the UAV scenario, with a $5.1\%$ performance improvement in the IRS scenario. Our approach is not limited to the discussed scenarios of IRS-aided communications and UAV-aided MEC, but is also a promising approach to developing scalable solutions for other scenarios, such as resource allocation for the Internet of Things, satellite communications, and so on.

References

[1] K. B. Letaief, W. Chen, Y. Shi, J. Zhang, and Y.-J. A. Zhang, “The roadmap to 6G: AI empowered wireless networks,” IEEE Commun. Mag., vol. 57, no. 8, pp. 84–90, Aug. 2019.
[2] Y. Xu, G. Gui, H. Gacanin, and F. Adachi, “A survey on resource allocation for 5G heterogeneous networks: Current research, future trends, and challenges,” IEEE Commun. Surv. Tutor., vol. 23, no. 2, pp. 668–695, Feb. 2021.
[3] N. C. Luong, D. T. Hoang, S. Gong, D. Niyato, P. Wang, Y.-C. Liang, and D. I. Kim, “Applications of deep reinforcement learning in communications and networking: A survey,” IEEE Commun. Surv. Tutor., vol. 21, no. 4, pp. 3133–3174, May 2019.
[4] Q. Liu, L. Shi, L. Sun, J. Li, M. Ding, and F. Shu, “Path planning for UAV-mounted mobile edge computing with deep reinforcement learning,” IEEE Trans. Veh. Technol., vol. 69, no. 5, pp. 5723–5728, Mar. 2020.
[5] J. Zhang, J. Li, Y. Zhang, Q. Wu, X. Wu, F. Shu, S. Jin, and W. Chen, “Collaborative intelligent reflecting surface networks with multi-agent reinforcement learning,” IEEE J. Sel. Top. Signal Process., vol. 16, no. 3, pp. 532–545, Apr. 2022.
[6] Z. Yin, Z. Wang, J. Li, M. Ding, W. Chen, and S. Jin, “Decentralized federated reinforcement learning for user-centric dynamic TFDD control,” IEEE J. Sel. Top. Signal Process., vol. 17, no. 1, pp. 40–53, Jan. 2023.
[7] H. Ke, J. Wang, L. Deng, Y. Ge, and H. Wang, “Deep reinforcement learning-based adaptive computation offloading for MEC in heterogeneous vehicular networks,” IEEE Trans. Veh. Technol., vol. 69, no. 7, pp. 7916–7929, Jul. 2020.
[8] L. Huang, S. Bi, and Y.-J. A. Zhang, “Deep reinforcement learning for online computation offloading in wireless powered mobile-edge computing networks,” IEEE Trans. Mob. Comput., vol. 19, no. 11, pp. 2581–2593, Nov. 2020.
[9] L. Chen, K. Lu, A. Rajeswaran, K. Lee, A. Grover, M. Laskin, P. Abbeel, A. Srinivas, and I. Mordatch, “Decision transformer: Reinforcement learning via sequence modeling,” in Adv. Neural Inf. Process. Syst., vol. 34, Dec. 2021, pp. 15 084–15 097.
[10] A. Vaswani, N. M. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in Proc. Adv. Neural Inf. Process. Syst., vol. 30, Dec. 2017, pp. 5998–6008.
[11] M. Janner, Q. Li, and S. Levine, “Offline reinforcement learning as one big sequence modeling problem,” in Adv. Neural Inf. Process. Syst., vol. 34, Dec. 2021, pp. 1273–1286.
[12] K.-H. Lee, O. Nachum, M. Yang, L. Y. Lee, D. Freeman, W. Xu, S. Guadarrama, I. S. Fischer, E. Jang, H. Michalewski, and I. Mordatch, “Multi-game decision transformers,” in Adv. Neural Inf. Process. Syst., vol. 35, Dec. 2022, pp. 27 921–27 936.
[13] Q. Zheng, A. Zhang, and A. Grover, “Online decision transformer,” in Int. Conf. Mach. Learn., vol. 162, Jul. 2022, pp. 27 042–27 059.
[14] S. Gong, X. Lu, D. T. Hoang, D. Niyato, L. Shu, D. I. Kim, and Y.-C. Liang, “Toward smart wireless communications via intelligent reflecting surfaces: A contemporary survey,” IEEE Commun. Surv. Tutor., vol. 22, no. 4, pp. 2283–2314, Jun. 2020.
[15] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017.