Next-slot OFDM-CSI Prediction:
Multi-head Self-attention or State Space Model?

Mohamed Akrout, Faouzi Bellili, Amine Mezghani, Robert W. Heath

The authors are with the Department of Electrical and Computer Engineering at the University of Manitoba, Winnipeg, MB, Canada (emails: [email protected], {Faouzi.Bellili,Amine.Mezghani}@umanitoba.ca). R. W. Heath is at the University of California, San Diego (email: [email protected]). This work was supported by the Discovery Grants Program of the Natural Sciences and Engineering Research Council of Canada (NSERC) and the US National Science Foundation (NSF) Grants No. ECCS-1711702 and CNS-1731658.

Abstract

The ongoing fifth-generation (5G) standardization is exploring the use of deep learning (DL) methods to enhance the new radio (NR) interface. Both in academia and industry, researchers are investigating the performance and complexity of multiple DL architecture candidates for specific one-sided and two-sided use cases such as channel state information (CSI) feedback, CSI prediction, beam management, and positioning. In this paper, we focus on the CSI prediction task and study the performance and generalization of the two main DL layers that are being extensively benchmarked within the DL community, namely, multi-head self-attention (MSA) and state-space model (SSM). We train and evaluate MSA and SSM layers to predict the next slot for uplink and downlink communication scenarios over urban microcell (UMi) and urban macrocell (UMa) OFDM 5G channel models. Our numerical results demonstrate that SSMs exhibit better prediction and generalization capabilities than MSAs only for SISO cases. For MIMO scenarios, however, the MSA layer outperforms the SSM one. While both layers represent potential DL architectures for future DL-enabled 5G use cases, the overall investigation of this paper favors MSAs over SSMs.

Index Terms:

CSI prediction, OFDM, slot, 3GPP channel models, multi-head self-attention, state space models.

I Introduction

I-A Background and motivation

Generative artificial intelligence (GenAI) has recently emerged as a result of research advancements in deep learning (DL), with a promising potential to transform the technological future across numerous areas. Specifically, large language models (LLMs) and large multi-modal models (LMMs) developed within the field of natural language processing and computer vision research communities are driving innovation by enhancing automation, language translation services, and human-computer interaction (cf. [1] for a comprehensive overview). While GenAI is being progressively adopted by different industries, some research studies at the intersection of DL and wireless communication proposed the use of LLMs as part of self-organizing networks (SONs) [2]. These networks are expected to be highly autonomous and adaptive as they continuously optimize their functions and parameters depending on the communication conditions and user demands. To accommodate such high flexibility, GenAI for wireless communication comes into play as a key technology to generate personalized communication parameters according to network patterns and KPIs learned from massive Telecom datasets. Such AI generation can target estimation or prediction of parameters pertaining to either the physical or the network layer depending on the nature of the collected datasets and the considered downstream tasks at hand.

In this context, one of the key AI use cases considered in the recent 3rd Generation Partnership Project (3GPP) Releases 17 and 18 is channel state information (CSI) prediction [3]. An important issue with the current CSI reporting system in the new radio (NR) interface is the delay between the CSI’s reporting time and the moment the CSI is actually used. This delay makes the CSI outdated due to the channel time variations. The rate at which the CSI loses its relevance depends on the channel properties and is amplified by the speed of the user equipment (UE). To address this challenge, both model-based and learning-based channel prediction techniques leverage historical CSI correlations to forecast future channel conditions and/or realizations. To capture the dynamic behavior of the channel, model-based methods employ linear extrapolation [4], sum-of-sinusoids [5], and autoregressive models [6] (see [7] for a comprehensive overview). Learning-based approaches using deep neural networks (DNNs) stood out in the 3GPP’s 5G standard discussions as a promising low-complexity strategy to predict the channel and mitigate the impact of outdated CSI. Indeed, when the channel blockage model is not available, model-based methods cannot accurately capture the large number of blockage possibilities. Such a setting is equivalent to having a non-stationary channel whose transitions cannot be predicted well using linear methods.

I-B Related work

Many standard architectures of DNNs have been investigated for multiple-input-multiple-output (MIMO) channel predictions. A multi-layer perceptron (MLP) was used in [8] to predict the downlink CSI from the uplink one under the assumption of a direct user-channel matrix relationship, which is not always applicable. Convolutional neural networks (CNNs) and long short-term memory (LSTM) networks were employed in [9], yielding notable prediction performance compared to traditional methods like maximum likelihood and minimum mean squared error (MMSE). To improve upon this, recurrent neural networks (RNNs) were combined with CNNs for feature extraction, outperforming standalone CNNs for channel prediction [10]. Because the channel is complex-valued and CNNs are designed for real-valued processing only, a complex-valued 3D CNN was proposed for CSI prediction in [11], improving the CSI prediction accuracy of real-valued networks. Graph neural networks (GNNs) have also been applied to CSI prediction as a multivariate time-series forecasting problem [12] by exploiting the spectral and temporal correlations of the historical CSI. To mitigate the sequential processing nature of LSTM networks, transformers rely on the attention mechanism to process entire sequences of input data in parallel, thereby significantly reducing training times and enabling the model to scale with the amount of data more effectively [13]. When applied to CSI prediction, transformers outperformed all other DNN architectures in terms of both mean square error and achievable rate [14].

The collaborative development of the 3GPP standard involving research bodies and industry stakeholders is currently investigating the performance of multiple DNN models in terms of floating-point operations (FLOPS) and memory complexity. The CSI prediction problem follows a one-sided model, i.e., inference can be conducted by one side only: either the base station or the UE. This is in contrast to two-sided models in the 3GPP standard, where the first part of the inference runs on the UE side and the second part runs on the base station side or vice versa (e.g., encoder-decoder models). Consequently, the performance of deployable models depends heavily on the hardware of specific UE/base station vendors. For this reason, it is of utmost importance in practice to investigate the prediction capability of AI layers and avoid stacking dozens of them with the only goal of claiming state-of-the-art performance at the cost of FLOPS and memory complexity. From a wireless communication perspective (i.e., non-language application), the AI models in the 3GPP standard should benefit from the recent architectures of AI layers which have proven effective for GenAI models, namely, the multi-head self-attention (MSA) layer [13] and the state space model (SSM) layer [15]. To see why this is possible, Fig. 1 depicts the striking similarity between CSI prediction in orthogonal frequency division multiplexing (OFDM) systems and next token prediction for LLMs. Specifically, $K$ input word embeddings obtained after tokenization and present at the position interval $p\in[T-K,T]$ are analogous to $K$ input CSI at the time interval $t\in[T-K,T]$ . In other words, the equivalence becomes more apparent when the token positions are substituted by the CSI (e.g., OFDM slot) position in time.

The AI community is currently benchmarking these two layers for LLMs [16], vision-related tasks including classification of images [17] and videos [18], and graph-related tasks [19], just to name a few. We believe it is also timely that the communication community examines how these layers can be leveraged for CSI prediction. If the vision of AI for wireless is to be a core component in future-generation communication systems rather than a specific functionality among other ones, it is important to understand the capabilities of AI layers in terms of both performance and FLOPS as well as area and power consumption metrics when AI models are implemented on FPGAs or ASICs. This is particularly important to pursue because the 3GPP discussions are still ongoing and no final decision about the AI models has been made yet.

I-C Contribution

We study the prediction capabilities of MSA and SSM layers for CSI prediction at the UE side as a one-sided model defined in the 3GPP standard. Different from the aforementioned work, our goal is to neither beat other DNN architectures nor obtain state-of-the-art results by cascading DNN layers at the cost of higher FLOPS. For this reason, we focus on shallow networks with either one single MSA or SSM layer and examine their CSI predictive ability. By doing so, we provide insights into the performance of MSA and SSM layers and their competitiveness to be considered by the industry among the deployable AI models on UE devices. Toward this goal, we first define the task of predicting the next-slot OFDM-CSI and describe the parameters of the 5G wireless channels used for training and evaluation, namely, urban microcell (UMi) and urban macrocell (UMa). We then conduct an exhaustive empirical comparison between MSA and SSM layers for both in-distribution (ID) and out-of-distribution (OOD) evaluations as a function of the SNR and speed of the UE. This is because rigorous investigations of AI models must examine the trade-off between generalization and accuracy [20, 21]. Our empirical investigation reveals the following main results:

• For SISO communication scenarios, SSMs exhibit better generalization capabilities in terms of SNR and user speeds compared to MSAs. For MIMO communication scenarios, however, MSAs outperform SSMs in both ID and OOD evaluations.

• Diversifying the communication scenarios (i.e., many SNR levels within the training dataset) over which DNNs are trained for slot prediction is only beneficial for MSAs in SISO scenarios. This diversification has a negative impact on the CSI prediction MSE for MIMO scenarios, where DNNs trained at lower SNR levels achieve a lower CSI prediction MSE. This can be justified by the fact that introducing noise as a data augmentation technique to the training samples prevents overfitting [22], improves robustness [23], and is equivalent to Tikhonov regularization [24]. Such dataset diversification confirms the challenge of choosing the training dataset and its parameters to train DNNs. This adds to the difficulty of data/model agreement between vendors in the context of future AI-enabled communication use cases.

The code to train and test MSA and SSM layers is available at https://github.com/makrout/Next-Slot-OFDM-CSI-Prediction.

I-D Outline

We structure the rest of this paper as follows. In Section II, we introduce the relevant background of MSA and SSM layers. In Section III, we define the next-slot OFDM-CSI prediction task and present the parameters of the 3GPP OFDM channel models. Our simulation results are presented in Section IV for both SISO and MIMO communications, from which we draw some concluding remarks in Section V.

II Background

In this section, we review the architecture of the MSA and SSM layers at the level of detail needed for their exposition and comparison.

II-A Multi-head attention layer

Let $\bm{X}\in\mathbb{R}^{N\times D}$ be the input sentence, where $N$ is the sequence length and $D$ is the embedding dimension. Let also $D_{h}$ denote the dimension of each self-attention head (a.k.a., the query size) and $H=D/D_{h}$ be the number of heads. A self-attention layer starts by computing query, key, and value matrices $\bm{Q}$ , $\bm{K}$ and $\bm{V}$ from $\bm{X}$ using linear transformations:


$\displaystyle\bm{Q}$	$\displaystyle=\bm{X}\,\bm{W}_{{q}},$	(1a)
$\displaystyle\bm{K}$	$\displaystyle=\bm{X}\,\bm{W}_{{k}},$	(1b)
$\displaystyle\bm{V}$	$\displaystyle=\bm{X}\,\bm{W}_{{v}}.$	(1c)

where $\bm{W}_{{q}}\in\mathbb{R}^{D\times D_{h}}$ , $\bm{W}_{{k}}\in\mathbb{R}^{D\times D_{h}}$ , and $\bm{W}_{{v}}\in\mathbb{R}^{D\times D_{h}}$ are learnable parameters. Eq. (1) can be rewritten in a compact form as

[\bm{Q},\bm{K},\bm{V}]=\bm{X}\,\bm{W}_{{qkv}},

(2)

where $\bm{W}_{{qkv}}\in\mathbb{R}^{D\times 3\,D_{h}}$ is an overall learnable parameter matrix. The attention map $\bm{M}\in\mathbb{R}^{N\times N}$ is then computed by scaled inner products from $\bm{Q}$ and $\bm{K}$ and normalized by the softmax function as follows:

\bm{M}=\textrm{softmax}\left(\frac{\bm{Q}\,\bm{K}^{\top}}{\sqrt{D_{h}}}\right).

(3)

Here, the $ij$ th entry, $M_{ij}$ , in $\bm{M}$ represents the attention score between $\bm{Q}_{i}$ and $\bm{K}_{j}$ . The self-attention operation is then applied on the value vectors to produce the output matrix

\bm{O}=\bm{M}\,\bm{V}\leavevmode\nobreak\ \in\mathbb{R}^{N\times D_{h}}.

(4)

Finally, the output $\bm{Y}\in\mathbb{R}^{N\times D}$ of the self-attention layer is calculated by a learnable linear projection $\bm{W}_{\textrm{proj}}\in\mathbb{R}^{D\times D}$ for the concatenated self-attention outputs of each head, i.e.:

\bm{Y}=[\bm{O}_{1},\bm{O}_{2},\dots,\bm{O}_{H}]\,\bm{W}_{\textrm{proj}}.

(5)

Overall, the MSA layer can be seen as a learnable module that takes an input $\bm{X}$ and returns an output $\bm{Y}$ of the same dimension. Note, however, that both $\bm{X}$ and $\bm{Y}$ are divided logically among $H$ heads. Consequently, different segments of $\bm{X}$ and $\bm{Y}$ are able to learn the correlation patterns of some input chunks in relation to the other ones within the sequence. This division enables the multi-head attention layer to acquire richer correlation patterns within the input sequence $\bm{X}$ .
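To make the data flow of (1)–(5) concrete, the following is a minimal NumPy sketch of one MSA layer. It is an illustration only, not the implementation used in this paper: the per-head stacking of the weights into an array `W_qkv` of shape $(H, D, 3D_h)$, the random initialization, and the sizes `N`, `D`, `H` are assumptions made for the example.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def msa_layer(X, W_qkv, W_proj, num_heads):
    """Multi-head self-attention following Eqs. (1)-(5).

    X      : (N, D) input sequence
    W_qkv  : (H, D, 3*D_h) stacked query/key/value weights, one slice per head
    W_proj : (D, D) output projection
    """
    N, D = X.shape
    D_h = D // num_heads
    heads = []
    for h in range(num_heads):
        # Eq. (2): a single matrix product yields Q, K and V for this head.
        Q, K, V = np.split(X @ W_qkv[h], 3, axis=-1)   # each (N, D_h)
        # Eq. (3): scaled inner products, softmax over the key index.
        M = softmax(Q @ K.T / np.sqrt(D_h), axis=-1)   # (N, N)
        # Eq. (4): attention-weighted sum of the value vectors.
        heads.append(M @ V)                            # (N, D_h)
    # Eq. (5): concatenate the heads and project back to dimension D.
    return np.concatenate(heads, axis=-1) @ W_proj     # (N, D)


rng = np.random.default_rng(0)
N, D, H = 14, 8, 2            # illustrative sizes, not the paper's settings
X = rng.standard_normal((N, D))
W_qkv = rng.standard_normal((H, D, 3 * D // H)) / np.sqrt(D)
W_proj = rng.standard_normal((D, D)) / np.sqrt(D)
Y = msa_layer(X, W_qkv, W_proj, num_heads=H)
print(Y.shape)                # (14, 8)
```

The output has the same shape as the input, and each row of the attention map sums to one because the softmax is taken over the key index.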

In terms of computational complexity, the FLOPS of the MSA layer is divided across four steps:

$i)$ the three linear projections in (1) with complexity $3\,N\,D^{2}$ ,

$ii)$ the computation of the attention map $\bm{M}$ in (3) with complexity $N^{2}\,D$ ,

$iii)$ the self-attention operation in (4) with complexity $N^{2}\,D$ ,

$iv)$ the linear projection for the concatenated self-attention outputs in (5) with complexity $N\,D^{2}$ .

Summing the complexity of these steps yields the overall number of FLOPS for an MSA layer as $4\,N\,D^{2}+2\,N^{2}\,D$ . Due to the quadratic dependence of the complexity on the sequence length $N$ , AI researchers have been actively looking for novel and cheaper alternatives without sacrificing the MSA performance. The most promising of the existing competitive methods are SSMs, especially the Mamba layer, which will be described in the next section.
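As a quick sanity check of this count, the four step costs listed above can be added up programmatically. The helper name `msa_flops` and the sizes below are hypothetical and chosen only for illustration.

```python
def msa_flops(N, D):
    """Operation count of one MSA layer, summed over the four steps."""
    qkv_proj = 3 * N * D**2     # step i): three linear projections
    attn_map = N**2 * D         # step ii): attention map Q K^T
    attn_apply = N**2 * D       # step iii): attention applied to V
    out_proj = N * D**2         # step iv): output projection
    return qkv_proj + attn_map + attn_apply + out_proj


# Doubling the sequence length N quadruples only the two N^2 D terms.
for N in (14, 28, 56):
    print(N, msa_flops(N, D=64))
```

For short sequences with $N \ll D$ the projection terms dominate, whereas for long sequences the two terms that are quadratic in $N$ take over.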

II-B State-space model layer

In classical control and filtering theories, the evolution of continuous systems as a function of time $t$ with state $\bm{h}(t)\in\mathbb{R}^{D}$ and input $\bm{x}(t)\in\mathbb{R}^{N}$ is described according to the SSM:


$\displaystyle\bm{h}^{\prime}(t)$	$\displaystyle=\bm{A}\,\bm{h}(t)+\bm{B}\,\bm{x}(t),\qquad\textrm{(state equation)}$	(6a)
$\displaystyle\bm{y}(t)$	$\displaystyle=\bm{C}\,\bm{h}(t)+\bm{D}\,\bm{x}(t).\qquad\textrm{(output equation)}$	(6b)

In (6), the state equation describes how the state $\bm{h}(t)$ evolves (through the matrix $\bm{A}$ ) and how the input $\bm{x}(t)$ influences the state (through the matrix $\bm{B}$ ). The output equation describes how the state $\bm{h}(t)$ is observed in the output $\bm{y}(t)\in\mathbb{R}^{N}$ (through the matrix $\bm{C}$ ) and how the input $\bm{x}(t)$ influences the output (through the matrix $\bm{D}$ ). For sequence models, the input $\bm{x}(t)$ represents the token embedding at position $t$ while $\bm{y}(t)$ denotes the next token embedding. By learning the parameters $\bm{A}$ , $\bm{B}$ , $\bm{C}$ and $\bm{D}$ , the SSM layer captures the evolution parameters of the dynamics from one token to the other. For systems with discrete state and input like textual sequences, the continuous-time SSM in (6) must be discretized using a step size $\bm{\Delta}$ which represents the resolution of the input. In other words, a discrete input $\bm{x}_{t}$ is a sample of the continuous input $\bm{x}(t)$ where $\bm{x}_{t}=\bm{x}(t\,\bm{\Delta})$ . Note that it is common to omit the parameter ${\bm{D}}$ during the discretization because the term ${\bm{D}}\,\bm{x}(t)$ is equivalent to a skip connection, which can be incorporated easily in the SSM layer architecture. Using the bilinear method [25], the discrete-time SSM is given by:


$\displaystyle\bm{h}_{t}$	$\displaystyle=\overline{\bm{A}}\,\bm{h}_{t-1}+\overline{\bm{B}}\,\bm{x}_{t},$	(7a)
$\displaystyle\bm{y}_{t}$	$\displaystyle=\overline{\bm{C}}\,\bm{h}_{t},$	(7b)

where


$\displaystyle\overline{\bm{A}}$	$\displaystyle\triangleq(\bm{I}-\bm{\Delta}/2\cdot\bm{A})^{-1}(\bm{I}+\bm{\Delta}/2\cdot\bm{A}),$
$\displaystyle\overline{\bm{B}}$	$\displaystyle\triangleq(\bm{I}-\bm{\Delta}/2\cdot\bm{A})^{-1}\bm{\Delta}\,\bm{B},$	(8a)
$\displaystyle\overline{\bm{C}}$	$\displaystyle\triangleq\bm{C}.$	(8b)
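The bilinear discretization and the resulting recurrence can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions, not the layer used in this paper: the step size is taken as a scalar `delta`, the input and output are scalar sequences, and the matrices `A`, `B`, `C` are hypothetical values for a stable two-state system.

```python
import numpy as np


def discretize_bilinear(A, B, C, delta):
    """Bilinear discretization of a continuous-time SSM, as in Eq. (8)."""
    I = np.eye(A.shape[0])
    inv = np.linalg.inv(I - delta / 2 * A)
    A_bar = inv @ (I + delta / 2 * A)
    B_bar = inv @ (delta * B)
    return A_bar, B_bar, C      # C is left unchanged


def ssm_recurrence(A_bar, B_bar, C_bar, x):
    """Run Eq. (7) over a scalar input sequence x with a zero initial state."""
    h = np.zeros(A_bar.shape[0])
    y = []
    for x_t in x:
        h = A_bar @ h + B_bar[:, 0] * x_t   # state update, Eq. (7a)
        y.append(C_bar[0] @ h)              # readout, Eq. (7b)
    return np.array(y)


# Toy stable system with a 2-dimensional state (hypothetical values).
A = np.array([[-1.0, 0.0], [0.0, -2.0]])
B = np.array([[1.0], [1.0]])
C = np.array([[1.0, 1.0]])
A_bar, B_bar, C_bar = discretize_bilinear(A, B, C, delta=0.1)
y = ssm_recurrence(A_bar, B_bar, C_bar, x=np.ones(5))
print(np.round(y, 4))
```

For this diagonal example the discretized eigenvalues are $0.95/1.05$ and $0.9/1.1$, both inside the unit circle, so the stability of the continuous-time system carries over to the discrete recurrence.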

The fact that the learnable parameters $\bm{A}$ , $\bm{B}$ , and $\bm{C}$ are constant means that the discrete-time SSM describes a linear time-invariant (LTI) system with strong ties to convolution. Indeed, one can set the initial state $\bm{h}_{0}$ to $\bm{0}$ for simplicity and rewrite (7)–(8) in the convolution representation for $t\in[1,T]$ as follows [26]:

\bm{y}=\overline{\bm{K}}\ast\bm{x},

(9)

where $\overline{\bm{K}}\triangleq\left(\overline{\bm{C}}\,\overline{\bm{B}},\overline{\bm{C}}\,\overline{\bm{A}}\,\overline{\bm{B}},\ldots,\overline{\bm{C}}\,\overline{\bm{A}}^{T-1}\overline{\bm{B}}\right)$ represents the SSM convolution kernel. By avoiding the standard recurrent representation, the convolution representation in (9) offers a compact and efficient parallel computation for SSM layers. However, because $\overline{\bm{K}}$ is a giant filter, the naive implementation of the convolution as in (9) is slow and memory inefficient. To sidestep this limitation, many AI studies proposed restricting the structure of the SSM parameters to specific forms. Triangular $\bm{A}$ matrices keeping track of the Legendre polynomial’s coefficients are computationally efficient and produce a hidden state $\bm{h}_{t}$ that memorizes the input history [27]. Structured state space sequence models (S4) have also been introduced for SSMs where the parameters have a diagonal plus low-rank (DPLR) structure in the complex space [26]. Such a structure offers efficient SSMs with linear-time complexity, as opposed to the quadratic complexity of attention. More recently, Mamba [15] enhanced the S4 model by introducing a selective input mechanism that enables the model to choose relevant information based on the input $\bm{x}_{t}$ . This approach, combined with an implementation that is optimized for hardware, allowed Mamba to outperform transformers on different dense modalities like language and genomics. For input and state sequences $\bm{x}_{t}$ and $\bm{h}_{t}$ of size $N$ and $D$ , the number of FLOPS of the Mamba layer scales linearly in $N$ ; more precisely, it is $\mathcal{O}(N\,D)$ . Another important aspect of the Mamba model is that it is the first SSM to be time-varying: it indirectly updates $\bm{A}$ through $\bm{\Delta}$ and directly updates $\bm{B}$ and $\bm{C}$ over time through its selective scan mechanism.
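For the LTI case, the equivalence between the recurrence in (7) and the convolution in (9) can be checked numerically. The sketch below is a naive illustration with scalar input and output, a zero initial state, and randomly drawn (hypothetical) discrete parameters; it does not reflect the structured or hardware-optimized kernels of S4 or Mamba.

```python
import numpy as np


def ssm_kernel(A_bar, B_bar, C_bar, T):
    """Kernel (C B, C A B, ..., C A^{T-1} B) of Eq. (9), scalar input/output."""
    kernel, A_pow_B = [], B_bar.copy()
    for _ in range(T):
        kernel.append((C_bar @ A_pow_B).item())
        A_pow_B = A_bar @ A_pow_B
    return np.array(kernel)


def ssm_recurrent(A_bar, B_bar, C_bar, x):
    """Reference output from the recurrence in Eq. (7), zero initial state."""
    h = np.zeros((A_bar.shape[0], 1))
    y = []
    for x_t in x:
        h = A_bar @ h + B_bar * x_t
        y.append((C_bar @ h).item())
    return np.array(y)


rng = np.random.default_rng(0)
T, D = 16, 4                                  # illustrative sizes
A_bar = np.diag(rng.uniform(0.1, 0.9, D))     # stable diagonal example
B_bar = rng.standard_normal((D, 1))
C_bar = rng.standard_normal((1, D))
x = rng.standard_normal(T)

K_bar = ssm_kernel(A_bar, B_bar, C_bar, T)
y_conv = np.convolve(x, K_bar)[:T]            # causal convolution, first T outputs
y_rec = ssm_recurrent(A_bar, B_bar, C_bar, x)
print(np.allclose(y_conv, y_rec))             # True
```

The kernel is computed once and then applied to the whole sequence in parallel, whereas the recurrence must visit the samples one at a time. This is the trade-off that the structured parameterizations discussed above aim to make cheap.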

III AI-based OFDM-CSI Prediction

In this section, we describe the proposed CSI prediction mechanism for the next-slot OFDM-CSI prediction. We also describe how the input dimensions of SSM and MSA layers are mapped to the CSI dimensions. We then present the 3GPP channel models and their key parameters which will be later considered in our simulation results in Section IV.

III-A Prediction tasks

In both 4G (a.k.a. LTE) and 5G networks, uplink and downlink transmissions are organized into radio frames of 10 ms each as depicted in Fig. 2. Each frame is divided into ten equally sized subframes. The duration of each subframe is 1 ms. In LTE, each subframe is further divided into two equal-size time slots, and each slot is of duration 0.5 ms. In 5G, however, the slot length changes depending on the used subcarrier spacing (a.k.a., numerology) associated with the operational frequency band and the service requirements.
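The dependence of the slot length on the numerology can be made explicit with a short sketch. It assumes the standard 5G NR relation in which a subcarrier spacing of $15\cdot 2^{\mu}$ kHz places $2^{\mu}$ slots in each 1 ms subframe; the function name is hypothetical.

```python
def nr_slot_duration_ms(scs_khz):
    """Slot duration in ms for a 5G NR subcarrier spacing of 15 * 2**mu kHz."""
    mu = {15: 0, 30: 1, 60: 2, 120: 3, 240: 4}[scs_khz]   # numerology index
    return 1.0 / 2**mu          # a 1 ms subframe holds 2**mu slots


for scs in (15, 30, 60, 120):
    slots_per_frame = int(10 / nr_slot_duration_ms(scs))
    print(f"{scs} kHz: {nr_slot_duration_ms(scs)} ms/slot, {slots_per_frame} slots/frame")
```

With a 30 kHz spacing, for instance, each slot lasts 0.5 ms, so a subframe contains two slots, as in LTE.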

In OFDM systems, the channel is a two-dimensional grid of $N_{s}$ symbols in time and $N_{f}$ sub-carriers in frequency. Specifically, consider a downlink system with $N_{r}$ antennas at the receiver (i.e., UE) and $N_{t}$ antennas at the transmitter (i.e., base station). The UE continuously forecasts the CSI given the previously determined ones. To train and test SSM and MSA layers on this task, we consider the following CSI prediction problem: given the previous slot CSI, the UE predicts the CSI pertaining to the next slot within the same subframe as depicted in Fig. 2. This task covers slot-wise CSI prediction across subframes as well, i.e., between the last slot of subframe $i$ and the first slot of subframe $i+1$ .

Because the input of the tasks depends on the characteristics of the SSM and MSA layers, we associate their dimensions with those of the OFDM grid as follows:

III-A1 State-space model

Given $N_{s_{0}}$ OFDM symbols spanning $N_{f_{0}}$ sub-carriers, the input sequence $\bm{x}_{t}$ consists of $N_{s_{0}}$ symbols in time analogously to the token positions in sentences, while the sub-carrier dimension represents the number of channel coefficients in $\bm{x}_{t}$ . As a result, the obtained input vector $\bm{x}_{t}$ belongs to $\mathbb{R}^{N_{s_{0}}\times 2\,N_{f_{0}}}$ , where the factor $2$ follows from the concatenation of the real and imaginary parts of the OFDM symbols.
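This mapping from a complex CSI grid to a real-valued layer input can be sketched as follows. The ordering of the concatenation (all real parts followed by all imaginary parts along the sub-carrier axis) and the grid size are assumptions made for the example.

```python
import numpy as np


def csi_to_real_input(H_slot):
    """Map a complex (N_s0, N_f0) CSI grid to a real (N_s0, 2*N_f0) array."""
    return np.concatenate([H_slot.real, H_slot.imag], axis=-1)


def real_output_to_csi(Y):
    """Inverse mapping: split the last axis back into real and imaginary parts."""
    re, im = np.split(Y, 2, axis=-1)
    return re + 1j * im


rng = np.random.default_rng(0)
N_s0, N_f0 = 14, 72           # hypothetical grid size, for illustration only
H_slot = rng.standard_normal((N_s0, N_f0)) + 1j * rng.standard_normal((N_s0, N_f0))
X = csi_to_real_input(H_slot)
print(X.shape)                # (14, 144)
```

The inverse mapping recovers the complex grid exactly, so a layer output of the same real-valued shape can be read back as a predicted CSI grid.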

III-A2 Multi-head attention

Similarly to the SSM input, the $N_{s_{0}}$ OFDM symbols over the $N_{f_{0}}$ sub-carriers represent the input sequence of the attention layer. We use two attention heads for the real and imaginary parts of the sequence.

III-B 3GPP channel models

We consider two 5G channel models from the 3GPP specification for frequency bands up to 100 GHz, namely the UMi and UMa channel models [28]. They were derived based on extensive measurement and ray tracing results across a multitude of frequencies from 5 GHz to 100 GHz. We summarize in Table I the key parameters we vary to assess the AI performance in next-slot OFDM-CSI prediction tasks.

Table I: Summary of the 3GPP channel parameters.

Parameter	Values
OFDM channel type	UMi, UMa
User speed [m/s]	$\{0,10,20,30\}$
SNR [dB]	$\{-30,-10,0,10,30\}$
Carrier frequency [GHz]	$\{5,28\}$
Subcarrier spacing [kHz]	$30$

Both UMi and UMa channels are considered without a line-of-sight between the base station and UEs.

IV Numerical Results and Discussions

In this section, we extensively assess the performance of SSM and MSA layers over multiple wireless scenarios. Throughout this section, we denote by $\mathcal{S}=\{-30,-10,0,10,30\}$ in [dB] and $\mathcal{V}=\{0,10,20,30\}$ in [m/s] the sets of possible values for the SNR and user speed, respectively. The code to train and test SSM and MSA layers is available on GitHub at https://github.com/makrout/Next-Slot-OFDM-CSI-Prediction.

Specifically, we consider the following two communication scenarios:

• Uplink SISO transmission between a base station and a single user, both having one antenna. The task on the user side is to forecast the next-slot OFDM-CSI.

• Downlink MIMO transmission between a base station with $n_{\textrm{T}_{\textrm{x}}}$ antennas and $n_{\textrm{U}}$ users, each with a single antenna. The task at the base station side is to forecast the next-slot OFDM-CSI for all users simultaneously.

In all simulations, we consider transmissions using $2$ -QAM constellations. For a fixed configuration of OFDM channel type, carrier frequency, and carrier spacing (cf. Table I), we train MSA and SSM layers to minimize the MSE between the next-slot OFDM-CSI and the predicted one for a given communication scenario determined by the SNR and user speed values in $\mathcal{S}$ and $\mathcal{V}$ , respectively. Users have single antennas with vertical polarization and an omnidirectional antenna pattern. The base station, however, has a uniform linear array with dual polarization, with each antenna element having a 3GPP 38.901 antenna pattern.
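The MSE criterion and the role of the SNR can be illustrated with a small sketch. It is not taken from the released code: how noise enters the training samples in this paper's pipeline is not specified here, so the helper `add_awgn`, which scales complex Gaussian noise to a target SNR, is an assumption, as are the function names and the grid size.

```python
import numpy as np


def add_awgn(H, snr_db, rng):
    """Add complex Gaussian noise so that signal power / noise power = SNR."""
    signal_power = np.mean(np.abs(H) ** 2)
    noise_power = signal_power / 10 ** (snr_db / 10)
    noise = np.sqrt(noise_power / 2) * (
        rng.standard_normal(H.shape) + 1j * rng.standard_normal(H.shape)
    )
    return H + noise


def csi_mse(H_pred, H_true):
    """Mean squared error averaged over all entries of the CSI grid."""
    return np.mean(np.abs(H_pred - H_true) ** 2)


rng = np.random.default_rng(0)
H = rng.standard_normal((200, 200)) + 1j * rng.standard_normal((200, 200))
for snr_db in (-10, 0, 10):
    H_noisy = add_awgn(H, snr_db, rng)
    ratio = np.mean(np.abs(H) ** 2) / csi_mse(H_noisy, H)
    print(snr_db, round(10 * np.log10(ratio), 2))   # close to snr_db
```

Treating the noisy grid as a trivial "prediction" of the clean one gives an MSE equal to the noise power, which provides a reference level against which a trained layer's MSE at a given SNR can be read.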

Then, we test the trained layers on the same speed values for ID evaluation and on different ones for OOD evaluation. We set the number of epochs to 1000 and we report the average MSE performance after 100 iterations. We use the Sionna library [29] to generate the OFDM grids for each training and test communication scenario considered.

As mentioned in Section I-C, we highlight the fact that the goal of our simulation results is not to design state-of-the-art DNN architectures for CSI prediction, but rather to compare the predictive capability of the SSM and MSA layers only.

IV-A SISO experiments

IV-A1 In-distribution evaluation

For a fixed user speed $v_{\textrm{train}}$ , we train separate MSA and SSM layers for each SNR level in $\mathcal{S}$ at $f_{c}=5$ GHz. We also train additional MSA and SSM layers on a communication scenario over the UMi channel with all SNR levels combined by uniformly sampling the SNR over $\mathcal{S}$ , which we refer to as the SNR value “all”. Fig. 3 depicts the OFDM-CSI prediction MSE pertaining to each considered network when it is evaluated over all possible SNR levels for static and highly mobile users, i.e., $v_{\textrm{train}}=v_{\textrm{test}}\in\{0,30\}$ . By comparing Figs. 3(a) and 3(b), it is seen that both SSM and MSA layers exhibit comparable MSE performance for static users only (i.e., $v=0$ ). For mobile users with $v=30$ , SSMs in Fig. 3(d) outperform MSAs in Fig. 3(c), with the MSE being an order of magnitude smaller for SNR values larger than $0$ dB. However, the overall profile of the MSE over the entire SNR range increases for both models when users are mobile.

On the other hand, Figs. 3(a) and 3(b) show that the MSE decreases as a function of the SNR when users are static. This suggests that the task of learning the CSI prediction is more impacted by user mobility than by the SNR. It is also noteworthy that training the MSA layer with samples using all SNR levels (i.e., the gray curve) yields the lowest MSE across all test SNR levels. Interestingly, this SNR-wise diversification of samples does not offer the lowest MSE for SSMs. This can be attributed to the fact that SSMs compress the signal sequence $\bm{x}$ in the state equation given in (7a) by ensuring that $\bm{h}_{t}$ is a fixed-size low-dimensional hidden state compared to $\bm{x}_{t}$ . Such compression toward learning a state-space model, or equivalently a transfer function (any state-space model can be seen as a transfer function in the Laplace domain [30]), is not equally impacted by diversified communication scenarios in the dataset.

We then repeat the same training and evaluation at the mmWave carrier frequency $f_{c}=28$ GHz. Due to space limitations, the simulation results are presented in Appendix A, where the ID and OOD MSE evaluations are reported in Figs. 7 and 8. There, similar MSE trends to those at $f_{c}=5$ GHz are observed. Overall, the MSE is higher due to the significant path loss in mmWave bands. It is also interesting to note that both SSMs and MSAs trained with all SNR values do not yield the lowest MSE performance. Similar MSE profiles reported in Appendix B are also obtained after training and evaluating with UMa channels. The only notable difference is that MSE values are higher for UMa channels compared to UMi channels because macro-cell models cover a wider area with less dense networks.

IV-A2 Out-of-distribution evaluation

Unlike the previous experiment, we now train and test MSAs and SSMs on different user speeds. In Figs. 4(a) and 4(b), we train on mobile user scenarios (i.e., $v_{\textrm{train}}=30$ ) and test on static user scenarios (i.e., $v_{\textrm{test}}=0$ ). We perform the inverse training and test strategy in Figs. 4(c) and 4(d) by training on static users and testing on mobile ones. When comparing the range of the MSE between Figs. 4(a) and 4(b) and Figs. 4(c) and 4(d), it is seen that training on challenging CSI prediction tasks (i.e., when users are mobile) and testing on easier ones (i.e., when users are static) provides a better MSE on OOD scenarios. We also note that networks trained on lower SNR values generalize better than those trained on higher SNR levels. Indeed, it is well known that adding noise to the training samples of a DNN is equivalent to Tikhonov regularization and can lead to significant improvements in generalization performance [24]. Moreover, when testing with static users, SSMs and MSAs exhibit a similar range of MSEs as shown in Figs. 4(a) and 4(b). However, for test scenarios on mobile users, SSMs exhibit a much lower MSE profile compared to MSAs as shown in Figs. 4(c) and 4(d). Similarly to the ID evaluation in Section IV-A1, the MSA model trained on all SNR levels is among the best performers for MSA layers, unlike the SSM ones.
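For intuition on the link between input noise and Tikhonov regularization, consider the simplest possible case of a linear predictor $\hat{y}=\bm{w}^{\top}\bm{x}$ trained on noisy inputs $\bm{x}+\bm{\epsilon}$ with zero-mean noise of covariance $\sigma^{2}\bm{I}$ that is independent of $(\bm{x},y)$ . This linear setting and the symbols $\bm{w}$ , $\bm{\epsilon}$ , and $\sigma^{2}$ are illustrative assumptions, not the networks studied here; the general nonlinear case is treated in [24]. Averaging the squared error over the noise gives:

```latex
\mathbb{E}_{\bm{\epsilon}}\!\left[\big(\bm{w}^{\top}(\bm{x}+\bm{\epsilon})-y\big)^{2}\right]
= \big(\bm{w}^{\top}\bm{x}-y\big)^{2}
+ 2\,\big(\bm{w}^{\top}\bm{x}-y\big)\,\bm{w}^{\top}\,\mathbb{E}[\bm{\epsilon}]
+ \bm{w}^{\top}\,\mathbb{E}\big[\bm{\epsilon}\,\bm{\epsilon}^{\top}\big]\,\bm{w}
= \big(\bm{w}^{\top}\bm{x}-y\big)^{2}+\sigma^{2}\,\|\bm{w}\|_{2}^{2}.
```

The cross term vanishes because the noise has zero mean, so the expected loss equals the noise-free loss plus an $\ell_{2}$ penalty on the weights whose strength grows with the noise variance, i.e., as the training SNR decreases.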

IV-B MIMO experiments

For downlink MIMO CSI prediction, we consider a base station endowed with 20 transmit antennas communicating with 5 users, each with a single antenna.

IV-B1 In-distribution evaluation

Fig. 5 shows the CSI prediction MSE of both SSM and MSA networks when evaluated over all possible SNR levels for static and mobile users, i.e., $v_{\textrm{train}}=v_{\textrm{test}}\in\{0,30\}$ . Unlike the SISO case where the MSE performance of SSM and MSA layers was comparable, it is seen here that MSAs provide a lower MSE for both static and mobile user scenarios. It is interesting to observe again how networks trained with all SNR values do not provide the best performance. Training both models on MIMO scenarios with mobile or static users impacts the ID evaluation in the same way. This is in contrast to the SISO case where training on mobile-user scenarios and testing on static-user scenarios yields better results than the opposite training and testing strategy. This does not mean that the user speed is not a critical parameter, but rather suggests that the trained DNNs did not capture the correlation of fast time-varying channels due to the user mobility in MIMO scenarios. Indeed, the MSE is now two orders of magnitude higher compared to the SISO case, suggesting that next-slot OFDM-CSI prediction is a challenging task for MIMO communication.

IV-B2 Out-of-distribution evaluation

When we compare the MSE of mobile-user training with static-user evaluation against that of static-user training with mobile-user evaluation (i.e., Fig. 6(a) vs. Fig. 6(c) for MSAs, and Fig. 6(b) vs. Fig. 6(d) for SSMs), a negligible variation in the MSE is observed. Given the high MSE values, this does not suggest that these models generalize well over user speeds, but rather confirms the difficulty of predicting the next-slot OFDM-CSI in MIMO scenarios, as already reported for the ID evaluation in Section IV-B1.
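The speed-swapping protocol used in the OOD evaluations can be sketched on a toy problem. The snippet below is only a stand-in under loudly stated assumptions: the channel is a scalar first-order autoregressive process, the "network" is a single least-squares coefficient, and the slot-to-slot correlations assigned to static and mobile users are hypothetical values. None of these corresponds to the 3GPP channel models or to the MSA and SSM networks evaluated in this paper, and the toy is not expected to reproduce the reported trends.

```python
import numpy as np

def complex_normal(num_samples, rng):
    """Unit-variance circularly symmetric complex Gaussian samples."""
    return (rng.standard_normal(num_samples) + 1j * rng.standard_normal(num_samples)) / np.sqrt(2.0)

def simulate_pairs(rho, snr_db, num_samples, rng):
    """Return (noisy current slot, true next slot) pairs for a toy AR(1) channel."""
    h_now = complex_normal(num_samples, rng)                      # unit-power channel
    h_next = rho * h_now + np.sqrt(1.0 - rho ** 2) * complex_normal(num_samples, rng)
    noise_std = 10.0 ** (-snr_db / 20.0)                          # signal power is 1
    y_now = h_now + noise_std * complex_normal(num_samples, rng)
    return y_now, h_next

def fit_linear_predictor(y, h_next):
    """Least-squares coefficient a minimizing sum |h_next - a * y|^2."""
    return np.vdot(y, h_next) / np.vdot(y, y)

def evaluate(a, y, h_next):
    """MSE of the one-coefficient predictor on held-out pairs."""
    return float(np.mean(np.abs(h_next - a * y) ** 2))

rng = np.random.default_rng(0)
speeds = {"static": 1.0, "mobile": 0.9}   # hypothetical slot-to-slot correlations
snr_db = 10.0
results = {}
for train_name, rho_train in speeds.items():
    y_tr, h_tr = simulate_pairs(rho_train, snr_db, 20000, rng)
    a = fit_linear_predictor(y_tr, h_tr)
    for test_name, rho_test in speeds.items():
        y_te, h_te = simulate_pairs(rho_test, snr_db, 20000, rng)
        results[(train_name, test_name)] = evaluate(a, y_te, h_te)
```

The diagonal entries of `results` (same train and test speed) play the role of the ID evaluation, while the off-diagonal entries correspond to the two OOD configurations compared above.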

V Conclusion

The existing applications of generative AI for wireless focus on language processing applications (e.g., prompt generation for compression, semantic communication). In this paper, we investigated the predictive capabilities of two key generative AI layers (i.e., multi-head attention and state space model) for OFDM slot prediction tasks. For these signal processing use cases, we compared the in-distribution and out-of-distribution performance of these two layers and empirically showed that multi-head attention layers outperform state space models for MIMO communication. However, we emphasize that the state space model layer has many advantages over the multi-head attention layer in terms of memory and computational complexity, which are also important factors for training and inference on long CSI inputs. Several avenues for extending this work are worth noting. It is possible to design new hybrid architectures that endow state space models with an attention-like mechanism. One can also extend our benchmark to include more scenarios and models (e.g., antenna models and frequency bands). One can also incorporate more wireless knowledge in the design of generative AI layers (e.g., prediction in the beam space).
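The memory argument above can be illustrated with a minimal NumPy sketch. It assumes a single attention head and a generic time-invariant linear state space recurrence, which are simplifications and not the MSA and SSM layers trained in this paper. The sequence length, feature size, state size, and all weights are hypothetical.

```python
import numpy as np

def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product self-attention over a length-L sequence."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = (q @ k.T) / np.sqrt(q.shape[-1])        # (L, L): grows quadratically in L
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, scores.size

def ssm_scan(x, a, b, c):
    """Discrete linear SSM: s_t = A s_{t-1} + B x_t, y_t = C s_t."""
    state = np.zeros(a.shape[0])
    outputs = []
    for x_t in x:
        state = a @ state + b @ x_t                  # fixed-size state, independent of L
        outputs.append(c @ state)
    return np.stack(outputs), state.size

rng = np.random.default_rng(0)
seq_len, d_model, d_state = 256, 8, 16               # hypothetical sizes
x = rng.standard_normal((seq_len, d_model))

wq, wk, wv = (rng.standard_normal((d_model, d_model)) for _ in range(3))
y_att, num_scores = self_attention(x, wq, wk, wv)

a = 0.9 * np.eye(d_state)                            # stable toy state matrix
b = rng.standard_normal((d_state, d_model))
c = rng.standard_normal((d_model, d_state))
y_ssm, num_state = ssm_scan(x, a, b, c)
```

Doubling `seq_len` quadruples `num_scores`, whereas `num_state` is unchanged; the recurrence visits each time step once, so its cost grows linearly with the sequence length.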

References

[1] J. Yang, H. Jin, R. Tang, X. Han, Q. Feng, H. Jiang, S. Zhong, B. Yin, and X. Hu, “Harnessing the power of llms in practice: A survey on chatgpt and beyond,” ACM Transactions on Knowledge Discovery from Data, vol. 18, no. 6, pp. 1–32, 2024.
[2] L. Bariah, Q. Zhao, H. Zou, Y. Tian, F. Bader, and M. Debbah, “Large generative ai models for telecom: The next big thing?” IEEE Communications Magazine, 2024.
[3] X. Lin, “An overview of the 3gpp study on artificial intelligence for 5g new radio,” arXiv preprint arXiv:2308.05315, 2023.
[4] H. Kim, S. Kim, H. Lee, C. Jang, Y. Choi, and J. Choi, “Massive mimo channel prediction: Kalman filtering vs. machine learning,” IEEE Transactions on Communications, vol. 69, no. 1, pp. 518–528, 2020.
[5] J. B. Andersen, J. Jensen, S. H. Jensen, and F. Frederiksen, “Prediction of future fading based on past measurements,” in Gateway to 21st Century Communications Village. VTC 1999-Fall. IEEE VTS 50th Vehicular Technology Conference (Cat. No. 99CH36324), vol. 1. IEEE, 1999, pp. 151–155.
[6] W. Peng, M. Zou, and T. Jiang, “Channel prediction in time-varying massive mimo environments,” IEEE Access, vol. 5, pp. 23 938–23 946, 2017.
[7] W. Jiang and H. D. Schotten, “Neural network-based fading channel prediction: A comprehensive overview,” IEEE Access, vol. 7, pp. 118 112–118 124, 2019.
[8] Y. Yang, F. Gao, Z. Zhong, B. Ai, and A. Alkhateeb, “Deep transfer learning-based downlink channel prediction for fdd massive mimo systems,” IEEE Transactions on Communications, vol. 68, no. 12, pp. 7485–7497, 2020.
[9] C. Luo, J. Ji, Q. Wang, X. Chen, and P. Li, “Channel state information prediction for 5g wireless communications: A deep learning approach,” IEEE transactions on network science and engineering, vol. 7, no. 1, pp. 227–236, 2018.
[10] J. Wang, Y. Ding, S. Bian, Y. Peng, M. Liu, and G. Gui, “Ul-csi data driven deep learning for predicting dl-csi in cellular fdd systems,” IEEE Access, vol. 7, pp. 96 105–96 112, 2019.
[11] Y. Zhang, J. Wang, J. Sun, B. Adebisi, H. Gacanin, G. Gui, and F. Adachi, “Cv-3dcnn: Complex-valued deep learning for csi prediction in fdd massive mimo systems,” IEEE Wireless Communications Letters, vol. 10, no. 2, pp. 266–270, 2020.
[12] S. Mourya, P. Reddy, S. Amuru, and K. K. Kuchi, “Spectral temporal graph neural network for massive mimo csi prediction,” IEEE Wireless Communications Letters, 2024.
[13] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
[14] H. Jiang, M. Cui, D. W. K. Ng, and L. Dai, “Accurate channel prediction based on transformer: Making mobility negligible,” IEEE Journal on Selected Areas in Communications, vol. 40, no. 9, pp. 2717–2732, 2022.
[15] A. Gu and T. Dao, “Mamba: Linear-time sequence modeling with selective state spaces,” arXiv preprint arXiv:2312.00752, 2023.
[16] S. Jelassi, D. Brandfonbrener, S. M. Kakade, and E. Malach, “Repeat after me: Transformers are better than state space models at copying,” arXiv preprint arXiv:2402.01032, 2024.
[17] E. Nguyen, K. Goel, A. Gu, G. Downs, P. Shah, T. Dao, S. Baccus, and C. Ré, “S4nd: Modeling images and videos as multidimensional signals with state spaces,” Advances in neural information processing systems, vol. 35, pp. 2846–2861, 2022.
[18] M. M. Islam and G. Bertasius, “Long movie clip classification with state-space video models,” in European Conference on Computer Vision. Springer, 2022, pp. 87–104.
[19] C. Wang, O. Tsepa, J. Ma, and B. Wang, “Graph-mamba: Towards long-range graph sequence modeling with selective state spaces,” arXiv preprint arXiv:2402.00789, 2024.
[20] M. Akrout, A. Mezghani, E. Hossain, F. Bellili, and R. W. Heath, “From multilayer perceptron to gpt: A reflection on deep learning research for wireless physical layer,” arXiv preprint arXiv:2307.07359, 2023.
[21] M. Akrout, A. Feriani, F. Bellili, A. Mezghani, and E. Hossain, “Domain generalization in machine learning models for wireless communications: Concepts, state-of-the-art, and open issues,” IEEE Communications Surveys & Tutorials, 2023.
[22] Z. Zhong, L. Zheng, G. Kang, S. Li, and Y. Yang, “Random erasing data augmentation,” in Proceedings of the AAAI conference on artificial intelligence, vol. 34, no. 07, 2020, pp. 13 001–13 008.
[23] R. G. Lopes, D. Yin, B. Poole, J. Gilmer, and E. D. Cubuk, “Improving robustness without sacrificing accuracy with patch gaussian augmentation,” arXiv preprint arXiv:1906.02611, 2019.
[24] C. M. Bishop, “Training with noise is equivalent to tikhonov regularization,” Neural computation, vol. 7, no. 1, pp. 108–116, 1995.
[25] A. Tustin, “A method of analysing the behaviour of linear systems in terms of time series,” Journal of the Institution of Electrical Engineers-Part IIA: Automatic Regulators and Servo Mechanisms, vol. 94, no. 1, pp. 130–142, 1947.
[26] A. Gu, K. Goel, and C. Ré, “Efficiently modeling long sequences with structured state spaces,” arXiv preprint arXiv:2111.00396, 2021.
[27] A. Gu, T. Dao, S. Ermon, A. Rudra, and C. Ré, “Hippo: Recurrent memory with optimal polynomial projections,” Advances in neural information processing systems, vol. 33, pp. 1474–1487, 2020.
[28] 3GPP TR 38.901, “Study on channel model for frequencies from 0.5 to 100 GHz,” 2017.
[29] J. Hoydis, S. Cammerer, F. A. Aoudia, A. Vem, N. Binder, G. Marcus, and A. Keller, “Sionna: An open-source library for next-generation physical layer research,” arXiv preprint arXiv:2203.11854, 2022.
[30] J. R. Leigh, Control theory. IET, 2004, vol. 64.

Appendix A SISO Simulations at $f_{c}=28$ GHz

In this appendix, we present the evaluation of both SSM and MSA layers for the SISO communication scenario described in Section IV-A when the carrier frequency is fixed at $f_{c}=28$ GHz. Figs. 7 and 8 depict the ID and OOD evaluations.