Low-Complexity CSI Feedback for FDD Massive MIMO Systems via Learning to Optimize

Yifan Ma, Graduate Student Member, IEEE, Hengtao He, Member, IEEE, Shenghui Song, Senior Member, IEEE, Jun Zhang, Fellow, IEEE, and Khaled B. Letaief, Fellow, IEEE

The authors are with the Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology, Hong Kong (E-mail: {ymabj, eehthe, eeshsong, eejzhang, eekhaled}@ust.hk).

Abstract

In frequency-division duplex (FDD) massive multiple-input multiple-output (MIMO) systems, the growing number of base station antennas leads to prohibitive feedback overhead for downlink channel state information (CSI). To address this challenge, state-of-the-art (SOTA) fully data-driven deep learning (DL)-based CSI feedback schemes have been proposed. However, the high computational complexity and memory requirements of these methods hinder their practical deployment on resource-constrained devices such as mobile phones. To solve this problem, we propose a model-driven DL-based CSI feedback approach that integrates the wisdom of compressive sensing and learning to optimize (L2O). Specifically, only a linear learnable projection is adopted at the encoder side to compress the CSI matrix, thereby significantly reducing the user-side complexity and memory expenditure. The decoder, in turn, incorporates two specially designed components, i.e., a learnable sparse transformation and an element-wise L2O reconstruction module. The former learns a sparse basis for CSI within the angular domain, which exploits channel sparsity effectively. The latter shares the same long short-term memory (LSTM) network across all elements of the optimization variable, eliminating the retraining cost when the problem scale changes. Simulation results show that the proposed method achieves performance comparable to the SOTA CSI feedback scheme with much lower complexity, and enables multiple-rate feedback.

Index Terms:
6G, CSI feedback, learning to optimize, massive MIMO, model-driven deep learning.

I Introduction

Massive multiple-input multiple-output (MIMO) is regarded as a key enabler for the fifth-generation and beyond wireless communication systems, as it empowers high throughput, simultaneous multiple streams, and ubiquitous coverage for diverse applications [1]. For future sixth-generation (6G) wireless communication networks, extremely large-scale MIMO is considered as a critical technological advancement, where a much larger number of antennas will be deployed at the base station (BS) [2, 3]. However, such large-scale MIMO systems pose significant challenges in the physical layer algorithm design. For example, in frequency-division duplexing (FDD) massive MIMO systems, accurate downlink channel state information (CSI) needs to be fed from users back to the BS for high-quality downlink beamforming. Unfortunately, the dimension of the CSI escalates substantially with the number of antennas at the BS, resulting in a prohibitive feedback overhead if the full CSI matrix is directly sent back. The conventional compressive sensing (CS)-based methods, widely applied for CSI compression and recovery [4, 5], suffer from noteworthy limitations, such as the impractical assumption of channel sparsity, the limited ability to exploit the channel structures, and the high computational cost of the iterative operations [6]. Therefore, innovative technologies are imperative to solve the high-dimensional nonlinear CSI feedback problem.

With the success of artificial intelligence (AI) in various fields, its integration with wireless communication has recently attracted significant interest [7]. One key application of deep learning (DL) in the physical layer is DL-based CSI feedback [8], which leverages auto-encoder and decoder structures to compress and reconstruct the downlink CSI. These fully data-driven CSI feedback methods outperform traditional algorithms [9, 10, 11], and have thus attracted widespread attention from both academia and industry. The authors of [9] proposed a convolutional neural network (CNN)-based scheme, named CsiNet, which outperforms the CS-based algorithms, especially at low compression ratios. Several subsequent studies, including ConvCsiNet [10] and TransNet [11], aimed to further improve the feedback accuracy using deeper CNNs and attention mechanisms, respectively. However, the performance improvement is achieved at the cost of computational complexity. For example, the number of floating point operations (FLOPs) of TransNet is almost 11 times more than that of CsiNet. Although TransNet achieves the state-of-the-art (SOTA) performance for CSI feedback, its heavy computational complexity and memory cost at the encoder side hinder practical deployment on resource-constrained devices, such as mobile phones, internet-of-things (IoT) devices, and embedded systems [12].

Existing auto-encoder and decoder-based CSI feedback schemes are completely data-driven, and thus ignore the physical characteristics of the wireless channel in the encoding and decoding process. This typically leads to a large number of learnable parameters and tricky training schemes, and can also degrade performance in the absence of explicit physical guidance [13, 14]. To solve these issues, another line of research combines communication domain knowledge with DL, where deep unfolding is considered one of the representative solutions [15, 16, 17, 18]. Deep unfolding relates iterative optimization methods to deep neural networks. It treats each iteration of the optimization algorithm as one layer of the neural network, where a number of trainable parameters are introduced to be learned by DL techniques. In deep unfolding-based CSI feedback approaches, the CS processing pipeline is preserved, i.e., a small number of codewords (observations) are obtained through linear projection and nonlinear learnable mappings are adopted to recover the CSI. For example, the authors of [19] proposed a sparse autoencoder to learn the sparse transformations in each iteration of the iterative shrinkage-thresholding algorithm (ISTA). In [20], instead of using the $l_1$-norm as the regularization term, a learnable regularization module is introduced in ISTA to automatically adapt to the characteristics of CSI. These proposals adopt a single linear projection at the encoder side, making them applicable to resource-constrained devices in practice. However, traditional deep unfolding methods are built by truncating an iterative algorithm into a finite and fixed number of layers, which makes it difficult to scale to variable numbers of iterations and hard to ensure convergence [18]. Additionally, the direct parameterization requires the dimensions of the learnable parameters to match the problem scale, so that the model, once trained, is not applicable to optimization problems of varying scales during inference [21]. For massive MIMO CSI feedback, the compression ratio has to be adjusted according to the dynamic environment and varying coherence time [22]. Therefore, it is crucial to develop a DL-based CSI feedback scheme that guarantees convergence and is able to generalize to different compression ratios.

To address these challenges, in this paper, we propose a model-driven DL method for CSI feedback. Inspired by the recent success of utilizing AI, especially DL, for solving mathematical problems, we propose a Learning to Optimize (L2O)-based approach that combines the wisdom of both the CS algorithm and DL. Using L2O models to solve optimization problems involves the design of a learnable update rule [23, 24, 21], so that the optimization algorithm is learned autonomously from data. While L2O strategies can achieve faster convergence and better performance than conventional non-learning optimization algorithms [24, 21], directly implementing them for CSI feedback still meets obstacles. Specifically, the reconstruction performance is highly dependent on the sparsity of the data in a specific transform domain. However, the wireless channel is not exactly sparse in some domains. Without an effective transformation and a sufficient sparsity level, the L2O method suffers severe performance degradation. Although traditional manually designed transformations, e.g., the discrete Fourier transform (DFT) and the wavelet transformation, can be utilized, they require a large number of iterations at the decoder, resulting in high computational complexity and restricting their practical applications. Therefore, the L2O-based CSI feedback approach requires a special design.

I-A Contributions

To deal with the imbalanced computational capability between the mobile equipment and the BS, and to reduce the retraining cost when the problem scale changes, we propose an L2O-based CSI feedback scheme, i.e., Csi-L2O, in this paper. It enjoys ultra-low complexity at the encoder side, performance comparable to the SOTA, and adaptability to multiple feedback rates without retraining the neural network. The major contributions are summarized as follows:

  • Low Complexity: The overall framework integrates the wisdom of CS and DL. Inspired by CS, the codeword is obtained through a linear projection at the user side and full CSI is recovered via a parameterized update rule at the BS side. Different from the auto-encoder and decoder structures that adopt convolutional kernels, fully connected layers, or attention mechanism, the linear projection encoding module inherently enjoys ultra low-complexity, which is more suitable for practical wireless communication systems.

  • Comparable Performance: To maintain performance, we propose a data-driven channel sparse transformation and an L2O module at the decoder side. In contrast to manually designed sparse transformations, we propose to learn the sparse transformation in the angular domain, resulting in a more efficient sparse representation for CSI. The subsequent L2O module captures dynamics among different layers and learns the optimization update rule automatically from data, ensuring good reconstruction accuracy.

  • No Retraining Cost: To make the proposed Csi-L2O generalizable to different compression ratios, we adopt an “element-wise” long short-term memory (LSTM) network to generate the optimization parameters at the decoder. In particular, the same neural network is shared across all elements of the optimization variables, so that the proposed single model can be applied to optimization problems of any scale without retraining, which enables multiple-rate feedback.

  • Simulations: Extensive simulations demonstrate that the performance of the proposed L2O-based method is close to the existing SOTA, i.e., TransNet, while enjoying significantly improved computational efficiency compared with the fully data-driven methods. In particular, the proposed L2O method achieves 3.88 dB higher reconstruction accuracy than the SOTA, TransNet, in an indoor scenario with a compression ratio of $1/16$. In addition, the encoder-side FLOPs of the proposed method are only $0.15\%$ of those of the SOTA, making deployment on resource-constrained devices practical.

I-B Organization and Notations

The paper is organized as follows. Section II introduces the system model and existing approaches. In Section III, the key design properties and the proposed Csi-L2O architecture are presented. Then, we perform the convergence analysis and computational complexity analysis in Section IV. Simulation results are presented in Section V and conclusions are drawn in Section VI.

In this paper, $x$ is a scalar, $\mathbf{x}$ is a vector, and $\mathbf{X}$ denotes a matrix. Let $\mathbf{X}^T$ and $\mathbf{X}^H$ denote the transpose and conjugate transpose of matrix $\mathbf{X}$, respectively. $\mathbf{I}$ stands for an identity matrix, $\mathbf{1}$ represents the all-ones vector, and $\mathbf{0}$ denotes the zero vector. $\|\mathbf{X}\|_2$ and $\mathbf{X}^{-1}$ denote the Frobenius norm and the inverse of matrix $\mathbf{X}$, respectively. $\mathbb{E}\{\cdot\}$ denotes the statistical expectation. $f_{\theta}$ denotes a mapping parameterized by learnable parameters $\theta$. Function $\operatorname{sign}(\cdot)$ represents the element-wise sign function. Function $\max(\mathbf{x}, \mathbf{y})$ returns the element-wise maximum of vectors $\mathbf{x}$ and $\mathbf{y}$. $\mathbf{X} = \text{diag}(\mathbf{x})$ defines $\mathbf{X}$ as a diagonal matrix with $\mathbf{x}$ as its diagonal. $\mathbb{C}^{m \times n}$ is the set of all $m \times n$ complex-valued matrices. The Hadamard product is denoted by $\odot$.

II System Model and Existing Approaches

In this section, we first formulate the CSI feedback problem. Then, existing DL-based CSI feedback schemes are introduced, which motivate the proposed method.

II-A System Model

Figure 1: An illustration of the considered communication system and CSI feedback problem.

As illustrated in Fig. 1(a), we consider a single-cell FDD massive MIMO system where the BS is equipped with $N_t$ antennas and the user is equipped with a single antenna. For ease of illustration, a single-user case is considered, while the proposed scheme can be easily generalized to the multi-user scenario. An orthogonal frequency division multiplexing (OFDM) system with $N_c$ subcarriers is considered. The received signal on the $n$-th subcarrier is expressed as

$$y_n = \mathbf{h}_n^H \mathbf{v}_n x_n + z_n, \tag{1}$$

where $\mathbf{h}_n \in \mathbb{C}^{N_t \times 1}$, $\mathbf{v}_n \in \mathbb{C}^{N_t \times 1}$, $x_n \in \mathbb{C}$, and $z_n \in \mathbb{C}$ denote the downlink channel vector, the downlink beamforming vector, the transmit symbol, and the additive noise of the $n$-th subcarrier, respectively. The downlink beamforming requires the BS to know the downlink CSI, denoted by $\mathbf{H} = [\mathbf{h}_1, \cdots, \mathbf{h}_{N_c}]^H \in \mathbb{C}^{N_c \times N_t}$. In this paper, we assume that the downlink channel is perfectly known at the user side via pilot-based training and focus on the efficient feedback design [9, 10, 11].

The channel matrix $\mathbf{H}$ contains $2N_cN_t$ real elements. As $N_c$ and $N_t$ are large in FDD massive MIMO systems, directly feeding back $\mathbf{H}$ will result in prohibitive feedback overhead. To tackle this issue, we first sparsify $\mathbf{H}$ in the angular-delay domain using a 2D discrete Fourier transform (2D-DFT) [9] as follows

$$\mathbf{H}' = \mathbf{F}_{\mathrm{d}} \mathbf{H} \mathbf{F}_{\mathrm{a}}, \tag{2}$$

where $\mathbf{F}_{\mathrm{d}} \in \mathbb{C}^{N_c \times N_c}$ and $\mathbf{F}_{\mathrm{a}} \in \mathbb{C}^{N_t \times N_t}$ are two DFT matrices. Only the first $N_a$ rows of $\mathbf{H}'$ contain significant values, and the other elements are close to zero because the time delays between multipath arrivals are within a limited period [9]. Therefore, we take the first $N_a$ rows of $\mathbf{H}'$ ($N_a < N_c$) and define a new matrix $\mathbf{H}'' \in \mathbb{C}^{N_a \times N_t}$. By doing this, we can compress $\mathbf{H}''$, which has only $2N_aN_t$ real elements, instead of $\mathbf{H}$, with negligible information loss.
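The transform in (2) and the row truncation can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions: the unitary normalisation of the DFT matrices, the toy dimensions, and the synthetic channel (built so that its energy lies exactly in the first $N_a$ delay rows) are all choices made here and are not taken from the paper. Real channels are only approximately sparse.

```python
import numpy as np

def dft_matrix(n):
    """Unitary n x n DFT matrix (the 1/sqrt(n) scaling is an assumption of this sketch)."""
    return np.fft.fft(np.eye(n)) / np.sqrt(n)

def to_truncated_angular_delay(H, n_a):
    """Compute H' = F_d H F_a as in Eq. (2) and keep the first n_a rows as H''."""
    n_c, n_t = H.shape
    H_prime = dft_matrix(n_c) @ H @ dft_matrix(n_t)
    return H_prime[:n_a, :], H_prime

# Toy channel with hypothetical sizes: start from an angular-delay matrix G whose
# energy sits in the first n_a rows, then map it back to the spatial-frequency domain.
rng = np.random.default_rng(0)
n_c, n_t, n_a = 64, 8, 16
G = np.zeros((n_c, n_t), dtype=complex)
G[:n_a] = rng.standard_normal((n_a, n_t)) + 1j * rng.standard_normal((n_a, n_t))
H = dft_matrix(n_c).conj().T @ G @ dft_matrix(n_t).conj().T

H_trunc, H_prime = to_truncated_angular_delay(H, n_a)
# Fraction of the channel energy that survives the truncation.
retained = np.linalg.norm(H_trunc) ** 2 / np.linalg.norm(H_prime) ** 2
print(H_trunc.shape, float(retained))
```

For a measured or simulated channel, the `retained` ratio gives a quick check of how much energy is discarded for a given choice of $N_a$.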

DL-based methods have been applied to CSI feedback [9, 10, 11]. As demonstrated in Fig. 1(b), the encoding process at the user side is given by

$$\mathbf{s} = \mathcal{E}_{\theta_{\mathrm{e}}}(\mathbf{H}''), \tag{3}$$

which further compresses the channel matrix $\mathbf{H}''$ into an $M \times 1$ codeword $\mathbf{s}$. The parameterized mapping $\mathcal{E}_{\theta_{\mathrm{e}}}(\cdot)$ denotes the compression procedure and $\theta_{\mathrm{e}}$ denotes the trainable parameters in the encoder. The compression ratio is defined as $M/(2N_aN_t)$. We use the same setting as [9, 10, 11] and assume $\mathbf{s}$ is sent to the BS via error-free transmission. After receiving the codeword, the BS reconstructs the channel matrix through a decoder, expressed as

$$\hat{\mathbf{H}}'' = \mathcal{D}_{\theta_{\mathrm{d}}}(\mathbf{s}), \tag{4}$$

where $\mathcal{D}_{\theta_{\mathrm{d}}}(\cdot)$ denotes the recovery procedure and $\theta_{\mathrm{d}}$ represents the trainable parameters at the decoder. The objective of the CSI feedback is to minimize the mean-squared-error (MSE) between the recovered channel and the true channel, given by

$$\min_{\theta_{\mathrm{e}}, \theta_{\mathrm{d}}} \quad \mathbb{E}\left\{ \left\| \mathbf{H}'' - \mathcal{D}_{\theta_{\mathrm{d}}}\left(\mathcal{E}_{\theta_{\mathrm{e}}}(\mathbf{H}'')\right) \right\|_2^2 \right\}. \tag{5}$$

Many existing works aim to solve Problem (5) and the most representative approaches are fully data-driven methods and deep unfolding.

II-B Existing Approaches

II-B1 Fully Data-Driven DL-based Methods

In order to solve Problem (5), fully data-driven DL-based methods [9, 10, 11] have been developed. The mappings $\mathcal{E}_{\theta_{\mathrm{e}}}(\cdot)$ and $\mathcal{D}_{\theta_{\mathrm{d}}}(\cdot)$ can be instantiated as a DL-based encoder and decoder, and jointly trained via end-to-end learning [9, 10, 11]. Fully data-driven DL-based approaches obtain better performance than traditional CS-based methods, especially at low compression ratios. This is because of the powerful representation ability and universal approximation property of neural networks. However, most of the existing works improve the reconstruction accuracy at the cost of higher neural network complexity, e.g., larger kernels, deeper neural networks, or complicated attention mechanisms, which is not affordable for resource-constrained devices, e.g., mobile phones. For example, assume that the compression ratio is 1/16 and the CSI feedback and recovery period is 1 millisecond. The computational load required by the TransNet encoder is then about 17.07 G floating point operations per second (FLOPS). Note that Kirin 659, a mid-range mobile system on chip (SoC), has a total peak computation capability of 57.6 G FLOPS [25]. If TransNet is deployed in practice, around $30\%$ of the mobile device's computational power is used for CSI feedback, which is unacceptable. Although TransNet achieves SOTA performance, its extensive computational demands and memory requirements hinder its practical deployment.
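The roughly 30% share follows from simple arithmetic on the figures quoted above, which the short Python check below reproduces. The per-feedback operation count is derived here from the stated 1 ms period; it is not a number reported in the text.

```python
# Figures quoted in the paragraph above.
encoder_gflops_per_second = 17.07   # TransNet encoder load at compression ratio 1/16
feedback_period_s = 1e-3            # one CSI feedback every millisecond
soc_peak_gflops = 57.6              # Kirin 659 peak capability

# Operations needed for a single encoding pass (derived, not reported in the text).
flops_per_feedback = encoder_gflops_per_second * 1e9 * feedback_period_s
# Fraction of the SoC peak throughput consumed by CSI feedback.
share_of_soc = encoder_gflops_per_second / soc_peak_gflops

print(f"{flops_per_feedback / 1e6:.2f} MFLOPs per feedback")
print(f"{share_of_soc:.1%} of peak SoC throughput")
```

The share is computed against peak throughput, so the fraction of sustained, practically achievable throughput would be higher.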

Figure 2: (a) The overall architecture of the proposed Csi-L2O. (b) The structure of the proposed element-wise L2O mechanism.

II-B2 Deep Unfolding

By taking the physical meaning of the encoding and decoding process into consideration, deep unfolding methods were proposed for CSI feedback [19, 20, 26]. Classic CS theory shows that when a signal exhibits a certain sparsity in a specific transform domain, we can obtain a small number of codewords (observations) through linear projection and use a nonlinear recovery mapping to get an accurate estimate of the original signal [13]. By incorporating this CS knowledge, deep unfolding-based methods implement a linear learnable encoding process to reduce the signal dimension at the encoder side. The projected codeword can be expressed as

$$\mathbf{s} = \mathbf{W} \mathbf{h}_{\text{vec}}, \tag{6}$$

where $\mathbf{W}$ is the sampling matrix and $\mathbf{h}_{\text{vec}} \in \mathbb{R}^{2N_aN_t \times 1}$ is the vectorization of the channel matrix $\mathbf{H}''$ with its real and imaginary parts stacked. The decoding process at the BS can be regarded as solving an inverse problem. The dimensionality reduction in (6) makes the signal recovery notably ill-posed. A regularization term is typically added to the objective function to make use of known prior information about the optimal solution, which is expressed as

$$\min_{\mathbf{x}} \frac{1}{2} \|\mathbf{s} - \mathbf{W}\mathbf{x}\|_2^2 + R(\mathbf{x}), \tag{7}$$

where $R(\mathbf{x})$ is the regularization term. Typically, the $l_1$-norm is utilized as a regularizer, i.e., $R(\mathbf{x}) = \lambda \|\Psi \mathbf{x}\|_1$, where $\Psi$ is a certain sparse transformation. Problem (7) is then written as

$$\min_{\mathbf{x}} \frac{1}{2} \|\mathbf{s} - \mathbf{W}\mathbf{x}\|_2^2 + \lambda \|\Psi \mathbf{x}\|_1. \tag{8}$$

The iterative shrinkage-thresholding algorithm (ISTA) is a classic iterative method to solve Problem (8), and the follow-up model-driven DL methods for CSI feedback are inspired by ISTA-based algorithms. At the $t$-th step of ISTA, the iterative process is expressed as

$$\begin{aligned} \mathbf{u}^{[t]} &= \mathbf{x}^{[t-1]} - \alpha \mathbf{W}^T \left( \mathbf{W} \mathbf{x}^{[t-1]} - \mathbf{s} \right), \\ \mathbf{x}^{[t]} &= \operatorname{sign}\left( \Psi \mathbf{u}^{[t]} \right) \max\left( \mathbf{0}, \left| \Psi \mathbf{u}^{[t]} \right| - \theta \right), \end{aligned} \tag{9}$$

where $\mathbf{u}^{[t]}$, $\alpha$, and $\theta$ are the intermediate variable, step size, and thresholding parameter, respectively. In [27], a model-driven DL method, ISTA-Net, is proposed. It is designed to learn optimal parameters, i.e., thresholds, step sizes, and nonlinear transforms, without hand-crafted settings, in an end-to-end manner. The ISTA-Net method adopts a CNN to approximate the nonlinear sparse transformation and improves the recovery performance compared to conventional CS algorithms. As a deep unfolding method for CSI feedback, TiLISTA [19] utilizes a sparse auto-encoder to learn the sparse transformation in the spatial domain. Nevertheless, because the ISTA algorithm is truncated into a finite and fixed number of layers for both the training and inference stages, ISTA-Net and TiLISTA struggle to accommodate a variable number of iterations and face challenges in guaranteeing convergence upon implementation. These problems motivate us to propose a new model-driven DL-based network for CSI feedback with a provable convergence guarantee.
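To make the two-step update in (9) concrete, the following NumPy sketch runs plain, non-learned ISTA on a small synthetic sparse-recovery problem. It is an illustration only and rests on assumptions that are not from the paper: $\Psi$ is taken to be the identity, the step size is set to $\alpha = 1/\|\mathbf{W}\|_2^2$, the threshold is tied to it as $\theta = \alpha\lambda$, and the random $\mathbf{W}$ and test signal are hypothetical.

```python
import numpy as np

def soft_threshold(v, theta):
    """Element-wise sign(v) * max(0, |v| - theta), the shrinkage step of Eq. (9)."""
    return np.sign(v) * np.maximum(0.0, np.abs(v) - theta)

def ista(s, W, lam, num_iters=500):
    """Plain ISTA for Problem (8) with Psi = I (an assumption of this sketch)."""
    alpha = 1.0 / np.linalg.norm(W, 2) ** 2   # step size from the spectral norm of W
    theta = alpha * lam                        # threshold tied to the step size
    x = np.zeros(W.shape[1])
    for _ in range(num_iters):
        u = x - alpha * W.T @ (W @ x - s)      # gradient step, first line of Eq. (9)
        x = soft_threshold(u, theta)           # shrinkage step, second line of Eq. (9)
    return x

def objective(x, s, W, lam):
    """Value of the l1-regularized least-squares objective in Problem (8), Psi = I."""
    return 0.5 * np.sum((s - W @ x) ** 2) + lam * np.sum(np.abs(x))

# Hypothetical toy problem: a 5-sparse signal observed through a random projection.
rng = np.random.default_rng(0)
n, m, k = 128, 48, 5
x_true = np.zeros(n)
x_true[rng.choice(n, k, replace=False)] = rng.standard_normal(k)
W = rng.standard_normal((m, n)) / np.sqrt(m)
s = W @ x_true

x_hat = ista(s, W, lam=0.01)
print(objective(np.zeros(n), s, W, 0.01), objective(x_hat, s, W, 0.01))
```

Here `alpha`, `theta`, and `num_iters` are fixed by hand. These are the quantities that deep unfolding methods make learnable, at the price of fixing the number of layers in advance.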

III Proposed Csi-L2O Method

In this section, we propose a new model-driven DL approach, Csi-L2O, which embraces the wisdom of wireless domain knowledge and AI for CSI feedback in FDD massive MIMO systems. We introduce, in turn, the general architecture of the proposed Csi-L2O framework, the learnable linear projection at the encoder side, the angular domain sparse transformation at the decoder, and the element-wise L2O decoding module.

Figure 3: (a) The visualization of indoor and outdoor channels in the angular-delay domain generated by the COST2100 model when $N_a = N_t = 32$. (b) The neural network structure of the proposed angular domain sparse transformation.

III-A Architecture of Csi-L2O

In alignment with the CSI compression and feedback procedure, the proposed Csi-L2O architecture consists of two modules: a compression module and a reconstruction module. The overall architecture of the proposed Csi-L2O is shown in Fig. 2(a). Following the insight of CS, the encoding side is a linear projection and the decoding side is an iterative recovery algorithm. This fits the practical requirement of the CSI feedback problem, i.e., the encoder is typically resource-constrained while the decoder enjoys powerful computational capability. At the encoder side, according to Eqn. (6), we employ a learnable projection to compress the CSI, where the sampling matrix $\mathbf{W}$ is made learnable and instantiated as a linear layer. Therefore, the encoder is a lightweight and memory-efficient encoding module. At the decoding side, the proposal is enhanced with two specially designed components: a learnable sparse transformation and an element-wise L2O mechanism. The learnable sparse transformation is designed to identify a sparse representation of CSI in the angular domain, which capitalizes on the inherent sparsity of the channel and consequently improves reconstruction precision. Furthermore, different from traditional deep unfolding methods that unroll an existing CS algorithm, we adopt an L2O framework that autonomously learns an optimization algorithm from data. The optimization parameters of the $t$-th iteration, e.g., the preconditioner $\mathbf{p}^{[t]}$, thresholding parameter $\alpha^{[t]}$, and accelerator $\mathbf{a}^{[t]}$, are colored red in Fig. 2(a). These parameters are learned using the element-wise L2O module, which is elaborated in Section III-D. Different from existing fully data-driven DL-based methods that treat the CSI matrix as a 2D image, the proposed CSI feedback scheme preserves the CS processing pipeline and takes the physical meaning of the wireless channel and sparse recovery into consideration.

III-B Encoder: Learnable Linear Projection

As shown in (6), traditional CS infers the original signal $\mathbf{h}_{\text{vec}}$ from the randomized CS measurements $\mathbf{s}$, where $\mathbf{W}$ is a linear random projection matrix. The design of the sampling matrix $\mathbf{W}$ plays a crucial role in preserving the essential elements of the original signal. Researchers have devoted considerable effort to developing optimal sampling matrices that retain as much information from the original signals as possible [28]. Three types of sampling matrices have been proposed in the CS context: random, deterministic, and partially orthogonal sampling matrices [29].

In this paper, by capitalizing on the powerful representation ability of DL, we make the matrix $\mathbf{W}$ learnable. The sampling process at the encoder is efficiently implemented as a single linear layer: $\mathbf{W}$ is the learnable weight of a fully-connected layer without bias. The sampling matrix can thus be trained end-to-end with the decoding module, enabling good reconstruction accuracy and low encoder-side complexity. Different from conventional fully data-driven methods, which typically adopt convolutional kernels, fully connected layers, or attention mechanisms, our encoder design requires lower computational and memory cost, and is thus more suitable for practical resource-constrained devices.
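As a minimal sketch of the encoder described above, the compression step can be written as a single matrix-vector product. The sketch assumes NumPy, a randomly initialized matrix standing in for the trained $\mathbf{W}$, and a hypothetical codeword length `M`; none of these values are taken from the paper.

```python
import numpy as np

# Illustrative dimensions: N_a = N_t = 32 as in Fig. 3; M is a hypothetical codeword length.
N_a, N_t, M = 32, 32, 128
n = 2 * N_a * N_t  # real and imaginary parts stacked into one real vector

rng = np.random.default_rng(0)
W = rng.standard_normal((M, n)) / np.sqrt(n)  # stand-in for the learned sampling matrix


def encode(W: np.ndarray, h_vec: np.ndarray) -> np.ndarray:
    """Linear projection s = W h_vec (a fully-connected layer without bias)."""
    return W @ h_vec


h_vec = rng.standard_normal(n)  # placeholder for a vectorized CSI sample
s = encode(W, h_vec)
print(s.shape, n // M)  # codeword shape and compression factor
```

Under these assumptions the user side stores only the `M * n` entries of `W` and performs one matrix-vector product per CSI sample, with no nonlinearity or bias.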

III-C Decoder: Angular Domain Sparse Transformation

At the decoder side, after receiving the codeword $\mathbf{s}$, the channel reconstruction problem is formulated as

$$\min_{\mathbf{x}} \frac{1}{2}\|\mathbf{s}-\mathbf{W}\mathbf{x}\|_2^2 + \lambda\|f_t(\mathbf{x})\|_1, \tag{10}$$

where $f_t(\cdot)$ denotes the sparse transformation and $\lambda$ is the regularization parameter. While sparse transformations are widely used in signal compression methods, identifying a transformation basis that can sufficiently sparsify CSI remains a challenging task.

III-C1 Channel Sparsity Observations

Since wireless channels are typically non-stationary, traditional fixed transform domains, e.g., the DFT or wavelet transformation, usually result in poor reconstruction performance. In practice, the spatial angles are continuous rather than discrete, so the channel coefficients after the DFT transformation are still insufficiently sparse [19]. To illustrate this, we plot gray-scale channel visualizations in the angular-delay domain in Fig. 3(a). We can observe from Fig. 3(a) that, due to the multipath effect, there is a high level of sparsity in the delay domain, i.e., only a few elements in each column of the channel matrix $\mathbf{H}''$ contain significant values. However, in the angular domain (each row of the channel matrix), the sparsity level is still unsatisfactory. This reveals that the signals after the DFT transformation are still not strictly sparse when the number of antennas is not large enough. Besides, due to the complicated outdoor propagation environment, the sparsity level of the outdoor scenario is lower than that of the indoor scenario. In this paper, considering these characteristics of wireless channels, we design a learnable angular domain sparse transformation for CSI feedback.
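The effect of continuous spatial angles on DFT sparsity can be illustrated with a toy example. The sketch below is not based on the COST2100 data: it assumes a single propagation path and an idealized uniform linear array, and the function name `significant_bins` and the 1% threshold are choices made here for illustration only.

```python
import numpy as np


def significant_bins(angle_index: float, n_ant: int = 32, rel_tol: float = 0.01) -> int:
    """Count DFT bins whose magnitude exceeds rel_tol times the largest bin."""
    n = np.arange(n_ant)
    # Single-path steering vector; angle_index is the spatial frequency in DFT-bin units.
    a = np.exp(2j * np.pi * angle_index * n / n_ant)
    spectrum = np.abs(np.fft.fft(a))
    return int(np.sum(spectrum > rel_tol * spectrum.max()))


on_grid = significant_bins(5.0)   # angle coincides with a DFT bin
off_grid = significant_bins(5.5)  # angle falls halfway between two bins
print(on_grid, off_grid)
```

In this idealized setting an on-grid angle occupies a single DFT bin, whereas the off-grid angle leaks energy into every bin, which is the kind of residual non-sparsity that motivates a learned transformation.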

III-C2 Neural Network Design

The details of the sparse transformation and the inverse transformation are shown in Fig. 3(b). To enhance the sparsity level in the angular domain, each row of the channel matrix $\mathbf{H}''$ is selected and fed into the neural network individually. We employ an MLP with three fully-connected layers as $f_t(\cdot)$, and $N_i$ denotes the output dimension of $f_t(\cdot)$. In order to obtain strictly sparse signals, only the largest $G$ values of the output of $f_t(\cdot)$ are retained and all the other values are set to zero, which is referred to as the top-$G$ activation. In this way, the proposed learning-based sparse transformation maps angular domain channels into another domain with sparse features. The inverse transformation $f_i(\cdot)$ has the reverse structure of $f_t(\cdot)$ and maps the channels in the learned sparse domain back to the angular domain. The rows of $\mathbf{H}''$ are processed in parallel. After obtaining the output of $f_i(\cdot)$, the estimated channel matrix is constructed by stacking the rows into a whole matrix.
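The top-$G$ activation can be sketched in a few lines. The text does not state whether "largest" refers to signed value or magnitude; the sketch below assumes magnitude, and the helper name `top_g` is hypothetical.

```python
import numpy as np


def top_g(v: np.ndarray, G: int) -> np.ndarray:
    """Keep the G largest-magnitude entries of v and zero out the rest."""
    out = np.zeros_like(v)
    if G <= 0:
        return out
    keep = np.argsort(np.abs(v))[-G:]  # indices of the G largest magnitudes
    out[keep] = v[keep]
    return out


row = np.array([0.1, -3.0, 2.0, 0.5, -0.2])
print(top_g(row, 2))
```

Applied to the output of $f_t(\cdot)$ for one row, this yields a vector with at most $G$ nonzero entries, which is what makes the transformed signal strictly sparse.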

The proposed $f_t(\cdot)$ and $f_i(\cdot)$ guarantee sparsity in the transformed signals and strive to ensure that the signals, when inversely transformed, closely resemble the original ones. They are trained end-to-end with the other learning components, and the training loss is

$$\text{Loss} = \frac{1}{D}\sum_{i=1}^{D}\Big\{\big\|\mathbf{H}''_i - \mathcal{D}_{\theta_{\mathrm{d}}}\big(\mathcal{E}_{\theta_{\mathrm{e}}}(\mathbf{H}''_i)\big)\big\|_2^2 + \beta\big\|\mathbf{H}''_i - f_i\big(f_t(\mathbf{H}''_i)\big)\big\|_2^2\Big\}, \tag{11}$$

where $\mathbf{H}''_i$ denotes the $i$-th channel matrix in the training dataset, $D$ denotes the total number of training samples, and $\beta$ denotes the weight that balances the channel recovery MSE against the sparse transformation MSE. The proposed sparse transformation overcomes the shortcomings of manually designed transformations for wireless channels by seeking a sparse basis specifically within the angular domain of the CSI matrix. Moreover, the sparse transformation and inverse transformation learned from a large amount of CSI training data are more consistent with the data of the specific channel model [19]. Therefore, the learnable sparse transformation can obtain a more effective sparse representation of CSI, which improves the reconstruction accuracy of the proposed network.
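The loss in (11) can be evaluated as follows once the two reconstructions are available. This is only a sketch: it assumes real-valued arrays, interprets the squared norm of a matrix as the sum of its squared entries, and uses a hypothetical function name `csi_l2o_loss`; the networks that produce the reconstructions are not modeled.

```python
import numpy as np


def csi_l2o_loss(H, H_rec, H_st, beta):
    """Batch average of the two squared-error terms in (11).

    H     : (D, ...) ground-truth channel matrices
    H_rec : (D, ...) end-to-end reconstructions (decoder applied to encoder output)
    H_st  : (D, ...) sparse-transform round trips f_i(f_t(H))
    beta  : weight of the sparse-transformation MSE term
    """
    D = H.shape[0]
    rec_err = np.sum((H - H_rec).reshape(D, -1) ** 2, axis=1)
    st_err = np.sum((H - H_st).reshape(D, -1) ** 2, axis=1)
    return float(np.mean(rec_err + beta * st_err))


H = np.zeros((2, 2, 2))  # two toy 2x2 "channels"
loss = csi_l2o_loss(H, np.ones_like(H), 0.5 * np.ones_like(H), beta=0.1)
print(loss)
```

A larger `beta` puts more emphasis on making the round trip through the learned sparse domain lossless, at the expense of the end-to-end recovery term.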

III-D Decoder: Element-Wise L2O

In order to tackle Problem (10), we propose an L2O strategy that parameterizes the update rule as a learnable model. Unlike existing CS methods that adopt a tedious hand-crafted iterative recovery algorithm, we propose an optimization algorithm that is learned autonomously from data.

III-D1 Proposed L2O Structure

Let $F(\mathbf{x})$ denote the objective function of (10). Conventional CS algorithms, e.g., ISTA, solve Problem (10) via proximal gradient descent. However, the use of a fixed update rule and manually designed optimization parameters leads to a large number of iterations and high computational cost. In contrast to ISTA, we propose to learn the update rule from data to accelerate decoder-side convergence. The proposed method determines the update direction by taking the current estimate, i.e., $\mathbf{x}^{[t]}$, and the gradient of the objective function, i.e., $\nabla F(\mathbf{x}^{[t]})$, as inputs. The general update rule of the $t$-th iteration is written as

$$\mathbf{x}^{[t+1]} = \mathbf{x}^{[t]} - \mathbf{d}^{[t]}(\mathbf{z}^{[t]}), \tag{12}$$

where $\mathbf{d}^{[t]}: \mathcal{Z} \to \mathbb{R}^{2N_aN_t}$ denotes the update direction, $\mathbf{z}^{[t]} \in \mathcal{Z}$ is the input vector, and $\mathcal{Z}$ is the input space. The input vector involves dynamic information, for example $\{\mathbf{x}^{[t]}, F(\mathbf{x}^{[t]}), \nabla F(\mathbf{x}^{[t]})\}$. We assume that the update rule $\mathbf{d}^{[t]}(\cdot)$ is differentiable with respect to the input $\mathbf{z}^{[t]}$ and that its Jacobian is bounded by a scalar $C$. Formally, the space of update rules is defined as follows.

Definition 1

[Space of Update Rules [21]]. Let $\mathrm{J}\mathbf{d}(\mathbf{z})$ denote the Jacobian matrix of the operator $\mathbf{d}: \mathcal{Z} \to \mathbb{R}^{2N_aN_t}$ and let $\|\cdot\|_{\mathrm{F}}$ denote the Frobenius norm. We define the space

$$\mathcal{D}_C(\mathcal{Z}) = \Big\{\mathbf{d}: \mathcal{Z} \to \mathbb{R}^{2N_aN_t} ~\big|~ \mathbf{d} \text{ is differentiable}, ~\|\mathrm{J}\mathbf{d}(\mathbf{z})\|_{\mathrm{F}} \leq C, ~\forall \mathbf{z} \in \mathcal{Z}\Big\}.$$

In practice, training the deep neural network that parameterizes $\mathbf{d}^{[t]}(\cdot)$ requires the derivatives of $\mathbf{d}^{[t]}(\cdot)$. Therefore, the differentiability and bounded Jacobian of the update direction are important. Note that many existing L2O approaches, e.g., the LSTM-based ones in [24, 30], satisfy $\mathbf{d}^{[t]}(\cdot) \in \mathcal{D}_C(\mathcal{Z})$. The definition of the update rule space helps guarantee the convergence of the proposed L2O scheme, which is proved in Section IV-A.

Note that the objective function in Problem (10) contains a smooth fidelity function $f(\mathbf{x}) = \frac{1}{2}\|\mathbf{s}-\mathbf{W}\mathbf{x}\|_2^2$ and a non-smooth regularization function $r(\mathbf{x}) = \lambda\|f_t(\mathbf{x})\|_1$. For the smooth part, $\mathbf{x}^{[t]}$ and $\nabla f(\mathbf{x}^{[t]})$ are taken as the inputs to the update rule. For the non-smooth part, a subgradient $\mathbf{g}^{[t]}$ of $r(\mathbf{x})$ can be utilized. However, the convergence of subgradient descent is generally unstable, and it will not converge to the solution if the step size is constant [31]. The Proximal Point Algorithm (PPA) [32] converges faster and more stably than the subgradient descent method. While subgradient descent adopts an explicit update, PPA uses an implicit update rule, i.e.,

$$\mathbf{x}^{[t+1]} = \mathbf{x}^{[t]} - \alpha_{\text{PPA}}\,\mathbf{g}^{[t+1]}, \tag{13}$$

where $\alpha_{\text{PPA}}$ denotes the step size of the PPA algorithm and $\mathbf{g}^{[t+1]}$ is a subgradient of $r$ evaluated at the new iterate $\mathbf{x}^{[t+1]}$. Inspired by PPA, we select $\mathbf{x}^{[t+1]}$ and $\mathbf{g}^{[t+1]}$ to be the inputs to the update rule $\mathbf{d}^{[t]}(\cdot)$.
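To see why the implicit update (13) is well defined even though $\mathbf{g}^{[t+1]}$ depends on the unknown $\mathbf{x}^{[t+1]}$, it helps to read (13) as the optimality condition of a proximal minimization. The short derivation below assumes that $r$ is convex, which holds, for instance, when $f_t$ is the identity; a learned $f_t$ may not preserve convexity, so this is an illustration rather than a statement about the trained network.

```latex
\begin{align*}
\mathbf{x}^{[t+1]}
  &= \arg\min_{\mathbf{x}} \; r(\mathbf{x})
     + \frac{1}{2\alpha_{\text{PPA}}}\,\|\mathbf{x}-\mathbf{x}^{[t]}\|_2^2 \\
\iff\quad \mathbf{0}
  &\in \partial r(\mathbf{x}^{[t+1]})
     + \frac{1}{\alpha_{\text{PPA}}}\big(\mathbf{x}^{[t+1]}-\mathbf{x}^{[t]}\big) \\
\iff\quad \mathbf{x}^{[t+1]}
  &= \mathbf{x}^{[t]} - \alpha_{\text{PPA}}\,\mathbf{g}^{[t+1]},
     \qquad \mathbf{g}^{[t+1]} \in \partial r(\mathbf{x}^{[t+1]}).
\end{align*}
```

Under this assumption, one PPA step is exactly one evaluation of a proximal operator, which connects the implicit rule to the proximal form used later in the update scheme.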

In addition to $\mathbf{x}^{[t]}$, $\nabla f(\mathbf{x}^{[t]})$, $\mathbf{x}^{[t+1]}$, and $\mathbf{g}^{[t+1]}$, we also introduce an auxiliary input $\mathbf{y}^{[t]}$ to $\mathbf{d}^{[t]}(\cdot)$, which contains information about past estimates and helps accelerate convergence. Recall that the update schemes (9) of the existing deep unfolding methods introduced in the previous section explicitly depend only on the current state $\mathbf{x}^{[t]}$. Therefore, they lose the ability to capture dynamics between iterations and tend to memorize the datasets. To address this drawback, the proposed method encodes historical information in the auxiliary variable $\mathbf{y}^{[t]}$ through an operator $\mathbf{m}$:

$$\mathbf{y}^{[t]} = \mathbf{m}(\mathbf{x}^{[t]}, \mathbf{x}^{[t-1]}, \cdots, \mathbf{x}^{[t-K]}), \tag{14}$$

where, in addition to the current estimate $\mathbf{x}^{[t]}$, the estimates from the past $K$ iterations are also taken into consideration. To facilitate parameterization and training, we assume $\mathbf{m}$ is differentiable, i.e., $\mathbf{m} \in \mathcal{D}_C(\mathbb{R}^{(K+1)\times 2N_aN_t})$. With the help of $\mathbf{y}^{[t]}$, we are able to infuse more information into the update rule. We set the current estimate, its gradient, the current auxiliary variable, and the gradient evaluated at the auxiliary variable as the inputs of the update rule $\mathbf{d}^{[t]}$. The update rule is then written as [21]

$$\mathbf{x}^{[t+1]} = \mathbf{x}^{[t]} - \mathbf{d}^{[t]}\big(\mathbf{x}^{[t]}, \nabla f(\mathbf{x}^{[t]}), \mathbf{x}^{[t+1]}, \mathbf{g}^{[t+1]}, \mathbf{y}^{[t]}, \nabla f(\mathbf{y}^{[t]})\big). \tag{15}$$

Following the derivation in [21, Theorem 4], a good update rule should satisfy the asymptotic fixed point condition and the global convergence condition. We then derive a math-structured update rule from the generic update rule (15), i.e., for any bounded matrix sequence $\{\mathbf{B}^{[t]}\}_{t=1}^{\infty}$, there exist $\mathbf{P}_1^{[t]}$, $\mathbf{P}_2^{[t]}$, $\mathbf{A}^{[t]}$, $\mathbf{b}_1^{[t]}$, and $\mathbf{b}_2^{[t]}$ such that

$$\begin{aligned}
\mathbf{x}^{[t+1]} &= \mathbf{x}^{[t]} - (\mathbf{P}_1^{[t]} - \mathbf{P}_2^{[t]})\nabla f(\mathbf{x}^{[t]}) - \mathbf{P}_2^{[t]}\nabla f(\mathbf{y}^{[t]}) - \mathbf{b}_1^{[t]} - \mathbf{P}_1^{[t]}\mathbf{g}^{[t+1]} + \mathbf{B}^{[t]}(\mathbf{y}^{[t]} - \mathbf{x}^{[t]}), \\
\mathbf{y}^{[t+1]} &= (\mathbf{I} - \mathbf{A}^{[t]})\mathbf{x}^{[t+1]} + \mathbf{A}^{[t]}\mathbf{x}^{[t]} + \mathbf{b}_2^{[t]},
\end{aligned} \tag{16}$$

for all $t = 1, 2, \cdots$, with $\{\mathbf{P}_1^{[t]}, \mathbf{P}_2^{[t]}, \mathbf{A}^{[t]}\}$ being bounded, and $\mathbf{b}_1^{[t]} \to \mathbf{0}$, $\mathbf{b}_2^{[t]} \to \mathbf{0}$ as $t \to \infty$. If we further assume that $\mathbf{P}_1^{[t]}$ is uniformly symmetric positive definite, then we can substitute $\mathbf{P}_2^{[t]}{\mathbf{P}_1^{[t]}}^{-1}$ with $\mathbf{B}^{[t]}$ and obtain

$$\begin{aligned}
\hat{\mathbf{x}}^{[t]} &= \mathbf{x}^{[t]} - \mathbf{P}_1^{[t]}\nabla f(\mathbf{x}^{[t]}), \\
\hat{\mathbf{y}}^{[t]} &= \mathbf{y}^{[t]} - \mathbf{P}_1^{[t]}\nabla f(\mathbf{y}^{[t]}), \\
\mathbf{x}^{[t+1]} &= \operatorname{prox}_{r,\mathbf{P}_1^{[t]}}\Big((\mathbf{I} - \mathbf{B}^{[t]})\hat{\mathbf{x}}^{[t]} + \mathbf{B}^{[t]}\hat{\mathbf{y}}^{[t]} - \mathbf{b}_1^{[t]}\Big), \\
\mathbf{y}^{[t+1]} &= \mathbf{x}^{[t+1]} + \mathbf{A}^{[t]}(\mathbf{x}^{[t+1]} - \mathbf{x}^{[t]}) + \mathbf{b}_2^{[t]},
\end{aligned} \tag{17}$$

where $\operatorname{prox}_{r,\mathbf{P}_1^{[t]}}(\cdot)$ denotes the proximal operator, defined as

$$\operatorname{prox}_{r,\mathbf{P}}(\bar{\mathbf{x}}) := \operatorname*{arg\,min}_{\mathbf{x}} \; r(\mathbf{x}) + \frac{1}{2}\|\mathbf{x} - \bar{\mathbf{x}}\|^2_{\mathbf{P}^{-1}}. \tag{18}$$

The norm $\|\cdot\|_{\mathbf{P}^{-1}}$ is defined as $\|\mathbf{x}\|_{\mathbf{P}^{-1}} := \sqrt{\mathbf{x}^{\top}\mathbf{P}^{-1}\mathbf{x}}$.
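For intuition, (18) admits a closed form in a simple special case. The derivation below assumes $r(\mathbf{x}) = \lambda\|\mathbf{x}\|_1$ (i.e., $f_t$ replaced by the identity, a simplification made here and not in the paper) and a diagonal $\mathbf{P} = \mathrm{diag}(\mathbf{p})$ with $p_j > 0$, so that the problem separates across elements:

```latex
\begin{align*}
\big[\operatorname{prox}_{r,\mathbf{P}}(\bar{\mathbf{x}})\big]_j
  &= \arg\min_{x_j} \; \lambda |x_j| + \frac{(x_j - \bar{x}_j)^2}{2 p_j} \\
  &= \operatorname{sign}(\bar{x}_j)\,\max\big(|\bar{x}_j| - \lambda p_j,\; 0\big).
\end{align*}
```

Each element is therefore soft-thresholded with its own threshold $\lambda p_j$, so in this special case the preconditioner scales both the gradient step and the shrinkage applied to each coordinate.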

In the update scheme (17), $\mathbf{b}_1^{[t]}$ and $\mathbf{b}_2^{[t]}$ are biases; $\mathbf{A}^{[t]}$ is an accelerator term, which can be viewed as an extension of Nesterov momentum; $\mathbf{P}_1^{[t]}$ is the preconditioner, which plays a role similar to the step size in gradient descent; and $\mathbf{B}^{[t]}$ is a balancing term between $\hat{\mathbf{x}}^{[t]}$ and $\hat{\mathbf{y}}^{[t]}$. If $\mathbf{B}^{[t]} = \mathbf{0}$, then $\mathbf{x}^{[t+1]}$ explicitly depends only on $\mathbf{x}^{[t]}$, and if $\mathbf{B}^{[t]} = \mathbf{I}$, then $\mathbf{x}^{[t+1]}$ explicitly depends only on $\mathbf{y}^{[t]}$.
Note that ISTA is a special case of the update rule (17). When $\mathbf{B}^{[t]} = \mathbf{A}^{[t]} = \mathbf{0}$ and $\mathbf{b}_1^{[t]} = \mathbf{b}_2^{[t]} = \mathbf{0}$, with $\mathbf{P}_1^{[t]}$ set to a scalar step size times the identity, (17) reduces to ISTA. Therefore, (17) provides more degrees of freedom and is able to enhance reconstruction performance.

To obtain a better balance between performance and efficiency, in our Csi-L2O decoding module, $\mathbf{P}_1^{[t]}$, $\mathbf{B}^{[t]}$, and $\mathbf{A}^{[t]}$ are implemented as diagonal matrices, i.e.,

$$\mathbf{P}_1^{[t]}=\mathrm{diag}(\mathbf{p}^{[t]}),\quad \mathbf{B}^{[t]}=\mathrm{diag}(\mathbf{b}^{[t]}),\quad \mathbf{A}^{[t]}=\mathrm{diag}(\mathbf{a}^{[t]}),$$

where $\mathbf{p}^{[t]},\mathbf{b}^{[t]},\mathbf{a}^{[t]}\in\mathbb{R}^{2N_aN_t\times 1}$. The proximal operator is set as a scaled soft-thresholding operator, which is expressed as

$$\operatorname{prox}_{\theta^{[t]}}(\mathbf{x}^{[t]})=\operatorname{sign}\left(\mathbf{x}^{[t]}\right)\max\left(\mathbf{0},\left|\mathbf{x}^{[t]}\right|-\theta^{[t]}\right),$$

where $\theta^{[t]}$ denotes the soft-thresholding parameter in the $t$-th iteration. Update rule (17) then becomes:

\begin{align}
\hat{\mathbf{x}}^{[t]} &= \mathbf{x}^{[t]}-\mathbf{p}^{[t]}\odot\nabla f(\mathbf{x}^{[t]}), \tag{19}\\
\hat{\mathbf{y}}^{[t]} &= \mathbf{y}^{[t]}-\mathbf{p}^{[t]}\odot\nabla f(\mathbf{y}^{[t]}), \notag\\
\mathbf{x}^{[t+1]} &= \operatorname{prox}_{\theta^{[t]}}\Big((\mathbf{1}-\mathbf{b}^{[t]})\odot\hat{\mathbf{x}}^{[t]}+\mathbf{b}^{[t]}\odot\hat{\mathbf{y}}^{[t]}-\mathbf{b}_1^{[t]}\Big), \notag\\
\mathbf{y}^{[t+1]} &= \mathbf{x}^{[t+1]}+\mathbf{a}^{[t]}\odot(\mathbf{x}^{[t+1]}-\mathbf{x}^{[t]})+\mathbf{b}_2^{[t]}. \notag
\end{align}
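
As an illustration, one iteration of update rule (19) can be sketched in NumPy as follows. The sketch assumes, purely for concreteness, a least-squares data-fidelity term $f(\mathbf{x})=\frac{1}{2}\|\mathbf{A}\mathbf{x}-\mathbf{s}\|_2^2$. The names `A`, `s`, `grad_f`, `soft_threshold`, and `l2o_step`, the problem sizes, and the hand-picked parameter values are hypothetical and are not taken from the paper; in Csi-L2O the parameters are generated by a learned network rather than fixed by hand.

```python
import numpy as np


def soft_threshold(v, theta):
    """Element-wise soft-thresholding: sign(v) * max(0, |v| - theta)."""
    return np.sign(v) * np.maximum(0.0, np.abs(v) - theta)


def grad_f(A, s, x):
    """Gradient of the assumed least-squares term f(x) = 0.5 * ||A x - s||^2."""
    return A.T @ (A @ x - s)


def l2o_step(A, s, x, y, p, a, b, b1, b2, theta):
    """One iteration of update rule (19) with element-wise (diagonal) parameters."""
    x_hat = x - p * grad_f(A, s, x)
    y_hat = y - p * grad_f(A, s, y)
    x_next = soft_threshold((1.0 - b) * x_hat + b * y_hat - b1, theta)
    y_next = x_next + a * (x_next - x) + b2
    return x_next, y_next


rng = np.random.default_rng(0)
m, n = 8, 16                      # hypothetical codeword / signal dimensions
A = rng.standard_normal((m, n))
s = rng.standard_normal(m)
x = rng.standard_normal(n)
y = x.copy()
alpha, theta = 0.01, 0.05         # hand-picked step size and threshold

# With a = b = b1 = b2 = 0 and a constant step size, the step is plain ISTA.
zeros = np.zeros(n)
x_next, y_next = l2o_step(A, s, x, y, alpha * np.ones(n),
                          zeros, zeros, zeros, zeros, theta)
x_ista = soft_threshold(x - alpha * grad_f(A, s, x), theta)
```

Setting $\mathbf{a}^{[t]}=\mathbf{b}^{[t]}=\mathbf{b}_1^{[t]}=\mathbf{b}_2^{[t]}=\mathbf{0}$ with a constant $\mathbf{p}^{[t]}$, as in the last lines, recovers a plain ISTA step, so `x_next` and `x_ista` coincide.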
TABLE I: The computational complexity of different methods.

| Methods | Csi-L2O | CsiNet | TransNet | Deep Unfolding |
| --- | --- | --- | --- | --- |
| Encoder Complexity | $O(N_aN_tM)$ | $O(N_aN_tK_{\text{en}}^2C_{\text{in}}C_{\text{out}}+N_aN_tM)$ | $O(2(2N_a^2d_{\text{en}}+\frac{1}{2}N_ad_{\text{en}}^2+d_{\text{en}}N_aN_t)+N_aN_tM)$ | $O(N_aN_tM)$ |
| Decoder Complexity | $O(T_{\text{L2O}}(2C_{f_i}+C_{\text{LSTM}}))$ | $O(2(\sum_{i=1}^{3}N_aN_tK_{\text{de},i}^2C_{\text{in},i}C_{\text{out},i})+N_aN_tM)$ | $O(2(4N_a^2d_{\text{de}}+N_ad_{\text{de}}^2+d_{\text{de}}N_aN_t)+2N_aN_tM)$ | $O(T_{\text{DU}}(2C_{\text{ST}}+C_{\text{ISTA}}))$ |

III-D2 Neural Network Design

To generate the most appropriate decoding algorithm, the optimization parameters $\mathbf{p}^{[t]}$, $\mathbf{a}^{[t]}$, $\mathbf{b}^{[t]}$, $\mathbf{b}_1^{[t]}$, and $\mathbf{b}_2^{[t]}$ are not selected manually but learned from a large amount of data. Note that $\mathbf{p}^{[t]},\mathbf{a}^{[t]},\mathbf{b}^{[t]},\mathbf{b}_1^{[t]},\mathbf{b}_2^{[t]}\in\mathbb{R}^{2N_aN_t\times 1}$. In FDD massive MIMO systems, $N_a$ and $N_t$ are large.
If a black-box neural network is adopted to model these optimization parameters, training such a giant and unstructured network will be very difficult. In addition, for FDD massive MIMO CSI feedback, the compression ratio needs to be adjusted according to the dynamic communication environment. A reconstruction algorithm with good generalization ability is thus highly desirable. Taking these two aspects into consideration, we design an element-wise L2O mechanism. In contrast to traditional deep unfolding methods, which directly make the optimization parameters $\mathbf{p}^{[t]}$, $\mathbf{a}^{[t]}$, $\mathbf{b}^{[t]}$, $\mathbf{b}_1^{[t]}$, and $\mathbf{b}_2^{[t]}$ trainable, we model them as the output of an element-wise LSTM, which greatly improves the generalization ability to different problem scales. The element-wise LSTM is parameterized by learnable parameters $\phi_{\text{LSTM}}$ and takes the current estimate $\mathbf{x}^{[t]}$ and the gradient $\nabla f(\mathbf{x}^{[t]})$ as the input:

\begin{align}
\mathbf{c}^{[t]},\mathbf{e}^{[t]} &= \mathrm{LSTM}\big(\mathbf{x}^{[t]},\nabla f(\mathbf{x}^{[t]}),\mathbf{e}^{[t-1]};\phi_{\text{LSTM}}\big), \tag{20}\\
\mathbf{p}^{[t]},\mathbf{a}^{[t]},\mathbf{b}^{[t]},\mathbf{b}_1^{[t]},\mathbf{b}_2^{[t]} &= \mathrm{MLP}(\mathbf{c}^{[t]};\phi_{\text{MLP}}), \notag
\end{align}

where $\mathbf{e}^{[t]}$ is the internal state of the LSTM, $\mathbf{e}^{[0]}$ is randomly sampled from a Gaussian distribution, and $\mathbf{c}^{[t]}$ is the output of the LSTM, which is then fed into the MLP to generate the optimization parameters. The detailed procedure is illustrated in Fig. 2(b). An “element-wise” LSTM means that the same network is shared across all coordinates of the input. Specifically, each coordinate of $\mathbf{x}^{[t]}$ and $\nabla f(\mathbf{x}^{[t]})$ is fed into the LSTM in parallel. In this way, a single model can be applied to optimization problems of any scale and thus suits variable compression ratios. It is common in classic optimization algorithms to take positive $\mathbf{p}^{[t]}$ and $\mathbf{a}^{[t]}$. Therefore, we use an additional activation function, e.g., the sigmoid function, to post-process $\mathbf{p}^{[t]}$ and $\mathbf{a}^{[t]}$. Equations (19) and (20) together define the L2O decoding scheme.
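
The weight sharing behind the element-wise LSTM can be sketched as follows: the coordinates of the input are treated as a batch, so the weight shapes depend only on the hidden size and never on the problem dimension. This is a minimal NumPy sketch under assumptions, not the authors' implementation. The hidden size, the weight initialization, the single linear layer standing in for the MLP, and all names (`init_params`, `elementwise_lstm_step`, `hid`, `cell`) are hypothetical. Here `hid` and `cell` together play the role of the internal state $\mathbf{e}^{[t]}$, and the new hidden state plays the role of the output $\mathbf{c}^{[t]}$.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def init_params(hidden, rng):
    """Weights of one LSTM cell (input size 2) and a linear output head.

    All shapes depend only on `hidden`, never on the problem dimension n.
    """
    return {
        "W": 0.1 * rng.standard_normal((2 + hidden, 4 * hidden)),
        "bias": np.zeros(4 * hidden),
        "W_out": 0.1 * rng.standard_normal((hidden, 5)),
        "b_out": np.zeros(5),
    }


def elementwise_lstm_step(params, x, grad, hid, cell):
    """Apply the same LSTM cell to every coordinate; coordinates act as a batch."""
    inp = np.concatenate([np.stack([x, grad], axis=1), hid], axis=1)  # (n, 2 + hidden)
    gates = inp @ params["W"] + params["bias"]                        # (n, 4 * hidden)
    i, f, g, o = np.split(gates, 4, axis=1)
    cell_new = sigmoid(f) * cell + sigmoid(i) * np.tanh(g)
    hid_new = sigmoid(o) * np.tanh(cell_new)
    out = hid_new @ params["W_out"] + params["b_out"]                 # (n, 5)
    p, a = sigmoid(out[:, 0]), sigmoid(out[:, 1])                     # kept positive
    b, b1, b2 = out[:, 2], out[:, 3], out[:, 4]
    return (p, a, b, b1, b2), hid_new, cell_new


rng = np.random.default_rng(0)
hidden = 8
params = init_params(hidden, rng)

for n in (16, 64):                 # two problem sizes, one set of weights
    x, grad = rng.standard_normal(n), rng.standard_normal(n)
    hid0 = rng.standard_normal((n, hidden))    # Gaussian initial state
    cell0 = rng.standard_normal((n, hidden))
    (p, a, b, b1, b2), hid1, cell1 = elementwise_lstm_step(params, x, grad, hid0, cell0)
    print(n, p.shape, bool(np.all(p > 0) and np.all(a > 0)))
```

Because `params` is created once and reused for both values of `n`, the same model produces per-coordinate parameters for problems of different sizes without retraining or reshaping.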

III-D3 Comparison with Deep Unfolding

Key differences between the Csi-L2O method and deep unfolding methods include the way of parameterization and the existence of a convergence guarantee. On the one hand, unlike the element-wise LSTM parameterization (20), deep unfolding methods make the optimization parameters learnable and directly optimize them from data. For example, instead of using a neural network to generate $\mathbf{p}^{[t]},\mathbf{a}^{[t]},\mathbf{b}^{[t]},\mathbf{b}_1^{[t]},\mathbf{b}_2^{[t]}$, one can directly make the step size and soft-thresholding parameters trainable. However, this direct parameterization introduces several limitations. It hampers the model’s ability to capture dynamics between iterations and leads to a tendency to memorize specific datasets rather than generalize. Additionally, direct parameterization requires that the dimensions of the learnable parameters match the scale of the problem, which restricts the model’s applicability to optimization problems of different scales during inference. This constraint prevents deep unfolding methods from generalizing effectively to different compression ratios. On the other hand, since deep unfolding algorithms are built from a fixed and finite number of layers, it is difficult to scale them to a different number of iterations. When the number of layers differs between training and testing, it is hard to ensure the convergence of deep unfolding.

IV Convergence and Complexity Analysis

In this section, we first discuss the importance of convergence for the proposed update rule and establish it formally. Then, we analyze the computational complexity of the proposed method and compare it with that of other benchmarks.

IV-A Convergence Analysis

Conventional deep unfolding methods typically lack a convergence guarantee, making it difficult to scale them to a variable number of layers during inference [18]. Deploying a number of layers different from that used in training results in performance fluctuation. In this subsection, we prove the convergence of the proposed update rule, i.e., $\mathbf{d}^{[t]}(\cdot)$. The convergence guarantee improves the reliability of the proposed method and helps determine the appropriate number of layers during inference.

Let $\mathbf{x}^{*}$ be the fixed point of Problem (10). We then have the following theorem.

Theorem 1.

For any $\mathbf{x}^{*}\in\operatorname*{arg\,min}_{\mathbf{x}\in\mathbb{R}^{2N_aN_t}}F(\mathbf{x})$,

\begin{align}
\lim_{t\to\infty}\ & \mathbf{d}^{[t]}(\mathbf{x}^{*},\nabla f(\mathbf{x}^{*}),\mathbf{x}^{*},-\nabla f(\mathbf{x}^{*}),\mathbf{x}^{*},\nabla f(\mathbf{x}^{*}))=\mathbf{0}, \tag{21}\\
& \mathbf{m}(\mathbf{x}^{*},\mathbf{x}^{*},\cdots,\mathbf{x}^{*})=\mathbf{x}^{*}. \tag{22}
\end{align}

For any sequences $\{\mathbf{x}^{[t]},\mathbf{y}^{[t]}\}_{t=0}^{\infty}$ generated by (14) and (15), there exists one $\mathbf{x}^{*}\in\operatorname*{arg\,min}_{\mathbf{x}\in\mathbb{R}^{2N_aN_t}}F(\mathbf{x})$ such that

$$\lim_{t\to\infty}\mathbf{x}^{[t]}=\lim_{t\to\infty}\mathbf{y}^{[t]}=\mathbf{x}^{*}.\tag{23}$$
Proof:

Please refer to Appendix A. ∎

Eqn. (21) shows that the proposed update rule $\mathbf{d}^{[t]}(\cdot)$ guarantees $\mathbf{x}^{[t+1]}=\mathbf{x}^{*}$ as long as $\mathbf{x}^{[t]}=\mathbf{x}^{*}$. In other words, once $\mathbf{x}^{[t]}$ reaches a solution, the subsequent iterates remain at that solution. Eqns. (21) and (22) together guarantee the convergence of the proposed parameterized update rule.
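
The fixed-point property can be checked numerically on plain ISTA, which is a special case of the update rule. The sketch below is an illustration under assumptions that are not taken from the paper: a small overdetermined least-squares term (chosen so that the iteration converges quickly), an $\ell_1$ regularizer with a hand-picked weight `lam`, and a step size of $1/L$, where $L$ is the largest eigenvalue of $\mathbf{A}^{\top}\mathbf{A}$. All names are hypothetical.

```python
import numpy as np


def soft_threshold(v, theta):
    """Element-wise soft-thresholding: sign(v) * max(0, |v| - theta)."""
    return np.sign(v) * np.maximum(0.0, np.abs(v) - theta)


def ista_step(A, s, x, alpha, lam):
    """One ISTA step for F(x) = 0.5 * ||A x - s||^2 + lam * ||x||_1."""
    return soft_threshold(x - alpha * (A.T @ (A @ x - s)), alpha * lam)


rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))          # tall matrix (illustrative assumption)
s = rng.standard_normal(20)
lam = 0.1
alpha = 1.0 / np.linalg.norm(A, 2) ** 2   # step size 1/L

x = np.zeros(5)
for _ in range(5000):
    x = ista_step(A, s, x, alpha, lam)    # iterate until numerically stationary

# Once the iterate has (numerically) reached a minimizer, one more step
# leaves it essentially unchanged.
drift = np.linalg.norm(ista_step(A, s, x, alpha, lam) - x)
print(drift)
```

A negligible `drift` is the numerical counterpart of the statement that an iterate equal to $\mathbf{x}^{*}$ is mapped to $\mathbf{x}^{*}$.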

IV-B Complexity Analysis

The encoder-side computational complexity of the proposed Csi-L2O and that of other baselines are listed in Table I. Since the encoder consists only of a linear projection, the encoder complexity of the proposed Csi-L2O is $O(N_aN_tM)$, which grows linearly with the number of antennas and the dimension of the codeword. The encoder complexity of CsiNet is $O(N_aN_tK_{\text{en}}^2C_{\text{in}}C_{\text{out}}+N_aN_tM)$, where $K_{\text{en}}$, $C_{\text{in}}$, and $C_{\text{out}}$ denote the dimension of the convolutional kernel and the numbers of input and output channels, respectively. As the encoding module of CsiNet consists of both convolutional and fully connected layers, its computational complexity is higher than that of the proposed Csi-L2O.
On the other hand, the encoder complexity of TransNet is $O(2(2N_a^2d_{\text{en}}+\frac{1}{2}N_ad_{\text{en}}^2+d_{\text{en}}N_aN_t)+N_aN_tM)$, where $d_{\text{en}}$ denotes the encoder-side self-attention dimension. The complexity mainly comes from two attention-based encoding blocks and fully connected layers. Although the transformer-based autoencoder achieves SOTA performance, it imposes a prohibitive computational burden on resource-constrained devices. The encoder-side complexity of deep unfolding methods, including ISTA-Net and TiLISTA, is $O(N_aN_tM)$, since they also use a linear projection at the encoder. The complexity analysis thus indicates that the proposed method achieves much higher encoder-side computational efficiency than the SOTA method, TransNet.
The computational complexity reduction is more pronounced when $N_t$ and/or $N_a$ is large, which is indeed the situation that future wireless systems will face [2].
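
To make the encoder-side expressions in Table I concrete, the following sketch evaluates them for one set of dimensions. All numerical values (`Na`, `Nt`, `M`, `K_en`, `C_in`, `C_out`, `d_en`) and the function names are hypothetical choices for illustration, not the settings used in the paper, and the results are term counts from the big-O expressions with constants as written, not measured FLOPs.

```python
def encoder_cost_csi_l2o(Na, Nt, M):
    """Linear projection only: O(Na * Nt * M). Deep unfolding has the same form."""
    return Na * Nt * M


def encoder_cost_csinet(Na, Nt, M, K_en, C_in, C_out):
    """Convolutional layer plus fully connected layer."""
    return Na * Nt * K_en**2 * C_in * C_out + Na * Nt * M


def encoder_cost_transnet(Na, Nt, M, d_en):
    """Two attention-based blocks plus fully connected layer."""
    block = 2 * Na**2 * d_en + (Na * d_en**2) // 2 + d_en * Na * Nt
    return 2 * block + Na * Nt * M


# Hypothetical dimensions, chosen only to illustrate the formulas.
Na, Nt, M = 32, 32, 512
K_en, C_in, C_out, d_en = 3, 2, 2, 64

costs = {
    "Csi-L2O": encoder_cost_csi_l2o(Na, Nt, M),
    "CsiNet": encoder_cost_csinet(Na, Nt, M, K_en, C_in, C_out),
    "TransNet": encoder_cost_transnet(Na, Nt, M, d_en),
}
for name, cost in costs.items():
    print(f"{name}: {cost}")
```

For any positive dimensions, the CsiNet and TransNet counts exceed the Csi-L2O count, because both contain the same $N_aN_tM$ projection term plus additional nonnegative terms.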

The decoder-side computational complexity of different methods is also shown in Table I. The decoder complexity of the proposed Csi-L2O is $O(T_{\text{L2O}}(2C_{f_i}+C_{\text{LSTM}}))$, where $T_{\text{L2O}}$ denotes the number of layers in the decoder, $C_{f_i}$ is the complexity of the sparse transformation function $f_t(\cdot)$, and $C_{\text{LSTM}}$ denotes the complexity of the LSTM. The decoder complexity of deep unfolding methods exhibits a similar structure, i.e., $O(T_{\text{DU}}(2C_{\text{ST}}+C_{\text{ISTA}}))$, where $T_{\text{DU}}$ denotes the number of layers, $C_{\text{ST}}$ is the complexity of the sparse transformation, and $C_{\text{ISTA}}$ denotes the complexity of each iteration in ISTA.
Besides, the decoder complexity of CsiNet is $O(2(\sum_{i=1}^{3}N_aN_tK_{\text{de},i}^2C_{\text{in},i}C_{\text{out},i})+N_aN_tM)$, where $K_{\text{de},i}$, $C_{\text{in},i}$, and $C_{\text{out},i}$ denote the dimension of the convolutional kernel and the numbers of input and output channels of the $i$-th layer in the CNN, respectively.
The decoder complexity of TransNet is $O(2(4N_a^2d_{\text{de}}+N_ad_{\text{de}}^2+d_{\text{de}}N_aN_t)+2N_aN_tM)$, where $d_{\text{de}}$ denotes the decoder-side self-attention dimension. Although a direct comparison of the decoder-side computational complexity of different methods is difficult, we will show the exact values for the different approaches via simulations in Section V-C.

V Simulation Results

In this section, we demonstrate the performance of the proposed Csi-L2O network for CSI feedback. We first introduce the dataset generation, training settings, and evaluation metrics. The performance of the proposed approach is then compared with several representative baselines. Next, we discuss the computational complexity and convergence behavior of different DL-based CSI feedback methods. The bit-level performance is also demonstrated, where a quantization module is added to generate binary bit streams. Finally, multiple-rate feedback scenarios are considered, which validates the superior generalization ability of the proposed Csi-L2O to different compression ratios.

V-A Simulation Setup

V-A1 Data Generation

Following the experimental setting in [9], two types of channel matrices are generated according to the COST 2100 model [33], i.e., the indoor picocellular scenario working at the 5.3 GHz band and the outdoor rural scenario working at the 300 MHz band. The BS is equipped with a uniform linear array with $N_{t}=32$ antennas and the number of subcarriers is 1024. The original $2\times 1024\times 32$ CSI matrix is transformed into the angular-delay domain and truncated to the first 32 rows, forming the $2\times 32\times 32$ matrix $\mathbf{H}^{\prime\prime}$.
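The preprocessing described above can be sketched as follows. This is a minimal pure-Python illustration on toy dimensions, assuming (as is common in the setup of [9]) that the angular-delay representation is obtained with a 2D DFT of the form $\mathbf{F}_{d}\tilde{\mathbf{H}}\mathbf{F}_{a}^{H}$. The function names, the unitary normalization, and the toy sizes are illustrative assumptions and are not taken from the paper.

```python
import cmath
import math


def dft_matrix(n):
    """Unitary n x n DFT matrix (normalization is an illustrative choice)."""
    scale = 1.0 / math.sqrt(n)
    return [[scale * cmath.exp(-2j * math.pi * r * c / n) for c in range(n)]
            for r in range(n)]


def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def conj_transpose(a):
    """Hermitian transpose of a nested-list matrix."""
    return [[a[r][c].conjugate() for r in range(len(a))]
            for c in range(len(a[0]))]


def to_angular_delay(h_spatial_freq, keep_rows):
    """Map a (subcarriers x antennas) complex matrix to the angular-delay
    domain, keep the first `keep_rows` delay rows, and split the result
    into real and imaginary planes (shape 2 x keep_rows x antennas)."""
    n_c, n_t = len(h_spatial_freq), len(h_spatial_freq[0])
    h_ad = matmul(matmul(dft_matrix(n_c), h_spatial_freq),
                  conj_transpose(dft_matrix(n_t)))
    truncated = h_ad[:keep_rows]
    real = [[v.real for v in row] for row in truncated]
    imag = [[v.imag for v in row] for row in truncated]
    return [real, imag]


# Toy example: 16 subcarriers, 4 antennas, keep the first 4 delay rows.
toy_channel = [[1 + 0j] * 4 for _ in range(16)]
toy_csi = to_angular_delay(toy_channel, keep_rows=4)
```

In the paper's setting the same steps would be applied with 1024 subcarriers, 32 antennas, and 32 retained rows; a library FFT would be the practical choice at that size.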

V-A2 Training Settings

The training, validation, and test datasets contain 100,000, 30,000, and 20,000 samples, respectively. The Adam optimizer is used for trainable weight updates [34]. Kaiming initialization is used as the neural network initialization approach. We train the neural network for 1000 epochs with a mini-batch size of 200 and a learning rate of 0.0001. The loss function in (11) is used as the unsupervised loss, where $\beta$ is set to 0.01. $f_{t}(\cdot)$ is a three-layer MLP with hidden units [128, 128, 256] and $f_{i}(\cdot)$ exhibits a reverse structure, i.e., a three-layer MLP with hidden units [256, 128, 128]. The top 51 elements are retained in the top-$G$ activation of the sparse transformation. A two-layer LSTM with a hidden size of two is adopted as the element-wise LSTM in the L2O decoding module. A single-layer MLP with input size 20 and output size 20 generates the intermediate parameters, which are then fed into five distinct single-layer MLPs to output the optimization parameters in the element-wise L2O.

V-A3 Evaluation Metric

The normalized mean squared error (NMSE) between the recovered channel and the true channel is used to evaluate the performance, which is given by

$$\text{NMSE}=\mathbb{E}\left\{\frac{\|\mathbf{H}^{\prime\prime}-\hat{\mathbf{H}}^{\prime\prime}\|_{2}^{2}}{\|\mathbf{H}^{\prime\prime}\|_{2}^{2}}\right\}. \tag{24}$$
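As a small illustration of the metric in (24), the sketch below averages the per-sample normalized error and converts it to dB, the unit used when NMSE values are reported in the tables. It assumes each CSI sample has been flattened into a list of real values (real and imaginary parts); the function names are illustrative.

```python
import math


def nmse(true_samples, est_samples):
    """Empirical NMSE: mean over samples of ||H - H_hat||^2 / ||H||^2."""
    ratios = []
    for h, h_hat in zip(true_samples, est_samples):
        err = sum((a - b) ** 2 for a, b in zip(h, h_hat))
        power = sum(a ** 2 for a in h)
        ratios.append(err / power)
    return sum(ratios) / len(ratios)


def to_db(value):
    """Convert a linear-scale ratio to decibels."""
    return 10.0 * math.log10(value)


# Two toy samples, each with a 10% amplitude error on its only nonzero entry.
truth = [[1.0, 0.0], [0.0, 2.0]]
estimate = [[0.9, 0.0], [0.0, 1.8]]
nmse_db = to_db(nmse(truth, estimate))
```

Note that the ratio is formed per sample before averaging, which follows the placement of the expectation in (24).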

In addition, the number of FLOPs is used to measure the time complexity of the learning model, and the number of trainable parameters is adopted as a metric to measure the space complexity [11]. All the simulations are done using the existing DL platform PyTorch. The number of FLOPs and trainable parameters are calculated using the thop package [35] for PyTorch.

V-B Performance Comparison

To illustrate the effectiveness of the proposed CSI feedback design, we adopt five benchmarks for comparison:

  • ISTA: A classical CS algorithm without learning component.

  • MS4L2O [21]: A mathematics-inspired L2O framework that is directly applied to the CSI feedback problem.

  • CsiNet [9]: An exploratory fully data-driven CSI feedback scheme that enjoys low time and space complexity.

  • TransNet [11]: A transformer-based method that achieves SOTA performance but induces heavy computational costs.

  • TiLISTA [19]: An ISTA-based deep unfolding method for CSI feedback where a sparse auto-encoder is utilized to learn the sparse transformation in the spatial domain.
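For reference, the classical ISTA iteration that underlies the first baseline (and that TiLISTA unfolds) can be sketched as below. This is a generic textbook version for $\min_{\mathbf{x}} \frac{1}{2}\|\mathbf{A}\mathbf{x}-\mathbf{y}\|_2^2+\lambda\|\mathbf{x}\|_1$, not the exact configuration used in the simulations; the matrix `A`, the regularization weight `lam`, and the step size `step` (which should not exceed the reciprocal of the largest eigenvalue of $\mathbf{A}^{T}\mathbf{A}$) are assumptions supplied by the caller.

```python
import math


def soft_threshold(vec, tau):
    """Element-wise shrinkage: sign(v) * max(|v| - tau, 0)."""
    return [math.copysign(max(abs(v) - tau, 0.0), v) for v in vec]


def ista(A, y, lam, step, num_iters):
    """Run ISTA from the all-zero initial point and return the final iterate."""
    m, n = len(A), len(A[0])
    x = [0.0] * n
    for _ in range(num_iters):
        # Residual A x - y and gradient A^T (A x - y) of the quadratic term.
        residual = [sum(A[i][j] * x[j] for j in range(n)) - y[i]
                    for i in range(m)]
        grad = [sum(A[i][j] * residual[i] for i in range(m))
                for j in range(n)]
        # Gradient step followed by the proximal (soft-thresholding) step.
        x = soft_threshold([x[j] - step * grad[j] for j in range(n)],
                           step * lam)
    return x


# Toy problem with an identity sensing matrix.
x_hat = ista([[1.0, 0.0], [0.0, 1.0]], [3.0, 0.5],
             lam=1.0, step=1.0, num_iters=5)
```

With an identity sensing matrix the iteration reduces to a single soft-thresholding of the measurements, which makes the toy case easy to check by hand.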

TABLE II: The encoder-side FLOPs and number of trainable parameters of different methods under compression ratios 1/8, 1/16, 1/32, and 1/64.

| Method | 1/8 FLOPs | 1/8 Params | 1/16 FLOPs | 1/16 Params | 1/32 FLOPs | 1/32 Params | 1/64 FLOPs | 1/64 Params |
|---|---|---|---|---|---|---|---|---|
| ISTA | 0.524 M | 0 | 0.262 M | 0 | 0.131 M | 0 | 0.066 M | 0 |
| MS4L2O | 0.524 M | 0 | 0.262 M | 0 | 0.131 M | 0 | 0.066 M | 0 |
| CsiNet | 0.561 M | 0.524 M | 0.299 M | 0.262 M | 0.168 M | 0.131 M | 0.102 M | 0.066 M |
| TiLISTA | 0.524 M | 0.524 M | 0.262 M | 0.262 M | 0.131 M | 0.131 M | 0.066 M | 0.066 M |
| TransNet | 17.334 M | 0.789 M | 17.072 M | 0.526 M | 16.941 M | 0.395 M | 16.876 M | 0.330 M |
| Proposed | 0.524 M | 0.524 M | 0.262 M | 0.262 M | 0.131 M | 0.131 M | 0.066 M | 0.066 M |

TABLE III: The decoder-side FLOPs and number of trainable parameters of different methods under compression ratios 1/8, 1/16, 1/32, and 1/64.

| Method | 1/8 FLOPs | 1/8 Params | 1/16 FLOPs | 1/16 Params | 1/32 FLOPs | 1/32 Params | 1/64 FLOPs | 1/64 Params |
|---|---|---|---|---|---|---|---|---|
| ISTA | 10.486 M | 0 | 5.243 M | 0 | 2.621 M | 0 | 1.311 M | 0 |
| MS4L2O | 20.978 M | 0.004 M | 10.492 M | 0.004 M | 5.249 M | 0.004 M | 2.628 M | 0.004 M |
| CsiNet | 3.809 M | 0.527 M | 3.547 M | 0.265 M | 3.416 M | 0.134 M | 3.351 M | 0.069 M |
| TiLISTA | 10.813 M | 0.033 M | 5.571 M | 0.033 M | 2.949 M | 0.033 M | 1.638 M | 0.033 M |
| TransNet | 17.883 M | 1.315 M | 17.359 M | 0.791 M | 17.097 M | 0.530 M | 16.966 M | 0.398 M |
| Proposed | 22.125 M | 0.119 M | 11.639 M | 0.119 M | 6.396 M | 0.119 M | 3.201 M | 0.119 M |

Figure 4: NMSE achieved by different methods versus compression ratios in an indoor scenario.

Fig. 4 plots the NMSE achieved by the proposed scheme and the five baseline methods versus the compression ratio in the indoor scenario. The traditional ISTA performs the worst because the CSI after the DFT transformation is not sparse enough. All the learning-based methods outperform ISTA, indicating that DL approaches are able to effectively compress and reconstruct the CSI. Among the five learning-based methods, the proposed Csi-L2O scheme achieves the best performance for all investigated compression ratios. For example, when the compression ratio is $1/16$, the proposed Csi-L2O outperforms the SOTA TransNet by 3.88 dB. It is also observed that the proposed Csi-L2O design outperforms MS4L2O by a large margin, and the performance gain is more obvious when the compression ratio is large. This indicates the effectiveness of the proposed learnable sampling matrix at the encoder and the angular-domain sparse transformation function at the decoder.

Figure 5: NMSE achieved by different methods versus compression ratios in an outdoor scenario.

In Fig. 5, we show the CSI recovery accuracy achieved by different methods versus the compression ratio in the outdoor scenario. As can be observed in Fig. 5, while the ISTA, MS4L2O, CsiNet, and TiLISTA methods suffer a prominent performance loss, the proposed L2O-based method still follows the trend of the SOTA method and achieves comparable performance. This indicates that, even in a complicated communication environment, the proposed feedback scheme can still be learned effectively from data thanks to the powerful learning capability of the LSTM and MLP.

V-C Complexity Comparison

We then show the number of FLOPs and the number of trainable parameters of different methods at the encoder side in Table II under different compression ratios. Due to its two consecutive attention-based encoding blocks and fully connected layers, TransNet entails the highest time and space complexity, which hinders its application in practice, especially on resource-constrained devices. For example, when the compression ratio is $1/16$, the proposed Csi-L2O requires only $1.5\%$ of the FLOPs of TransNet. ISTA, MS4L2O, and TiLISTA all employ a simple linear projection at the encoder and thus enjoy high computational efficiency. ISTA and MS4L2O utilize a Gaussian random matrix as the sampling matrix, so their number of trainable parameters is zero. In addition to one fully connected layer, CsiNet also adopts convolutional kernels, making its number of FLOPs and number of trainable parameters slightly higher than those of TiLISTA. It can be observed that the proposed Csi-L2O method has exactly the same encoder-side complexity as TiLISTA, which is the lowest among all the baselines. Furthermore, the decoder-side computational complexity is shown in Table III. It is observed that the number of FLOPs of Csi-L2O is lower than that of TransNet when the compression ratio is $1/16$ or lower. The complexity reduction is more pronounced for low compression ratios, indicating the superiority of the proposed method when the number of feedback bits is very limited. Besides, the number of trainable parameters of Csi-L2O is lower than that of TransNet at both the encoder and the decoder, resulting in a lower memory cost.
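The encoder-side figures for a purely linear projection can be cross-checked with a few lines of arithmetic. The sketch below assumes a bias-free linear layer acting on the flattened $2\times 32\times 32$ CSI and counts one FLOP per multiply-accumulate; under these assumptions the weight count and the FLOP count coincide and match the entries of the linear-projection rows in Table II. The function name is illustrative.

```python
import math


def linear_encoder_cost(cr_denominator, csi_shape=(2, 32, 32)):
    """Weights (equivalently, multiply-accumulates per sample) of a
    bias-free linear projection from N inputs to M = N / cr_denominator
    outputs."""
    n = math.prod(csi_shape)
    m = n // cr_denominator
    return n * m


for denom in (8, 16, 32, 64):
    cost = linear_encoder_cost(denom)
    print(f"CR 1/{denom}: {cost / 1e6:.3f} M")

# Share of TransNet's encoder FLOPs at CR 1/16 (17.072 M in Table II).
share = linear_encoder_cost(16) / 17.072e6
```

The resulting share is about 1.5%, in line with the comparison made in the text.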

TABLE IV: The bit-level CSI feedback NMSE (in dB) of different methods in an indoor scenario. $B$ is the number of quantization bits; "No Quant" denotes feedback without quantization.

| Method | 1/4 No Quant | 1/4 B = 3 | 1/4 B = 4 | 1/4 B = 5 | 1/4 B = 6 | 1/8 No Quant | 1/8 B = 3 | 1/8 B = 4 | 1/8 B = 5 | 1/8 B = 6 |
|---|---|---|---|---|---|---|---|---|---|---|
| ISTA | -4.27 | -1.39 | -1.71 | -2.39 | -2.98 | -3.10 | -0.89 | -1.01 | -1.43 | -2.07 |
| MS4L2O | -5.96 | -2.17 | -2.89 | -3.77 | -4.05 | -5.61 | -2.03 | -2.55 | -3.42 | -3.91 |
| CsiNet | -17.36 | -9.89 | -11.97 | -13.25 | -14.63 | -13.47 | -5.15 | -6.03 | -8.49 | -10.31 |
| TransNet | -32.38 | -19.32 | -23.51 | -27.00 | -28.97 | -22.91 | -10.17 | -12.88 | -15.64 | -19.04 |
| TiLISTA | -32.18 | -18.73 | -22.09 | -26.31 | -28.08 | -20.71 | -9.05 | -11.63 | -14.47 | -17.10 |
| Proposed | -34.74 | -21.77 | -24.96 | -27.83 | -30.97 | -26.25 | -15.41 | -17.94 | -20.88 | -23.85 |

| Method | 1/16 No Quant | 1/16 B = 3 | 1/16 B = 4 | 1/16 B = 5 | 1/16 B = 6 | 1/32 No Quant | 1/32 B = 3 | 1/32 B = 4 | 1/32 B = 5 | 1/32 B = 6 |
|---|---|---|---|---|---|---|---|---|---|---|
| ISTA | -1.47 | -0.40 | -0.68 | -0.94 | -1.25 | -0.52 | -0.28 | -0.33 | -0.41 | -0.49 |
| MS4L2O | -4.11 | -1.05 | -1.97 | -3.03 | -3.91 | -1.62 | -0.58 | -0.71 | -1.14 | -1.50 |
| CsiNet | -8.65 | -4.65 | -5.70 | -6.91 | -8.43 | -6.24 | -4.02 | -5.31 | -5.78 | -6.12 |
| TransNet | -15.00 | -9.70 | -11.86 | -14.10 | -14.87 | -10.49 | -8.89 | -9.48 | -9.92 | -10.26 |
| TiLISTA | -13.73 | -9.11 | -11.02 | -13.31 | -14.01 | -9.50 | -7.73 | -8.11 | -8.71 | -9.08 |
| Proposed | -18.88 | -13.94 | -15.98 | -17.79 | -18.61 | -13.43 | -11.90 | -12.40 | -12.99 | -13.31 |

Figure 6: NMSE achieved by different methods versus the number of iterations in an indoor scenario when the compression ratio is $1/16$.

V-D Convergence

Fig. 6 illustrates the performance comparison among several methods with different numbers of iterations. Although the ISTA and MS4L2O methods converge quickly, their reconstruction accuracy remains poor. In comparison, the proposed method achieves a significant gain in accuracy. Since the fully data-driven baselines, i.e., CsiNet and TransNet, use explicit neural networks whose outputs are obtained through a single forward propagation, they have no notion of iterations. It can be observed from Fig. 6 that the proposed Csi-L2O converges within 11 iterations, and that running the proposed method for 7 iterations already outperforms TransNet. It is also shown that although TiLISTA achieves performance comparable to TransNet in 10 iterations, it is not guaranteed to converge and fluctuates severely. When TiLISTA is trained with 20 iterations, the final NMSE, i.e., $-10.89$ dB, is even worse than that with 10 iterations, i.e., $-13.73$ dB, because a deeper deep unfolding network is harder to train. Therefore, we plot the convergence curve for the 10-iteration TiLISTA.

V-E Bit Level Quantization

In this subsection, we compare the reconstruction accuracy of different methods in bit-level CSI feedback. In practice, once the encoding and decoding modules are fixed, a quantization module is introduced to quantize the compressed codeword into binary bit streams [36]. In Table IV, we compare the bit-level CSI feedback performance of different methods under different compression ratios in the indoor scenario. A non-uniform Lloyd-Max quantizer is adopted as the quantization module [37]. In Table IV, $B$ denotes the number of quantization bits. As can be observed, the reconstruction accuracy improves as the number of quantization bits increases. In particular, Csi-L2O with $B=6$ exhibits performance similar to that of the original Csi-L2O without quantization. When the compression ratio is low, e.g., $1/32$, the performance loss due to quantization is marginal.
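To make the quantization step concrete, the following is a minimal sample-based sketch of Lloyd-Max quantizer design, which alternates between placing decision thresholds at the midpoints of neighbouring levels and moving each level to the centroid of its cell. It is a generic illustration under assumed settings (uniform initialization over the sample range, a fixed iteration count) and does not reproduce the exact quantizer configuration of [37]; all function names are illustrative.

```python
import bisect


def lloyd_max(samples, num_bits, iters=50):
    """Design 2**num_bits reconstruction levels from training samples."""
    n_levels = 2 ** num_bits
    lo, hi = min(samples), max(samples)
    # Start from uniformly spaced levels over the observed range.
    levels = [lo + (hi - lo) * (i + 0.5) / n_levels for i in range(n_levels)]
    for _ in range(iters):
        # Thresholds are midpoints between neighbouring levels.
        thresholds = [(a + b) / 2 for a, b in zip(levels, levels[1:])]
        cells = [[] for _ in range(n_levels)]
        for s in samples:
            cells[bisect.bisect_right(thresholds, s)].append(s)
        # Move each level to its cell centroid; empty cells keep their level.
        levels = [sum(c) / len(c) if c else lvl
                  for c, lvl in zip(cells, levels)]
    return levels


def quantize(value, levels):
    """Return the bin index (what is fed back) and its reconstruction level."""
    thresholds = [(a + b) / 2 for a, b in zip(levels, levels[1:])]
    idx = bisect.bisect_right(thresholds, value)
    return idx, levels[idx]


def distortion(samples, levels):
    """Mean squared quantization error over the samples."""
    return sum((s - quantize(s, levels)[1]) ** 2 for s in samples) / len(samples)
```

Each codeword element is then represented by the $B$-bit index returned by `quantize`, and the decoder maps the index back to the corresponding level before reconstruction.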

In practical scenarios, the compression ratio and the number of quantization bits $B$ together determine the overhead of CSI feedback. For example, if the feedback bitstream contains 1536 bits, there are two choices: a compression ratio of $1/4$ with 3 quantization bits, or a compression ratio of $1/8$ with 6 quantization bits. The NMSE of the former in the indoor scenario is $-21.77$ dB, while that of the latter is $-23.85$ dB. This provides guidance for practical deployment: even if the length of the feedback bitstream is fixed, the compression ratio and the number of quantization bits have to be selected jointly to achieve the best performance.
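The overhead arithmetic in this example can be verified directly. The sketch below assumes the $2\times 32\times 32$ CSI dimensions from Section V-A1, so the codeword length is 2048 times the compression ratio, and each codeword element costs $B$ bits; the function name is illustrative.

```python
import math
from fractions import Fraction


def feedback_bits(compression_ratio, bits_per_element, csi_shape=(2, 32, 32)):
    """Total feedback bits = (codeword length) x (bits per element)."""
    n = math.prod(csi_shape)
    codeword_len = n * Fraction(compression_ratio)
    if codeword_len.denominator != 1:
        raise ValueError("compression ratio must give an integer codeword length")
    return int(codeword_len) * bits_per_element


option_a = feedback_bits(Fraction(1, 4), 3)   # 512 elements x 3 bits
option_b = feedback_bits(Fraction(1, 8), 6)   # 256 elements x 6 bits
```

Both options yield the same 1536-bit budget, which is why the two NMSE values quoted above are directly comparable.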

Figure 7: NMSE achieved by different methods versus compression ratios in an indoor scenario.

V-F Multiple-Rate Feedback Scenarios

In practice, the compression ratio has to be adjusted according to the dynamic environment and the varying coherence time [22], which is referred to as multiple-rate CSI feedback. Fig. 7 shows the NMSE performance of the proposed method for multiple-rate CSI feedback. The proposed method in Fig. 7 is trained at a compression ratio of $1/16$ and directly tested in the other settings. The CsiNet in Fig. 7 is retrained each time the compression ratio changes. Two baselines specially designed for multiple-rate feedback are also compared, i.e., SM-CsiNet+ and PM-CsiNet+ [22]. SM-CsiNet+ is a serial multi-rate CSI feedback method in which different compression ratios share the first few layers of the neural network, and the output of the high-compression-ratio part is the input of the low-compression-ratio part. PM-CsiNet+ is a parallel multi-rate CSI feedback method in which the output for the low compression ratio is a part of the output for the high compression ratio. Fig. 7 shows that the proposed Csi-L2O achieves the best multi-rate feedback reconstruction accuracy among all the baselines when the compression ratio is above $1/64$. This verifies that the proposed method has good generalization ability. Once trained, the proposed Csi-L2O can be directly applied to different compression ratios without additional training.

TABLE V: The encoder-side FLOPs and number of trainable parameters of different methods in multiple-rate feedback scenarios.

| Method | Number of FLOPs | Number of Trainable Parameters |
|---|---|---|
| SM-CsiNet+ | 1.638 M | 1.222 M |
| PM-CsiNet+ | 1.466 M | 1.649 M |
| Proposed | 0.262 M | 0.262 M |

Table V then compares the encoder-side computational complexity of the proposed method with that of SM-CsiNet+ and PM-CsiNet+. Note that in the multiple-rate feedback case, all three methods have fixed complexity, i.e., the complexity is independent of the compression ratio. Table V shows that SM-CsiNet+ has the highest time complexity because it adopts the deepest neural network for compression. The number of FLOPs of PM-CsiNet+ is lower than that of SM-CsiNet+ because of the layer reuse in the parallel structure. The proposed method achieves the lowest number of FLOPs, nearly $16\%$ of that of SM-CsiNet+, showing its high computational efficiency. In terms of space complexity, since the proposed Csi-L2O only adopts a simple linear layer, it has the smallest number of trainable parameters.

VI Conclusions

In this paper, we developed a model-driven DL-based method, Csi-L2O, for CSI feedback in FDD massive MIMO systems. In contrast to the existing DL-based CSI feedback paradigm, i.e., fully data-driven methods, we proposed an innovative way to combine domain knowledge with DL. In particular, the codeword is generated via a learnable linear projection at the user side, while the full CSI is reconstructed at the BS side using an element-wise parameterized update rule. The proposed method features an encoder with extremely low complexity, offers performance that rivals SOTA solutions, and can adjust to multiple feedback rates without retraining the neural network. Simulation results demonstrated that the proposed Csi-L2O outperforms or matches the SOTA baselines in NMSE while requiring substantially fewer encoder-side FLOPs and trainable parameters. It would be interesting to extend the proposed Csi-L2O to other challenging communication applications, such as multi-cell massive MIMO systems [38], CSI feedback in movable antenna systems [39], and CSI feedback with time-varying channels [40].

Appendix A Proof of Theorem 1

Proof:

Before proving Theorem 1, we first introduce a lemma from [21, Lemma 1] to facilitate the proof.

Lemma 1.

For any operator $\mathbf{o}\in\mathcal{D}_{C}(\mathbb{R}^{m\times n})$ and any $\mathbf{x}^{[1]},\mathbf{y}^{[1]},\mathbf{x}^{[2]},\mathbf{y}^{[2]},\cdots,\mathbf{x}^{[m]},\mathbf{y}^{[m]}\in\mathbb{R}^{n}$, there exist matrices $\mathbf{J}_{1},\mathbf{J}_{2},\cdots,\mathbf{J}_{m}\in\mathbb{R}^{n\times n}$ such that

$$
\begin{aligned}
&\mathbf{o}(\mathbf{x}^{[1]},\mathbf{x}^{[2]},\cdots,\mathbf{x}^{[m]})-\mathbf{o}(\mathbf{y}^{[1]},\mathbf{y}^{[2]},\cdots,\mathbf{y}^{[m]}) \\
&=\sum_{j=1}^{m}\mathbf{J}_{j}(\mathbf{x}^{[j]}-\mathbf{y}^{[j]}),
\end{aligned} \tag{25}
$$

and

$$\|\mathbf{J}_{1}\|\leq\sqrt{n}C,\quad\|\mathbf{J}_{2}\|\leq\sqrt{n}C,\quad\cdots,\quad\|\mathbf{J}_{m}\|\leq\sqrt{n}C. \tag{26}$$

To prove Theorem 1, we denote

$$\hat{\mathbf{d}}^{[t]}=\mathbf{d}^{[t]}(\mathbf{x}^{*},\nabla f(\mathbf{x}^{*}),\mathbf{x}^{*},-\nabla f(\mathbf{x}^{*}),\mathbf{x}^{*},\nabla f(\mathbf{x}^{*})).$$

Then (15) can be written as

$$
\begin{aligned}
\mathbf{x}^{[t+1]}={}&\mathbf{x}^{[t]}-\mathbf{d}^{[t]}(\mathbf{x}^{[t]},\nabla f(\mathbf{x}^{[t]}),\mathbf{x}^{[t+1]},\mathbf{g}^{[t+1]},\mathbf{y}^{[t]},\nabla f(\mathbf{y}^{[t]})) \\
&+\mathbf{d}^{[t]}(\mathbf{x}^{*},\nabla f(\mathbf{x}^{*}),\mathbf{x}^{*},-\nabla f(\mathbf{x}^{*}),\mathbf{x}^{*},\nabla f(\mathbf{x}^{*}))-\hat{\mathbf{d}}^{[t]}.
\end{aligned}
$$

Applying Lemma 1, we have

$$
\begin{aligned}
\mathbf{x}^{[t+1]}=\mathbf{x}^{[t]}&-\mathbf{J}_{1}^{[t]}(\mathbf{x}^{[t]}-\mathbf{x}^{*})-\mathbf{J}_{2}^{[t]}(\mathbf{x}^{[t+1]}-\mathbf{x}^{*}) \\
&-\mathbf{J}_{3}^{[t]}(\mathbf{y}^{[t]}-\mathbf{x}^{*})-\hat{\mathbf{d}}^{[t]} \\
&-\mathbf{J}_{4}^{[t]}(\nabla f(\mathbf{x}^{[t]})-\nabla f(\mathbf{x}^{*})) \\
&-\mathbf{J}_{5}^{[t]}(\mathbf{g}^{[t+1]}+\nabla f(\mathbf{x}^{*})) \\
&-\mathbf{J}_{6}^{[t]}(\nabla f(\mathbf{y}^{[t]})-\nabla f(\mathbf{x}^{*})),
\end{aligned}
$$

where the matrices $\mathbf{J}_{j}^{[t]}$ $(1\leq j\leq 6)$ satisfy

$$\|\mathbf{J}_{j}^{[t]}\|\leq\sqrt{2N_{a}N_{t}}\,C,\quad\forall j=1,2,3,4,5,6.$$

Then, we perform some calculations and obtain

$$
\begin{aligned}
\mathbf{x}^{[t+1]}=\mathbf{x}^{[t]}&-\mathbf{J}_{1}^{[t]}(\mathbf{x}^{[t]}-\mathbf{x}^{*})-\mathbf{J}_{2}^{[t]}(\mathbf{x}^{[t+1]}-\mathbf{x}^{*}) \\
&-\mathbf{J}_{3}^{[t]}(\mathbf{y}^{[t]}-\mathbf{x}^{*})-\hat{\mathbf{d}}^{[t]} \\
&-(\mathbf{J}_{4}^{[t]}-\mathbf{J}_{5}^{[t]}+\mathbf{J}_{6}^{[t]})(\nabla f(\mathbf{x}^{[t]})-\nabla f(\mathbf{x}^{*})) \\
&-(\mathbf{J}_{5}^{[t]}-\mathbf{J}_{6}^{[t]})(\nabla f(\mathbf{x}^{[t]})-\nabla f(\mathbf{x}^{*})) \\
&-\mathbf{J}_{5}^{[t]}(\mathbf{g}^{[t+1]}+\nabla f(\mathbf{x}^{*}))-\mathbf{J}_{6}^{[t]}(\nabla f(\mathbf{y}^{[t]})-\nabla f(\mathbf{x}^{*})) \\
=\mathbf{x}^{[t]}&-\mathbf{J}_{1}^{[t]}(\mathbf{x}^{[t]}-\mathbf{x}^{*})-\mathbf{J}_{2}^{[t]}(\mathbf{x}^{[t+1]}-\mathbf{x}^{*}) \\
&-\mathbf{J}_{3}^{[t]}(\mathbf{y}^{[t]}-\mathbf{x}^{*})-\hat{\mathbf{d}}^{[t]} \\
&-(\mathbf{J}_{4}^{[t]}-\mathbf{J}_{5}^{[t]}+\mathbf{J}_{6}^{[t]})(\nabla f(\mathbf{x}^{[t]})-\nabla f(\mathbf{x}^{*}))
\end{aligned}
$$
(𝐉5[t]𝐉6[t])f(𝐱[t])𝐉5[t]𝐠[t+1]𝐉6[t]f(𝐲[t]).superscriptsubscript𝐉5delimited-[]𝑡superscriptsubscript𝐉6delimited-[]𝑡𝑓superscript𝐱delimited-[]𝑡superscriptsubscript𝐉5delimited-[]𝑡superscript𝐠delimited-[]𝑡1superscriptsubscript𝐉6delimited-[]𝑡𝑓superscript𝐲delimited-[]𝑡\displaystyle-(\mathbf{J}_{5}^{[t]}-\mathbf{J}_{6}^{[t]})\nabla f(\mathbf{x}^{% [t]})-\mathbf{J}_{5}^{[t]}~{}\mathbf{g}^{[t+1]}-\mathbf{J}_{6}^{[t]}\nabla f(% \mathbf{y}^{[t]}).- ( bold_J start_POSTSUBSCRIPT 5 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT [ italic_t ] end_POSTSUPERSCRIPT - bold_J start_POSTSUBSCRIPT 6 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT [ italic_t ] end_POSTSUPERSCRIPT ) ∇ italic_f ( bold_x start_POSTSUPERSCRIPT [ italic_t ] end_POSTSUPERSCRIPT ) - bold_J start_POSTSUBSCRIPT 5 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT [ italic_t ] end_POSTSUPERSCRIPT bold_g start_POSTSUPERSCRIPT [ italic_t + 1 ] end_POSTSUPERSCRIPT - bold_J start_POSTSUBSCRIPT 6 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT [ italic_t ] end_POSTSUPERSCRIPT ∇ italic_f ( bold_y start_POSTSUPERSCRIPT [ italic_t ] end_POSTSUPERSCRIPT ) .

Given any $\mathbf{B}^{[t]}\in\mathbb{R}^{2N_{a}N_{t}\times 2N_{a}N_{t}}$, as defined in (16), let

\begin{align*}
\mathbf{P}_{1}^{[t]} &= \mathbf{J}_{5}^{[t]}, \\
\mathbf{P}_{2}^{[t]} &= \mathbf{J}_{6}^{[t]}, \\
\mathbf{b}_{1}^{[t]} &= \mathbf{J}_{1}^{[t]}(\mathbf{x}^{[t]}-\mathbf{x}^{*})+\mathbf{J}_{2}^{[t]}(\mathbf{x}^{[t+1]}-\mathbf{x}^{*}) \\
&\quad +\mathbf{J}_{3}^{[t]}(\mathbf{y}^{[t]}-\mathbf{x}^{*})+\hat{\mathbf{d}}^{[t]} \\
&\quad +(\mathbf{J}_{4}^{[t]}-\mathbf{J}_{5}^{[t]}+\mathbf{J}_{6}^{[t]})(\nabla f(\mathbf{x}^{[t]})-\nabla f(\mathbf{x}^{*})) \\
&\quad +\mathbf{B}^{[t]}(\mathbf{y}^{[t]}-\mathbf{x}^{[t]}).
\end{align*}

Then we have

\begin{align*}
\mathbf{x}^{[t+1]} = {} & \mathbf{x}^{[t]}-(\mathbf{P}_{1}^{[t]}-\mathbf{P}_{2}^{[t]})\nabla f(\mathbf{x}^{[t]})-\mathbf{P}_{2}^{[t]}\nabla f(\mathbf{y}^{[t]}) \\
& -\mathbf{P}_{1}^{[t]}\,\mathbf{g}^{[t+1]}+\mathbf{B}^{[t]}(\mathbf{y}^{[t]}-\mathbf{x}^{[t]})-\mathbf{b}_{1}^{[t]},
\end{align*}

which exactly matches (16). The upper bounds on $\mathbf{J}_{j}^{[t]}$ $(1\leq j\leq 6)$ imply that $\mathbf{P}_{1}^{[t]}$ and $\mathbf{P}_{2}^{[t]}$ are bounded, i.e.,

$$\|\mathbf{P}_{1}^{[t]}\|\leq\sqrt{2N_{a}N_{t}}\,C,\quad\|\mathbf{P}_{2}^{[t]}\|\leq\sqrt{2N_{a}N_{t}}\,C,$$

and $\mathbf{b}_{1}^{[t]}$ is controlled by

\begin{equation}
\begin{aligned}
\|\mathbf{b}_{1}^{[t]}\|\leq{} & \sqrt{2N_{a}N_{t}}\,C\Big(\|\mathbf{x}^{[t]}-\mathbf{x}^{*}\|+\|\mathbf{x}^{[t+1]}-\mathbf{x}^{*}\| \\
& +\|\mathbf{y}^{[t]}-\mathbf{x}^{*}\|\Big)+\|\hat{\mathbf{d}}^{[t]}\|+\|\mathbf{B}^{[t]}\|\,\|\mathbf{y}^{[t]}-\mathbf{x}^{[t]}\| \\
& +3\sqrt{2N_{a}N_{t}}\,C\,\|\nabla f(\mathbf{x}^{[t]})-\nabla f(\mathbf{x}^{*})\|.
\end{aligned}
\tag{27}
\end{equation}
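The step from the definition of $\mathbf{b}_{1}^{[t]}$ to (27) can be spelled out as follows. This is only a sketch: it assumes that every $\mathbf{J}_{j}^{[t]}$, $1\leq j\leq 6$, satisfies the same bound $\sqrt{2N_{a}N_{t}}\,C$ that is stated above for $\mathbf{P}_{1}^{[t]}=\mathbf{J}_{5}^{[t]}$ and $\mathbf{P}_{2}^{[t]}=\mathbf{J}_{6}^{[t]}$, and that $\|\cdot\|$ is a submultiplicative norm.

```latex
% Triangle inequality and submultiplicativity applied term by term to b_1^{[t]}:
\begin{align*}
\|\mathbf{b}_{1}^{[t]}\|
\leq{} & \|\mathbf{J}_{1}^{[t]}\|\,\|\mathbf{x}^{[t]}-\mathbf{x}^{*}\|
        +\|\mathbf{J}_{2}^{[t]}\|\,\|\mathbf{x}^{[t+1]}-\mathbf{x}^{*}\|
        +\|\mathbf{J}_{3}^{[t]}\|\,\|\mathbf{y}^{[t]}-\mathbf{x}^{*}\|
        +\|\hat{\mathbf{d}}^{[t]}\| \\
       & +\|\mathbf{J}_{4}^{[t]}-\mathbf{J}_{5}^{[t]}+\mathbf{J}_{6}^{[t]}\|\,
          \|\nabla f(\mathbf{x}^{[t]})-\nabla f(\mathbf{x}^{*})\|
        +\|\mathbf{B}^{[t]}\|\,\|\mathbf{y}^{[t]}-\mathbf{x}^{[t]}\|, \\
% The factor 3 comes from bounding the three-term combination:
\|\mathbf{J}_{4}^{[t]}-\mathbf{J}_{5}^{[t]}+\mathbf{J}_{6}^{[t]}\|
\leq{} & \|\mathbf{J}_{4}^{[t]}\|+\|\mathbf{J}_{5}^{[t]}\|+\|\mathbf{J}_{6}^{[t]}\|
\leq 3\sqrt{2N_{a}N_{t}}\,C.
\end{align*}
```

Substituting $\|\mathbf{J}_{j}^{[t]}\|\leq\sqrt{2N_{a}N_{t}}\,C$ for $j=1,2,3$ into the first inequality, together with the second inequality, yields (27).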

Since we set $\|\mathbf{b}_{1}^{[t]}\|\to 0$ as $t\to\infty$, according to (27), we have

\begin{equation}
\begin{aligned}
\|\mathbf{x}^{[t]}-\mathbf{x}^{*}\|\to 0,\quad & \|\mathbf{x}^{[t+1]}-\mathbf{x}^{*}\|\to 0, \\
\|\mathbf{y}^{[t]}-\mathbf{x}^{*}\|\to 0,\quad & \|\mathbf{y}^{[t]}-\mathbf{x}^{[t]}\|\to 0.
\end{aligned}
\tag{28}
\end{equation}

The proof for Theorem 1 is thus completed. ∎
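The regrouping used in the last steps of the proof is purely algebraic, so it can be sanity-checked numerically. The following is a minimal NumPy sketch, not part of the original work: all matrices and vectors are random stand-ins, the gradient values are treated as arbitrary vectors, and the function names (`make_terms`, `update_expanded`, `update_compact`, `b1_term`, `bound_rhs`) are hypothetical. It checks that the expanded update and the compact form with $\mathbf{P}_{1}^{[t]}=\mathbf{J}_{5}^{[t]}$, $\mathbf{P}_{2}^{[t]}=\mathbf{J}_{6}^{[t]}$ agree, and that a bound of the shape of (27) holds when the largest spectral norm among the $\mathbf{J}_{j}^{[t]}$ is used in place of $\sqrt{2N_{a}N_{t}}\,C$.

```python
import numpy as np


def make_terms(n, seed=0):
    """Draw random stand-ins for every quantity in the update (all hypothetical)."""
    rng = np.random.default_rng(seed)
    J = [rng.standard_normal((n, n)) for _ in range(6)]  # J_1 ... J_6
    B = rng.standard_normal((n, n))
    names = ["x", "x_next", "y", "x_star", "d_hat", "g_next",
             "grad_x", "grad_y", "grad_star"]
    v = {name: rng.standard_normal(n) for name in names}
    return J, B, v


def shared_terms(J, v):
    """J1(x-x*) + J2(x_next-x*) + J3(y-x*) + d_hat + (J4-J5+J6)(grad_x-grad_star)."""
    J1, J2, J3, J4, J5, J6 = J
    return (J1 @ (v["x"] - v["x_star"])
            + J2 @ (v["x_next"] - v["x_star"])
            + J3 @ (v["y"] - v["x_star"])
            + v["d_hat"]
            + (J4 - J5 + J6) @ (v["grad_x"] - v["grad_star"]))


def update_expanded(J, v):
    """Right-hand side written directly in terms of J_1 ... J_6."""
    J5, J6 = J[4], J[5]
    return (v["x"] - shared_terms(J, v)
            - (J5 - J6) @ v["grad_x"] - J5 @ v["g_next"] - J6 @ v["grad_y"])


def b1_term(J, B, v):
    """b_1 as defined in the proof: the shared terms plus B (y - x)."""
    return shared_terms(J, v) + B @ (v["y"] - v["x"])


def update_compact(J, B, v):
    """Compact form with P1 = J5, P2 = J6 and the bias term b_1."""
    P1, P2 = J[4], J[5]
    return (v["x"] - (P1 - P2) @ v["grad_x"] - P2 @ v["grad_y"]
            - P1 @ v["g_next"] + B @ (v["y"] - v["x"]) - b1_term(J, B, v))


def bound_rhs(J, B, v):
    """Upper bound shaped like (27); c stands in for sqrt(2 N_a N_t) C (assumption)."""
    c = max(np.linalg.norm(Jj, 2) for Jj in J)  # largest spectral norm of the J_j
    dist = sum(np.linalg.norm(v[k] - v["x_star"]) for k in ("x", "x_next", "y"))
    return (c * dist + np.linalg.norm(v["d_hat"])
            + np.linalg.norm(B, 2) * np.linalg.norm(v["y"] - v["x"])
            + 3 * c * np.linalg.norm(v["grad_x"] - v["grad_star"]))


if __name__ == "__main__":
    J, B, v = make_terms(n=8, seed=0)
    print("forms agree:", np.allclose(update_expanded(J, v), update_compact(J, B, v)))
    print("bound holds:", np.linalg.norm(b1_term(J, B, v)) <= bound_rhs(J, B, v))
```

Because $\mathbf{B}^{[t]}(\mathbf{y}^{[t]}-\mathbf{x}^{[t]})$ is added explicitly and subtracted again inside $\mathbf{b}_{1}^{[t]}$, the agreement holds for any choice of $\mathbf{B}^{[t]}$, which is why the proof can state the result for any such matrix.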

References

  • [1] F. Boccardi, R. W. Heath, A. Lozano, T. L. Marzetta, and P. Popovski, “Five disruptive technology directions for 5G,” IEEE Commun. Mag., vol. 52, no. 2, pp. 74–80, Feb. 2014.
  • [2] Z. Wang, J. Zhang, H. Du, D. Niyato, S. Cui, B. Ai, M. Debbah, K. B. Letaief, and H. V. Poor, “A tutorial on extremely large-scale MIMO for 6G: Fundamentals, signal processing, and applications,” IEEE Commun. Surv. & Tut., pp. 1–1, to appear, 2024.
  • [3] C.-X. Wang, X. You, X. Gao, X. Zhu, Z. Li, C. Zhang, H. Wang, Y. Huang, Y. Chen, H. Haas, J. S. Thompson, E. G. Larsson, M. D. Renzo, W. Tong, P. Zhu, X. Shen, H. V. Poor, and L. Hanzo, “On the road to 6G: Visions, requirements, key technologies, and testbeds,” IEEE Commun. Surv. & Tut., vol. 25, no. 2, pp. 905–974, Feb. 2023.
  • [4] J.-C. Shen, J. Zhang, K.-C. Chen, and K. B. Letaief, “High-dimensional CSI acquisition in massive MIMO: Sparsity-inspired approaches,” IEEE Systems Journal, vol. 11, no. 1, pp. 32–40, Mar. 2017.
  • [5] X. Rao and V. K. N. Lau, “Distributed compressive CSIT estimation and feedback for FDD multi-user massive MIMO systems,” IEEE Trans. Signal Process., vol. 62, no. 12, pp. 3261–3271, June 2014.
  • [6] Y. Ma, W. Yu, X. Yu, J. Zhang, S. Song, and K. B. Letaief, “Lightweight and flexible deep equilibrium learning for CSI feedback in FDD massive MIMO,” in IEEE Int. Conf. Mach. Learn. Commun. Netw. (ICMLCN), Stockholm, Sweden, May 2024.
  • [7] K. B. Letaief, W. Chen, Y. Shi, J. Zhang, and Y.-J. A. Zhang, “The roadmap to 6G: AI empowered wireless networks,” IEEE Commun. Mag., vol. 57, no. 8, pp. 84–90, Aug. 2019.
  • [8] J. Guo, C.-K. Wen, S. Jin, and G. Y. Li, “Overview of deep learning-based CSI feedback in massive MIMO systems,” IEEE Trans. Commun., vol. 70, no. 12, pp. 8017–8045, Dec. 2022.
  • [9] C.-K. Wen, W.-T. Shih, and S. Jin, “Deep learning for massive MIMO CSI feedback,” IEEE Wireless Commun. Lett., vol. 7, no. 5, pp. 748–751, Oct. 2018.
  • [10] Z. Cao, W.-T. Shih, J. Guo, C.-K. Wen, and S. Jin, “Lightweight convolutional neural networks for CSI feedback in massive MIMO,” IEEE Commun. Lett., vol. 25, no. 8, pp. 2624–2628, Aug. 2021.
  • [11] Y. Cui, A. Guo, and C. Song, “TransNet: Full attention network for CSI feedback in FDD massive MIMO system,” IEEE Wireless Commun. Lett., vol. 11, no. 5, pp. 903–907, May 2022.
  • [12] R. Tang, A. Adhikari, and J. Lin, “FLOPs as a direct optimization objective for learning sparse neural networks,” in Proc. Advances Neural Inf. Process. Syst., Montreal, Canada, Dec. 2018, pp. 1–5.
  • [13] J. Zhang, B. Chen, R. Xiong, and Y. Zhang, “Physics-inspired compressive sensing: Beyond deep unrolling,” IEEE Signal Process. Mag., vol. 40, no. 1, pp. 58–72, Jan. 2023.
  • [14] Y. Ma, Y. Shen, X. Yu, J. Zhang, S. H. Song, and K. B. Letaief, “Learn to communicate with neural calibration: Scalability and generalization,” IEEE Trans. Wireless Commun., vol. 21, no. 11, pp. 9947–9961, Nov. 2022.
  • [15] V. Monga, Y. Li, and Y. C. Eldar, “Algorithm unrolling: Interpretable, efficient deep learning for signal and image processing,” IEEE Signal Process. Mag., vol. 38, no. 2, pp. 18–44, Mar. 2021.
  • [16] H. He, S. Jin, C. Wen, F. Gao, G. Y. Li, and Z. Xu, “Model-driven deep learning for physical layer communications,” IEEE Wireless Commun., vol. 26, no. 5, pp. 77–83, Oct. 2019.
  • [17] H. He, C.-K. Wen, S. Jin, and G. Y. Li, “Model-driven deep learning for MIMO detection,” IEEE Trans. Signal Process., vol. 68, pp. 1702–1715, Feb. 2020.
  • [18] W. Yu, Y. Shen, H. He, X. Yu, S. Song, J. Zhang, and K. B. Letaief, “An adaptive and robust deep learning framework for THz ultra-massive MIMO channel estimation,” IEEE J. Sel. Topics Signal Process., pp. 1–16, July 2023.
  • [19] Y. Wang, X. Chen, H. Yin, and W. Wang, “Learnable sparse transformation-based massive MIMO CSI recovery network,” IEEE Commun. Lett., vol. 24, no. 7, pp. 1468–1471, July 2020.
  • [20] Z. Hu, G. Liu, Q. Xie, J. Xue, D. Meng, and D. Gündüz, “A learnable optimization and regularization approach to massive MIMO CSI feedback,” IEEE Trans. Wireless Commun., vol. 23, no. 1, pp. 104–116, Jan. 2024.
  • [21] J. Liu, X. Chen, Z. Wang, W. Yin, and H. Cai, “Towards constituting mathematical structures for learning to optimize,” in Proc. 40th Int. Conf. Mach. Learn. (ICML), Honolulu, Hawaii, USA, July 2023.
  • [22] J. Guo, C.-K. Wen, S. Jin, and G. Y. Li, “Convolutional neural network-based multiple-rate compressive sensing for massive MIMO CSI feedback: Design, simulation, and analysis,” IEEE Trans. Wireless Commun., vol. 19, no. 4, pp. 2827–2840, Apr. 2020.
  • [23] T. Chen, X. Chen, W. Chen, H. Heaton, J. Liu, Z. Wang, and W. Yin, “Learning to optimize: A primer and a benchmark,” J. Mach. Learn. Res., vol. 23, no. 189, pp. 1–59, June 2022.
  • [24] M. Andrychowicz, M. Denil, S. Gomez, M. W. Hoffman, D. Pfau, T. Schaul, B. Shillingford, and N. De Freitas, “Learning to learn by gradient descent by gradient descent,” in Proc. Adv. Neural Inf. Process. Syst., Barcelona, Spain, Dec. 2016, pp. 3988–3996.
  • [25] S. Wang, A. Pathania, and T. Mitra, “Neural network inference on mobile SoCs,” IEEE Design & Test, vol. 37, no. 5, pp. 50–57, Jan. 2020.
  • [26] J. Guo, L. Wang, F. Li, and J. Xue, “CSI feedback with model-driven deep learning of massive MIMO systems,” IEEE Commun. Lett., vol. 26, no. 3, pp. 547–551, Mar. 2022.
  • [27] J. Zhang and B. Ghanem, “ISTA-Net: Interpretable optimization-inspired deep network for image compressive sensing,” in Proc. IEEE Conf. Computer Vision Pattern Recognition, Salt Lake City, UT, USA, June 2018, pp. 1828–1837.
  • [28] E. J. Candes, J. K. Romberg, and T. Tao, “Stable signal recovery from incomplete and inaccurate measurements,” Commun. Pure Applied Mathematics, vol. 59, no. 8, pp. 1207–1223, 2006.
  • [29] B. Bah and J. Tanner, “Improved bounds on restricted isometry constants for Gaussian matrices,” SIAM J. Matrix Anal. Appl., vol. 31, no. 5, pp. 2882–2898, 2010.
  • [30] K. Lv, S. Jiang, and J. Li, “Learning gradient descent: Better generalization and longer horizons,” in Proc. Int. Conf. Mach. Learn. (ICML), Sydney, Australia, Aug. 2017, pp. 2247–2255.
  • [31] D. Bertsekas, Convex optimization algorithms.   Athena Scientific, 2015.
  • [32] R. T. Rockafellar, “Monotone operators and the proximal point algorithm,” SIAM J. Control and Optimization, vol. 14, no. 5, pp. 877–898, 1976.
  • [33] L. Liu, C. Oestges, J. Poutanen, K. Haneda, P. Vainikainen, F. Quitin, F. Tufvesson, and P. D. Doncker, “The COST 2100 MIMO channel model,” IEEE Wireless Commun., vol. 19, no. 6, pp. 92–99, Dec. 2012.
  • [34] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Proc. Int. Conf. Learning Representations, May 2014.
  • [35] GitHub, “Thop: Pytorch-OpCounter,” https://github.com/Lyken17/pytorch-OpCounter.
  • [36] T. Chen, J. Guo, S. Jin, C.-K. Wen, and G. Y. Li, “A novel quantization method for deep learning-based massive MIMO CSI feedback,” in 2019 IEEE Global Conf. Signal Inf. Process. (GlobalSIP), Ottawa, ON, Canada, Jan. 2019, pp. 1–5.
  • [37] S. Lloyd, “Least squares quantization in PCM,” IEEE Trans. Inf. Theory, vol. 28, no. 2, pp. 129–137, Mar. 1982.
  • [38] Y. Ma, X. Yu, J. Zhang, S. Song, and K. B. Letaief, “Augmented deep unfolding for downlink beamforming in multi-cell massive MIMO with limited feedback,” in Proc. 2022 IEEE Global Commun. Conf., Rio de Janeiro, Brazil, Dec. 2022, pp. 1721–1726.
  • [39] Z. Xiao, S. Cao, L. Zhu, Y. Liu, B. Ning, X.-G. Xia, and R. Zhang, “Channel estimation for movable antenna communication systems: A framework based on compressed sensing,” IEEE Trans. Wireless Commun., pp. 1–1, to appear, 2024.
  • [40] Z. Liu, M. del Rosario, and Z. Ding, “A Markovian model-driven deep learning framework for massive MIMO CSI feedback,” IEEE Trans. Wireless Commun., vol. 21, no. 2, pp. 1214–1228, Feb. 2022.