Reconstructing Human Pose from Inertial Measurements:
A Generative Model-based Compressive Sensing Approach

Nguyen Quang Hieu, Dinh Thai Hoang, Diep N. Nguyen, and Mohammad Abu Alsheikh

This work was supported in part by the Australian Research Council under grants DE200100863 and DE210100651. N. Q. Hieu is with the School of Information Technology and Systems, University of Canberra, Canberra, ACT 2617, Australia, and also with the School of Electrical and Data Engineering, University of Technology Sydney, NSW 2007, Australia (emails: [email protected]; [email protected]). D. T. Hoang and D. N. Nguyen are with the School of Electrical and Data Engineering, University of Technology Sydney, NSW 2007, Australia (emails: [email protected]; [email protected]). M. A. Alsheikh is with the Faculty of Science & Technology, University of Canberra, Canberra, ACT 2617, Australia (email: [email protected]).
Abstract

The ability to sense, localize, and estimate the 3D position and orientation of the human body is critical in virtual reality (VR) and extended reality (XR) applications. This becomes more important and challenging with the deployment of VR/XR applications over the next generation of wireless systems such as 5G and beyond. In this paper, we propose a novel framework that can reconstruct the 3D human body pose of the user given sparse measurements from Inertial Measurement Unit (IMU) sensors over a noisy wireless environment. Specifically, our framework enables reliable transmission of compressed IMU signals through noisy wireless channels and effective recovery of such signals at the receiver, e.g., an edge server. This task is very challenging due to the constraints of transmit power, recovery accuracy, and recovery latency. To address these challenges, we first develop a deep generative model at the receiver to recover the data from linear measurements of IMU signals. The linear measurements of the IMU signals are obtained by a linear projection with a measurement matrix based on the compressive sensing theory. The key to the success of our framework lies in the novel design of the measurement matrix at the transmitter, which can not only satisfy power constraints for the IMU devices but also obtain a highly accurate recovery for the IMU signals at the receiver. This can be achieved by extending the set-restricted eigenvalue condition of the measurement matrix and combining it with an upper bound for the power transmission constraint. Our framework can achieve robust performance for recovering 3D human poses from noisy compressed IMU signals. Additionally, our pre-trained deep generative model achieves signal reconstruction accuracy comparable to an optimization-based approach, i.e., Lasso, but is an order of magnitude faster.

Index Terms:
Compressive sensing, generative models, inertial measurement units, human pose estimation, edge computing.

I Introduction

I-A Motivation

The ability to estimate human body movements plays a key role in emerging human-computer interaction paradigms such as virtual reality (VR) and extended reality (XR) [1]. By correctly estimating the 3D position and orientation of the human body, VR/XR applications such as gaming, virtual offices, and smart factories can offer a more interactive and immersive experience for users. Highly accurate solutions for estimating 3D human movements usually rely on images or videos, which typically require calibrated multi-camera systems [1, 2]. However, multi-camera systems are limited in capturing outdoor human activities (e.g., due to sensitive information conveyed in the images/videos) and degrade severely under poor lighting conditions [1]. Specifically, for VR/XR applications deployed over wireless systems, e.g., 5G and beyond, leveraging such images and videos from multi-camera systems for human body estimation purposes is costly in terms of bandwidth utilization and computing efficiency [3, 4]. This demands a more effective approach to achieve highly accurate estimation of human body movements in VR/XR applications deployed over wireless systems [5].

Fortunately, the inertial measurement unit (IMU) (i.e., accelerometer, gyroscope, and magnetometer) offers a promising solution to this problem. IMU-based systems do not suffer from the limitations of camera-based systems. The IMU sensors can track human movements by measuring the acceleration and orientation of human body parts, e.g., head orientation or arm/leg movement, regardless of sensitive image information and lighting conditions, making them more suitable for indoor and outdoor VR/XR applications [6, 7]. As the IMU sensors are typically worn on the body, e.g., wrists, head, or ankles, the information measured from the IMU can help to track the movement of the body segments relative to each other. For example, utilizing IMU information such as the orientation of VR headsets can help the system better predict the user preferences in VR streaming applications [8, 9]. Moreover, acceleration readings from the IMU sensors can help to track user step count, thereby increasing the accuracy of outdoor pedestrian localization [10]. Furthermore, combining IMU information with a kinematic model of the human body can simulate the entire body movement of the user in a complete positioning and sensing system [7, 11]. With such enormous potential, the IMU sensors have been widely deployed as a standard setting inside mobile phones, tablets, VR headsets, and VR controllers.

I-B Related Works

Unlike solutions for reconstructing movements of independent parts of the human body, e.g., head or arm, estimating a full body movement of the user usually requires a set of IMU sensors placed on different parts of the body or attached to a suit [7]. With a set of IMU sensors, ranging from 3 to 17 sensors, the full body movements can be fully reconstructed with the help of optimization-based techniques [12, 6] and learning-based techniques [13, 14, 15]. In [12], a Kalman Filter was utilized to correct the kinematics of the 3D human model, given the joint uncertainties of sensor noise, angular velocity, and acceleration of the IMU sensors. In [6], the authors proposed a new optimization approach based on exponential mapping, which transforms the orientation and acceleration values into equivalent energy functions. After carefully calibrating between the IMU sensors’ coordinate frames and the 3D human body’s coordinate frames, the optimization objective can be formulated as minimizing the set of energy functions over the entire sequence of collected data.

Different from offline optimization approaches [12, 6], learning-based approaches can achieve real-time estimation based on pre-trained deep learning models [13, 14, 15]. In [13], the authors proposed a deep learning approach based on a recurrent neural network that trains on the entire sequence of data at the training phase. During the testing phase, the pre-trained model can estimate the corrected body pose of the user in a shorter time window. In [14], the authors extended this idea by using a recurrent neural network combined with a physics-aware motion optimizer that enhances the tracking accuracy for a longer time window. The authors in [15] reported similar advantages of using a physics-aware motion optimizer with fewer IMU sensors being used.

I-C Challenges and Proposed Solutions

Although there has been significant effort in improving the precision of human pose estimation in 3D environments with IMU sensors [12, 7, 6, 13, 14, 15], there is a lack of human pose estimation approaches for VR/XR applications deployed over wireless networks, where the estimation ability can be strictly constrained by channel quality, power transmission, and the tradeoff between latency and accuracy of the solution. Deploying human pose estimation frameworks over wireless networks is a non-trivial problem as the transmitted data is more exposed to channel noise, e.g., due to channel quality and channel interference. Existing works overlook the presence of noise in the received IMU data, even though such noise may significantly degrade the reconstruction accuracy of the human pose estimation task, resulting in a poor quality of experience for the user in a virtual environment. In addition, the current approaches do not consider the potential redundancy of the IMU data before transmitting it to the receiver. The potential redundancy of information from each IMU sensor, such as orientation and acceleration values at high frequencies (see Fig. 2), can be exploited to further reduce the number of data samples that need to be transmitted. As a result, exploiting the data redundancy can enhance channel utilization and reduce power consumption for the wireless systems [16].

To the best of our knowledge, there is a lack of studies that address the above problems, i.e., the noisy IMU data transmission and the potential redundancy of the IMU data in reconstructing human pose over wireless systems. To address these problems, we propose a novel framework based on compressive sensing and generative modeling. On the one hand, the compressive sensing technique is utilized to down-sample the IMU signals before transmitting the signals over the wireless channel [17]. Based on rigorous down-sampling (or projection) techniques, the compressive sensing framework can reduce the redundancy in the IMU signals. As a result, this approach can not only reduce the energy consumption associated with IMU data acquisition and transmission but also enhance channel utilization as the transmitted data is in a more compressed form. On the other hand, a generative model (e.g., a variational auto-encoder [18]) deployed at the receiver can help the receiver make a robust estimation of the human body pose (e.g., through data denoising and data recovery capabilities), given the noisy compressed IMU measurements transmitted over a wireless channel. Unlike the optimization-based techniques [12, 6] and learning-based techniques [13, 14, 15], our proposed generative model can handle noisy data more effectively and can also exploit the potential sparsity patterns in the data. To summarize, the main contributions of our work are as follows:

  • We propose an innovative framework based on compressive sensing and generative modeling techniques for human pose estimation from IMU sensors toward VR/XR applications deployed over wireless systems. The proposed framework can accurately recover the original IMU signals from noisy compressed signals transmitted over a wireless channel. The combination of compressive sensing and generative modeling generates potential benefits to the system such as enhancing channel utilization, effective data sensing, and power-efficient wireless communications.

  • We develop a novel design for the measurement matrix at the transmitter, which helps to transform the high-dimensional signal into a compressed form. Our measurement matrix design extends the set-restricted eigenvalue condition of existing generative model-based compressive sensing approaches to a general setting, which considers the impacts of a wireless communication channel on data recovery. With rigorous analysis, we prove that our proposed measurement matrix enables our proposed framework to outperform other deep learning and optimization approaches, in terms of accuracy and latency of signal reconstruction process.

  • We show that the proposed framework can achieve signal reconstruction accuracy comparable to the optimization-based approach, i.e., Lasso, while being an order of magnitude faster than Lasso. The fast reconstruction ability of the generative model makes it a promising solution for VR/XR applications with stringent latency requirements.

  • We demonstrate a practical use case of the generative model that can generate missing IMU signals, thus creating synthetic body movements for the users without using input IMU signals. This ability of the generative model is very useful for potential VR/XR applications over wireless systems as the missing input data usually happens due to the lossy nature of the wireless environment.

Figure 1: An illustration of our proposed system model. A set of synchronized IMU sensors produces a sequence of data, e.g., orientation and acceleration, and compressive sensing down-samples the data sequence into a shorter sequence. The down-sampled sequence of IMU data is transmitted over a noisy channel. The receiver uses a deep generative model to recover the original data sequence from received signals.

The organization of the paper is as follows. Section II describes the overview of the system model and preliminaries of compressive sensing and generative modeling. In Section III, we formulate the problem as a reconstruction error minimization problem, subject to a power transmission constraint. In Section IV, we extensively evaluate the performance of the proposed framework against other baselines, such as an optimization-based approach and a deep learning-based approach. We also show that the proposed generative model can generate missing IMU data features, thus directly creating smooth synthetic body movements for the users. Finally, Section V concludes the paper.

II System Overview and Preliminaries

The proposed system model is illustrated in Fig. 1. At the transmitter side, the user’s body is equipped with a set of 17 IMU sensors placed on standard positions as in commercial systems [7]. The set of synchronized IMU sensors produces a sequence of data, e.g., orientation and acceleration, which is usually aggregated together at a central IMU node (e.g., IMU sensor placed on the user’s spine). Compressive sensing down-samples the data sequence into a shorter sequence through matrix multiplication. The down-sampled data sequence is transmitted over a wireless noisy channel, e.g., a Gaussian channel. On the receiver side, the edge server uses a deep generative model to recover the original data sequence from the noisy down-sampled data sequence. From the recovered IMU data and a kinematic human body model (e.g., SMPL [19]), the generative model can further generate the 3D avatar model with the corrected pose.

As described, the proposed framework consists of two main components that are (i) compressive sensing for the transmitter-receiver communication and (ii) a generative model for recovering the signals at the receiver, i.e., the edge server. In the following, we describe the fundamentals of compressive sensing, generative models, and generative model-based compressive sensing for an end-to-end learning system.

II-A Compressive Sensing

As illustrated in Fig. 1, the sequence of data is a real-valued, finite-length, one-dimensional signal $\mathbf{x}^{*}\in\mathbb{R}^{n}$. With compressive sensing, we want to down-sample the signal $\mathbf{x}^{*}$ before transmitting it to the receiver. For that, we use a measurement matrix $\mathbf{A}\in\mathbb{R}^{m\times n}$ to make a linear projection from the higher-dimensional vector $\mathbf{x}^{*}\in\mathbb{R}^{n}$ to a lower-dimensional vector $\mathbf{y}\in\mathbb{R}^{m}$ ($m<n$). Usually, $n$ is referred to as the length of the original vector and $m$ is the number of measurements from that vector. In particular, the $m$-dimensional signal being transmitted over the channel is:

$$\mathbf{y}=\mathbf{A}\mathbf{x}^{*}. \tag{1}$$

The received signal at the receiver can be corrupted by noise. In the case of a Gaussian channel, the received signal at the receiver is [20, 17]:

$$\mathbf{\hat{y}}=\mathbf{A}\mathbf{x}^{*}+\boldsymbol{\eta}, \tag{2}$$

where $\boldsymbol{\eta}\in\mathbb{R}^{m}$ is a Gaussian noise vector with zero mean and $\sigma_{N}$ standard deviation, i.e., element $\eta_{i}$ ($i=1,2,\ldots,m$) of $\boldsymbol{\eta}$ follows a Gaussian distribution $\eta_{i}\sim\mathcal{N}(0,\sigma_{N}^{2})$. As observed from equation (2), the signal $\mathbf{\hat{y}}\in\mathbb{R}^{m}$ is a compressed form of $\mathbf{x}^{*}\in\mathbb{R}^{n}$. To recover the signal $\mathbf{x}^{*}$ from the received signal $\mathbf{\hat{y}}$, the receiver needs to solve the following quadratically constrained optimization problem [20]:

$$\begin{aligned}
\mathcal{P}_{0}:\quad &\min_{\mathbf{x}}\ \|\mathbf{x}\|_{1}, &&\text{(3a)}\\
&\text{subject to}\ \ \|\mathbf{A}\mathbf{x}-\mathbf{\hat{y}}\|_{2}\leq\|\boldsymbol{\eta}\|_{2}, &&\text{(3b)}
\end{aligned}$$

where the term $\|\mathbf{x}\|_{p}$ denotes the $l_{p}$ norm ($p=0,1,2,\ldots$) of the vector $\mathbf{x}$, i.e., [20]

$$\|\mathbf{x}\|_{p}=\Big(\sum_{j=1}^{n}|x_{j}|^{p}\Big)^{1/p}. \tag{4}$$
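
As a minimal sketch of (1), (2), and (4), the following NumPy snippet projects a signal with a measurement matrix, adds Gaussian channel noise, and evaluates the $l_p$ norm. The sizes `n = 64` and `m = 16`, the random stand-in for the IMU signal, the $1/\sqrt{m}$ scaling of the Gaussian matrix, and the helper names `lp_norm` and `measure` are illustrative assumptions; this is not the measurement matrix design proposed in this paper.

```python
import numpy as np

def lp_norm(x, p):
    """l_p norm as in (4) for p >= 1; for p = 0, count the nonzero entries."""
    x = np.asarray(x, dtype=float)
    if p == 0:
        return float(np.count_nonzero(x))
    return float(np.sum(np.abs(x) ** p) ** (1.0 / p))

def measure(A, x_star, sigma_n, rng):
    """Return (y, y_hat): the noiseless projection (1) and the noisy signal (2)."""
    y = A @ x_star
    eta = rng.normal(0.0, sigma_n, size=y.shape)  # eta_i ~ N(0, sigma_n^2)
    return y, y + eta

rng = np.random.default_rng(0)
n, m = 64, 16                                  # assumed sizes with m < n
x_star = rng.standard_normal(n)                # stand-in for an IMU signal
A = rng.standard_normal((m, n)) / np.sqrt(m)   # Gaussian matrix (assumed scaling)
y, y_hat = measure(A, x_star, sigma_n=0.1, rng=rng)

print("transmitted dimension:", y.shape[0], "of", n)
print("noise energy ||y_hat - y||_2:", lp_norm(y_hat - y, 2))
```

The printed noise energy plays the role of $\|\boldsymbol{\eta}\|_{2}$ in constraint (3b); in practice the receiver would only know a bound on it.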

The problem $\mathcal{P}_{0}$ in (3) forms an underdetermined system, that is, a system that admits multiple solutions. To guarantee the unique recovery of the signal, compressive sensing relies on two main assumptions about the signal $\mathbf{x}^{*}$ and the measurement matrix $\mathbf{A}$. First, the signal $\mathbf{x}^{*}$ is sparse. Second, the measurement matrix $\mathbf{A}$ satisfies either the Restricted Isometry Property (RIP) or the Restricted Eigenvalue Condition (REC). The definitions of sparsity, RIP, and REC are as follows [20, 17].

Definition 1 (Sparsity).

The support of a vector $\mathbf{x}\in\mathbb{R}^{n}$ is the index set of its nonzero entries, i.e.,

$$\operatorname{supp}(\mathbf{x}):=\big\{j\in\{1,2,\ldots,n\}:x_{j}\neq 0\big\}.$$

The vector $\mathbf{x}\in\mathbb{R}^{n}$ is called $k$-sparse if at most $k$ of its entries are nonzero, i.e., if

$$\|\mathbf{x}\|_{0}=\mathbf{card}\big(\operatorname{supp}(\mathbf{x})\big)\leq k,$$

where $\mathbf{card}(\cdot)$ is the cardinality (number of elements) and $\|\cdot\|_{0}$ is the $l_{0}$ norm.

In practice, the sparsity property of the signal of interest $\mathbf{x}$ is usually relaxed to nearly $k$-sparse, meaning that $n-k$ entries of the vector $\mathbf{x}$ are approximately zero. In Fig. 2, we illustrate the nearly $k$-sparse acceleration data from the IMU dataset in [13]. Note that we use the Fast Fourier Transform (FFT) in Fig. 2 for illustration purposes only, and our proposed learning algorithm will not utilize the FFT.

Figure 2: Illustration of acceleration reading from an IMU sensor placed on the left wrist of the user (top figure) and the Fast Fourier Transform (FFT) of the x-axis acceleration data (bottom figure). The FFT reveals the nearly $k$-sparse property of the IMU signal, in which a few low-frequency coefficients have dominant values. As a result, the redundancy of the data can be approximated by considering the $k$ largest coefficients and assuming the remaining coefficients are zero.
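
The nearly $k$-sparse behavior described for Fig. 2 can be illustrated on a synthetic signal. The sketch below is not based on the dataset in [13]: the signal (two slow sinusoids plus small noise), the 60 Hz sampling rate, and the helper name `energy_in_top_k` are assumptions made purely for illustration. It reports the fraction of one-sided FFT energy carried by the $k$ largest coefficients.

```python
import numpy as np

def energy_in_top_k(signal, k):
    """Fraction of one-sided FFT energy held by the k largest coefficients."""
    coeffs = np.fft.rfft(signal)
    energy = np.abs(coeffs) ** 2
    top_k = np.sort(energy)[::-1][:k]   # k largest energies, descending
    return float(top_k.sum() / energy.sum())

rng = np.random.default_rng(0)
t = np.arange(600) / 60.0   # 10 s at an assumed 60 Hz sampling rate
# Synthetic "wrist acceleration": two slow oscillations plus sensor noise.
signal = np.sin(2 * np.pi * 0.5 * t) + 0.5 * np.sin(2 * np.pi * 1.5 * t)
signal = signal + 0.05 * rng.standard_normal(t.size)

ratio = energy_in_top_k(signal, k=5)
print(f"energy captured by the 5 largest FFT coefficients: {ratio:.3f}")
```

A ratio close to one means that zeroing all but the $k$ largest coefficients changes the signal very little, which is the sense in which such a signal is nearly $k$-sparse.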
Definition 2 (Restricted Isometry Property).

Let $S_{k}\subset\mathbb{R}^{n}$ be the set of $k$-sparse vectors. For some parameter $\delta\in(0,1)$, a matrix $\mathbf{A}\in\mathbb{R}^{m\times n}$ is said to satisfy $\text{RIP}(k,\delta)$ if, $\forall\mathbf{x}\in S_{k}$,

$$(1-\delta)\|\mathbf{x}\|_{2}\leq\|\mathbf{A}\mathbf{x}\|_{2}\leq(1+\delta)\|\mathbf{x}\|_{2}.$$

Definition 3 (Restricted Eigenvalue Condition).

Let $S_{k}\subset\mathbb{R}^{n}$ be the set of $k$-sparse vectors. For some parameter $\gamma>0$, a matrix $\mathbf{A}\in\mathbb{R}^{m\times n}$ is said to satisfy $\text{REC}(k,\gamma)$ if, $\forall\mathbf{x}\in S_{k}$,

$$\|\mathbf{A}\mathbf{x}\|_{2}\geq\gamma\|\mathbf{x}\|_{2}.$$

Intuitively, RIP implies that $\mathbf{A}$ approximately preserves Euclidean norms ($l_{2}$ norms) of sparse vectors, and REC implies that sparse vectors are far from the nullspace of $\mathbf{A}$ [21]. Given the sparsity of $\mathbf{x}^{*}$ and the RIP/REC property of the chosen matrix $\mathbf{A}$, it has been shown that the recovered signal $\mathbf{\hat{x}}$ is the unique solution of the problem $\mathcal{P}_{0}$ in (3), i.e., $\mathbf{\hat{x}}\approx\mathbf{x}^{*}$ [20]. As the sparsity of the signal depends on the natural domain of the signal, the solution for $\mathcal{P}_{0}$ depends on two aspects: (i) the choice of measurement matrix $\mathbf{A}$ and (ii) the choice of recovery method, i.e., the optimization solver for $\mathcal{P}_{0}$. In conventional compressive sensing methods, the common choices are (i) a Gaussian matrix and (ii) a convex optimization solver like Lasso (least absolute shrinkage and selection operator) [22, 17]. Note that in this work, we do not explicitly analyze the sparsity of the IMU signal but rely on approximation methods, such as Lasso and generative models, to solve the optimization problem. In this way, we do not need to pay the extra cost of signal processing through transformations, such as Fourier transform or Wavelet transform, at the transmitter [17, 20, 23, 21].
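
To make the intuition behind Definitions 2 and 3 concrete, the sketch below samples random $k$-sparse vectors and records the ratio $\|\mathbf{A}\mathbf{x}\|_{2}/\|\mathbf{x}\|_{2}$ for a Gaussian matrix. The sizes, the $\mathcal{N}(0,1/m)$ entry scaling, and the helper names `random_k_sparse` and `norm_ratio_range` are assumptions for illustration. Note that random sampling can only suggest plausible values of $\delta$ and $\gamma$; it does not certify RIP or REC, which must hold for all $k$-sparse vectors.

```python
import numpy as np

def random_k_sparse(n, k, rng):
    """Draw a vector in R^n with exactly k nonzero Gaussian entries."""
    x = np.zeros(n)
    support = rng.choice(n, size=k, replace=False)
    x[support] = rng.standard_normal(k)
    return x

def norm_ratio_range(A, k, trials, rng):
    """Min and max of ||Ax||_2 / ||x||_2 over sampled k-sparse vectors."""
    n = A.shape[1]
    ratios = []
    for _ in range(trials):
        x = random_k_sparse(n, k, rng)
        ratios.append(np.linalg.norm(A @ x) / np.linalg.norm(x))
    return min(ratios), max(ratios)

rng = np.random.default_rng(0)
m, n, k = 128, 512, 5                          # assumed sizes
A = rng.standard_normal((m, n)) / np.sqrt(m)   # entries ~ N(0, 1/m)
lo, hi = norm_ratio_range(A, k, trials=2000, rng=rng)

# Empirically, lo plays the role of gamma (REC) and 1 - delta (RIP),
# while hi plays the role of 1 + delta (RIP), on the sampled vectors only.
print(f"sampled ratio range: [{lo:.3f}, {hi:.3f}]")
```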

By using the definition of the $l_{p}$ norm in (4), $\|\mathbf{x}\|_{1}$ is a convex function, and $\mathcal{P}_{0}$ is an $l_{1}$ minimization problem with quadratic constraint. The solution of $\mathcal{P}_{0}$ is equivalent to the output of the Lasso, which consists of solving $\mathcal{P}_{1}$, for some parameter $\tau\geq 0$ [20]:

$$\begin{aligned}
\mathcal{P}_{1}:\quad &\min_{\mathbf{x}}\ \|\mathbf{A}\mathbf{x}-\mathbf{\hat{y}}\|_{2}, &&\text{(5a)}\\
&\text{subject to}\ \ \|\mathbf{x}\|_{1}\leq\tau. &&\text{(5b)}
\end{aligned}$$

In practice, the solution of Lasso is equivalent to solving the Lagrangian of the problem $\mathcal{P}_{1}$ above, for some parameter $\lambda\geq 0$, i.e., [20, Equation 3.1]:

$$\min_{\mathbf{x}}\ \|\mathbf{A}\mathbf{x}-\mathbf{\hat{y}}\|_{2}^{2}+\lambda\|\mathbf{x}\|_{1}. \tag{6}$$

Intuitively, the $l_{1}$ penalty term $\lambda\|\mathbf{x}\|_{1}$ in (6) enforces sparsity (Definition 1) by adding a penalty proportional to the absolute values of the coefficients of $\mathbf{x}$. As a result, the sparsity assumption ($k$-sparse or nearly $k$-sparse) in the structure of the signal has an impact on the performance of the Lasso solver. Recall that in this work, we do not explicitly analyze the sparsity of the IMU signal, which usually causes extra costs through the Fourier/Wavelet transform at the transmitter. In addition, the solutions relying on the sparsity assumption are known to yield poor recovery performance when the linear measurement is not sufficient (i.e., the value of $m$ is too small), or the considered signal has a small number of dimensions (i.e., the value of $n$ is not sufficiently large) [23, 21]. This motivates us to utilize generative models, such as variational auto-encoders (VAEs) [18] and generative adversarial networks (GANs) [24], as an alternative to the sparsity assumption combined with a convex optimization solver like Lasso.
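
For reference, the Lagrangian form (6) can be solved with a simple proximal gradient method (ISTA). The sketch below is one common way to solve (6) and is not necessarily the Lasso solver used in the experiments of this paper. The problem sizes, the synthetic exactly $k$-sparse signal, the value `lam = 0.1`, and the function names `soft_threshold` and `lasso_ista` are assumptions for illustration.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1: shrink each entry toward zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(A, y_hat, lam, n_iter=500):
    """Minimize ||A x - y_hat||_2^2 + lam * ||x||_1 by proximal gradient steps."""
    L = 2.0 * np.linalg.norm(A, 2) ** 2   # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * A.T @ (A @ x - y_hat)         # gradient of the squared error
        x = soft_threshold(x - grad / L, lam / L)  # l1 shrinkage step
    return x

rng = np.random.default_rng(0)
m, n, k = 60, 200, 5                               # assumed sizes
A = rng.standard_normal((m, n)) / np.sqrt(m)
x_star = np.zeros(n)
x_star[rng.choice(n, size=k, replace=False)] = rng.choice([-2.0, 2.0], size=k)
y_hat = A @ x_star + 0.01 * rng.standard_normal(m)  # noisy measurements as in (2)

x_rec = lasso_ista(A, y_hat, lam=0.1, n_iter=2000)
rel_err = np.linalg.norm(x_rec - x_star) / np.linalg.norm(x_star)
print(f"relative reconstruction error: {rel_err:.3f}")
```

The iterative loop is what makes such solvers comparatively slow at inference time: every new received signal requires a fresh run of many iterations.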

II-B Generative Models

Generative models are a class of machine learning models that can be used for modeling the complex distribution of large-scale datasets. In the context of compressive sensing, a generative model can be used to estimate the distribution of the input signals. After the generative model is trained with a training set, it can generate a new data sample that is similar to the samples drawn from the original set [25]. Intuitively, the generative model can learn and synthesize the underlying distribution of the high dimensional and complex data, which eliminates the sparsity assumption about the data structure of conventional compressive sensing techniques.

A generative model describes a probability density function $p$: $\mathcal{X}\rightarrow\mathbb{R}$ ($\mathcal{X}$ is a finite set) through an unobserved, or “latent”, variable $\mathbf{z}$. The probability density function is then calculated by:

$$p(\mathbf{x})=\int_{\mathbf{z}}p(\mathbf{x}|\mathbf{z})p(\mathbf{z})\,d\mathbf{z}, \tag{7}$$

where $\forall\mathbf{x}\in\mathcal{X}$, the probability $p(\mathbf{z})$ is the prior, and the forward probability $p(\mathbf{x}|\mathbf{z})$ is the likelihood [25]. In practice, this probability density function is usually parameterized by a model $\boldsymbol{\theta}$ (e.g., a deep neural network). In such a case, equation (7) can be rewritten as follows [25]:

$$p_{\boldsymbol{\theta}}(\mathbf{x})=\int_{\mathbf{z}}p_{\boldsymbol{\theta}}(\mathbf{x}|\mathbf{z})p(\mathbf{z})\,d\mathbf{z}. \tag{8}$$

The integral in (8) cannot easily be computed as the likelihood $p(\mathbf{x}|\mathbf{z})$ is computationally expensive with conventional methods such as maximum likelihood, especially for large-scale datasets. In this work, we develop our generative model based on a popular class of generative models, which are called variational auto-encoders (VAEs), first introduced in [18]. As opposed to other generative models such as GANs [24], VAEs can generate more dispersed samples over the data and can learn complex data distributions [25]. In addition, VAEs are better for data inference, which is suitable for our generative model that wants to exploit the hidden “sparsity” patterns in the IMU data.

In VAEs, in addition to the likelihood parameterized by a decoder (a deep neural network), the probability density function $p_{\boldsymbol{\theta}}(\mathbf{x})$ is conditioned through an encoder $q_{\boldsymbol{\phi}}(\mathbf{z|x})$ parameterized by another deep neural network. The encoder approximates the true but intractable posterior $p_{\boldsymbol{\theta}}(\mathbf{z|x})$. To train a VAE, we optimize a variational lower bound on $\log p_{\boldsymbol{\theta}}(\mathbf{x})$, called the evidence lower bound (ELBO). It is defined as follows [25]:

$$\mathcal{L}_{\boldsymbol{\theta},\boldsymbol{\phi}}(\mathbf{x}) = \int_{\mathbf{z}} q_{\boldsymbol{\phi}}(\mathbf{z|x}) \log \frac{p_{\boldsymbol{\theta}}(\mathbf{x|z})\, p(\mathbf{z})}{q_{\boldsymbol{\phi}}(\mathbf{z|x})}\, d\mathbf{z}. \tag{9}$$

The newly introduced density function $q_{\boldsymbol{\phi}}(\mathbf{z|x})$ is referred to as the variational (approximate) posterior, with $\boldsymbol{\phi}$ defined as the variational parameters.

As the encoder $q_{\boldsymbol{\phi}}(\mathbf{z|x})$ is used to approximate the posterior $p_{\boldsymbol{\theta}}(\mathbf{z|x})$, sampling from the approximate posterior is straightforward, which yields an unbiased Monte Carlo estimate of $\mathcal{L}$ [25]:

$$\mathcal{\hat{L}}_{\boldsymbol{\theta},\boldsymbol{\phi}}(\mathbf{x}) = \log \frac{p_{\boldsymbol{\theta}}(\mathbf{x|z})\, p(\mathbf{z})}{q_{\boldsymbol{\phi}}(\mathbf{z|x})}, \quad \text{where } \mathbf{z} \leftarrow q_{\boldsymbol{\phi}}(\cdot|\mathbf{x}). \tag{10}$$

The notation $\mathbf{z} \leftarrow q_{\boldsymbol{\phi}}(\cdot|\mathbf{x})$ means that $\mathbf{z}$ is sampled from the approximate posterior distribution $q_{\boldsymbol{\phi}}(\mathbf{z|x})$. If the process used to generate $\mathbf{z}$ from $q_{\boldsymbol{\phi}}(\mathbf{z|x})$ is differentiable with respect to $\boldsymbol{\phi}$, the function $\mathcal{\hat{L}}$ can be differentiated with respect to $\boldsymbol{\theta}$ and $\boldsymbol{\phi}$ by using a stochastic gradient descent estimator. Once $\mathcal{L}$ in (9) is optimized, we can approximate the true probability density function $p(\mathbf{x})$ through the learned neural network with parameters $\boldsymbol{\theta}$, i.e., $p_{\boldsymbol{\theta}}(\mathbf{x}) \approx p(\mathbf{x})$. In other words, we can generate new data samples from the learned probability density function.
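To make the estimator in (10) concrete, the following NumPy sketch evaluates it for a toy model. The linear encoder and decoder maps, the fixed standard deviations, the standard normal prior, and all function names are illustrative assumptions rather than the architecture used in this paper. The latent sample is written as a deterministic function of the encoder output and an independent noise term, which is what makes the estimate differentiable with respect to the encoder parameters.

```python
import numpy as np

def log_normal(v, mean, std):
    """Log-density of a diagonal Gaussian N(mean, std^2), summed over dimensions."""
    return float(np.sum(-0.5 * np.log(2.0 * np.pi) - np.log(std)
                        - 0.5 * ((v - mean) / std) ** 2))

def elbo_estimate(x, enc_w, dec_w, enc_std, dec_std, rng):
    """Single-sample Monte Carlo estimate of the ELBO, following (10)."""
    mu_z = enc_w @ x                               # mean of q_phi(z|x)
    eps = rng.standard_normal(mu_z.shape)          # noise independent of phi
    z = mu_z + enc_std * eps                       # reparameterized z ~ q_phi(.|x)
    log_q = log_normal(z, mu_z, enc_std)           # log q_phi(z|x)
    log_prior = log_normal(z, 0.0, 1.0)            # log p(z), assumed standard normal
    log_lik = log_normal(x, dec_w @ z, dec_std)    # log p_theta(x|z)
    return log_lik + log_prior - log_q

rng = np.random.default_rng(0)
n, k = 6, 2                                        # toy signal and latent dimensions
enc_w = 0.1 * rng.standard_normal((k, n))          # hypothetical linear encoder
dec_w = 0.1 * rng.standard_normal((n, k))          # hypothetical linear decoder
x = rng.standard_normal(n)

# Averaging many single-sample estimates approximates the ELBO in (9).
samples = [elbo_estimate(x, enc_w, dec_w, 0.5, 1.0, rng) for _ in range(2000)]
elbo_mc = float(np.mean(samples))
```

For this linear-Gaussian toy model the marginal $\log p_{\boldsymbol{\theta}}(\mathbf{x})$ is available in closed form, so one can check numerically that the averaged estimate stays below it, as expected of a lower bound.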

Figure 3: The proposed CS-VAE learning algorithm with a novel measurement matrix at the transmitter and the generative model, i.e., a VAE, at the receiver. The transmitted signal at the transmitter is the $m$-dimensional vector $\mathbf{y}$, which is a compressed version of the original $n$-dimensional vector $\mathbf{x}^*$. At the receiver, the VAE recovers the original signal, i.e., $\mathbf{\hat{x}} \approx \mathbf{x}^*$, from a noisy and compressed measurement $\mathbf{\hat{y}}$.

II-C Generative Model-based Compressive Sensing

In the context of compressive sensing, however, the data sample $\mathbf{x}$ from the training set is not fully observable, i.e., our model can only observe the noisy compressed (down-sampled) version $\mathbf{\hat{y}}$. Replacing $\mathbf{x}$ with $\mathbf{y} = \mathbf{Ax}$, the unbiased Monte Carlo estimate of the ELBO in (10) is rewritten as:

$$\mathcal{\hat{L}}_{\boldsymbol{\theta},\boldsymbol{\phi}}(\mathbf{y}) = \log \frac{p_{\boldsymbol{\theta}}(\mathbf{Ax|z})\, p(\mathbf{z})}{q_{\boldsymbol{\phi}}(\mathbf{z|Ax})}, \quad \text{where } \mathbf{z} \leftarrow q_{\boldsymbol{\phi}}(\cdot|\mathbf{y} = \mathbf{\hat{y}}). \tag{11}$$

As observed from (11), the generative model cannot directly generate the data sample $\mathbf{x}$ from the compressed observation $\mathbf{\hat{y}}$ without prior knowledge of the measurement matrix $\mathbf{A}$. In other words, the measurement matrix $\mathbf{A}$ is assumed to be known at the generative model [23, 21]. Recall that the generative model is deployed at the receiver. The assumption that the transmitter and receiver share prior information, e.g., a codebook, is commonly used in source and channel coding methods [26, 27]. Given the setting above, the solution of (5) is equivalent to the output of the generator $G(\mathbf{z})$ of the problem $\mathcal{P}_2$, for some $\upsilon \geq 0$, as follows [21]:

$$\begin{aligned}
\mathcal{P}_2: \quad & \min_{\mathbf{z}} \|\mathbf{A}G(\mathbf{z}) - \mathbf{\hat{y}}\|_2, && \text{(12a)} \\
\text{subject to} \quad & \|G(\mathbf{z})\|_1 \leq \upsilon. && \text{(12b)}
\end{aligned}$$

The generator $G(\mathbf{z})$ is defined as a function $G: \mathbb{R}^k \rightarrow \mathbb{R}^n$ mapping a latent vector $\mathbf{z}$ to the mean of the conditional distribution $p_{\boldsymbol{\theta}}(\mathbf{x|z})$. Given the observation $\mathbf{\hat{y}}$ as the input of the model, the latent vector $\mathbf{z}$ is obtained by sampling from the posterior distribution $q_{\boldsymbol{\phi}}(\cdot|\mathbf{y})$ in (11). After that, the generator $G(\mathbf{z})$ produces the output vector $\mathbf{\hat{x}}$ from this latent vector $\mathbf{z}$, i.e., $G(\mathbf{z}) = \mathbf{\hat{x}}$. As a result, the minimizer $\mathbf{z}^*$ of the optimization problem $\mathcal{P}_2$ in (12) makes $G(\mathbf{z}^*) \approx \mathbf{x}^*$ [23].
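As a minimal illustration of optimizing over the latent vector in (12a), the sketch below assumes a hypothetical linear generator $G(\mathbf{z}) = \mathbf{Wz}$ and drops the $\ell_1$ constraint (12b), in which case the problem reduces to ordinary least squares in $\mathbf{z}$. The generator in this paper is a deep decoder that requires a gradient-based optimizer, so the matrices, dimensions, and noise-free setting here are assumptions chosen only to keep the example verifiable.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, k = 40, 12, 4                            # signal, measurement, latent dimensions

W = rng.standard_normal((n, k))                # hypothetical linear generator G(z) = W z
A = rng.standard_normal((m, n)) / np.sqrt(m)   # generic Gaussian measurement matrix

z_true = rng.standard_normal(k)
x_star = W @ z_true                            # x* lies in the range of the generator
y_hat = A @ x_star                             # noise-free compressed measurement

# min_z ||A G(z) - y_hat||_2 is a least-squares problem in z when G is linear.
z_opt, *_ = np.linalg.lstsq(A @ W, y_hat, rcond=None)
x_rec = W @ z_opt                              # reconstructed signal G(z*)

rel_err = np.linalg.norm(x_rec - x_star) / np.linalg.norm(x_star)
```

Although $m < n$, the signal is recovered up to numerical precision because $\mathbf{x}^*$ lies in the $k$-dimensional range of the generator and $m \geq k$.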

In comparison with (5), the variable vector $\mathbf{x}$ is now replaced by the generative function $G(\mathbf{z})$ in (12). As a result, once the objective of $\mathcal{P}_2$ in (12) is optimized, the generator $G(\mathbf{z})$ can generate new samples that are similar to the original vector $\mathbf{x}^*$. Recall that under the compressive sensing setting, our generative model does not fully observe the signals as in the conventional setting of generative modeling, i.e., learning $p_{\boldsymbol{\theta}}(\mathbf{x}) \approx p(\mathbf{x})$ directly. In the compressive sensing setting, the generative model can only observe the noisy and compressed signal $\mathbf{\hat{y}}$. Therefore, the objective defined in (12) optimizes the generative model indirectly via the observation $\mathbf{\hat{y}}$, given the measurement matrix $\mathbf{A}$. The space of signals that can be recovered with the generative model is then given by the range of the generator function, i.e.,

$$S_G = \{G(\mathbf{z}) : \mathbf{z} \in \mathbb{R}^k\}. \tag{13}$$

As the range of the signals is now transformed into the latent space $\mathbf{z} \in \mathbb{R}^k$, the RIP and REC properties of the measurement matrix $\mathbf{A}$ no longer guarantee the accuracy of the recovered signals. With generative model-based compressive sensing, the measurement matrix $\mathbf{A}$ is required to satisfy a Set-Restricted Eigenvalue Condition (S-REC), which is a generalized version of REC [23] and is defined as follows.

Definition 4 (Set-Restricted Eigenvalue Condition).

Let $S \subseteq \mathbb{R}^n$. For some parameters $\gamma > 0$ and $\kappa \geq 0$, a matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$ is said to satisfy the $\text{S-REC}(S, \gamma, \kappa)$ if, $\forall \mathbf{x}_1, \mathbf{x}_2 \in S$,

$$\|\mathbf{A}(\mathbf{x}_1 - \mathbf{x}_2)\|_2 \geq \gamma \|\mathbf{x}_1 - \mathbf{x}_2\|_2 - \kappa.$$

Intuitively, the S-REC property generalizes the REC property to an arbitrary set of vectors $S$ instead of considering only the set of approximately sparse vectors $S_k$ [21]. This generalization makes S-REC well suited to solving a compressive sensing problem with a stochastic gradient estimator via deep neural networks.
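The inequality in Definition 4 can be probed numerically on a finite sample of $S$. The sketch below is only a necessary check, since passing on sampled pairs does not prove the condition for the whole set. The helper names `srec_holds` and `empirical_gamma`, the linear stand-in for the generator, and the generic Gaussian matrix are all assumptions made for illustration.

```python
import itertools
import numpy as np

def srec_holds(A, points, gamma, kappa):
    """Check the S-REC inequality on every pair drawn from a finite sample of S."""
    for x1, x2 in itertools.combinations(points, 2):
        diff = x1 - x2
        if np.linalg.norm(A @ diff) < gamma * np.linalg.norm(diff) - kappa:
            return False
    return True

def empirical_gamma(A, points):
    """Smallest ratio ||A(x1 - x2)|| / ||x1 - x2|| over sampled pairs (kappa = 0)."""
    return min(np.linalg.norm(A @ (x1 - x2)) / np.linalg.norm(x1 - x2)
               for x1, x2 in itertools.combinations(points, 2))

rng = np.random.default_rng(2)
n, m, k = 40, 12, 4
W = rng.standard_normal((n, k))                  # hypothetical linear generator
A = rng.standard_normal((m, n)) / np.sqrt(m)     # generic Gaussian measurement matrix
points = [W @ rng.standard_normal(k) for _ in range(50)]   # samples from S_G

ok_loose = srec_holds(A, points, gamma=0.0, kappa=0.0)     # trivially satisfied
ok_tight = srec_holds(A, points, gamma=100.0, kappa=0.0)   # far too demanding
gamma_hat = empirical_gamma(A, points)                     # largest gamma that passes
```

The value `gamma_hat` gives an optimistic estimate of $\gamma$ for $\kappa = 0$ on the sampled pairs. Increasing $\kappa$ relaxes the condition and admits larger values of $\gamma$.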

III Problem Formulation and Proposed Learning Algorithm

In this section, we apply the generative model-based compressive sensing framework to our system model in Fig. 1. The presence of the communication channel between the transmitter and the receiver makes the reconstruction of IMU signals with generative model-based compressive sensing much more challenging. In particular, the measurement matrices in [23] and [21] cannot guarantee the power constraint of the transmitter. Conventional normalization techniques such as $l_2$ normalization [28] are not applicable because they yield a nonlinear projection from $\mathbf{x}$ to $\mathbf{y}$, which prevents the receiver from recovering the original signal. To address this, we propose a new measurement matrix that (i) ensures the power constraint for the transmitter and (ii) satisfies the S-REC property of generative model-based compressive sensing. The learning algorithm with the newly designed measurement matrix is described in the following.

The proposed learning process, which we refer to as “CS-VAE” (Compressive Sensing-based Variational Auto-Encoder), is illustrated in Fig. 3. At the transmitter, we have the vector $\mathbf{x}^* \in \mathbb{R}^n$ and a measurement matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$. The output of the measurement matrix is the signal $\mathbf{y} \in \mathbb{R}^m$, where $m < n$. The signal $\mathbf{y = Ax^*}$ is subject to the power constraint at the transmitter, i.e., $\frac{1}{m}\|\mathbf{y}\|_2^2 \leq P_T$, where $P_T$ is the transmission power constraint on a single channel use [27, 29]. Details of the power constraint for our framework are further discussed in Appendix A. The optimization problem of the proposed learning model is similar to (12), with an additional power constraint, as follows:

$$\begin{aligned}
\mathcal{P}_3: \quad & \min_{\mathbf{z}} \|\mathbf{A}G(\mathbf{z}) - \mathbf{\hat{y}}\|_2, && \text{(14a)} \\
\text{subject to} \quad & \frac{1}{m}\|\mathbf{y}\|_2^2 \leq P_T, && \text{(14b)} \\
& \|G(\mathbf{z})\|_1 \leq \upsilon. && \text{(14c)}
\end{aligned}$$

The power constraint in (14b) poses additional challenges in designing the measurement matrix $\mathbf{A}$ to ensure that the recovered signal is unique and similar to the original signal. Specifically, this is a very challenging quadratically constrained problem [20], and designing a measurement matrix that satisfies the dual constraint, i.e., the S-REC property and the power constraint, has not been investigated in the literature. Existing generative model-based compressive sensing approaches [23, 21, 30, 31] cannot be directly applied to this problem. To address the dual-constraint optimization problem $\mathcal{P}_3$ in (14), we first design a new measurement matrix $\mathbf{A}$ in Proposition 1 that makes $\mathbf{y = Ax}$ satisfy the power constraint $\frac{1}{m}\|\mathbf{y}\|_2^2 \leq P_T$. After the power constraint is eliminated, we use the Lagrangian of $\mathcal{P}_3$ as a loss function to train the generative model in a manner similar to [23].

The proposed measurement matrix for the problem $\mathcal{P}_3$ is stated as follows.

Proposition 1 (S-REC with power constraint).

The recovered signal obtained by the generative model-based compressive sensing method under the power constraint is guaranteed to be a unique solution if

  • $\mathbf{A}$ satisfies the S-REC property, and

  • Each element $A_{ij}$ (the $j$-th element of the $i$-th row) of $\mathbf{A}$ is drawn i.i.d. from a Gaussian distribution with zero mean and variance $\sigma_a^2 = \frac{P_T}{n^2 d^2 (d\sigma_x + \mu_x)^2}$, i.e.,

$$A_{ij} \sim \mathcal{N}\Big(0, \frac{P_T}{n^2 d^2 (d\sigma_x + \mu_x)^2}\Big),$$

where $\sigma_x^2$ and $\mu_x$ are the statistical variance and mean of the source signals $\mathbf{x} \in \mathbb{R}^n$, respectively, and $d > 0$ is a real number derived from Chebyshev’s inequality.

The proof of Proposition 1 can be found in Appendix A.

Remark.

The normal distribution used to generate the random matrix $\mathbf{A}$ in Proposition 1 depends on the mean and variance of the source signals $\mathbf{x}$. The assumption of knowing these statistics is common in source-channel coding schemes [26]. For example, in the case of an i.i.d. Gaussian source with power constraint $P$, these values are $\sigma_x^2 \approx P$ and $\mu_x = 0$ [26, Section 9.1]. Viewing compressive sensing as a source-channel coding scheme, $\mathbf{A}$ can be considered an encoding function [27]. In addition, as training deep learning models usually requires access to the training set for pre-processing and learning, the assumption of knowing the mean and variance of the signals is more reasonable and practical than assuming an i.i.d. Gaussian source. Another parameter in Proposition 1 is $d > 0$, a real number that restricts the random variable $x_j$ (the $j$-th element of the source signal $\mathbf{x}$) to the interval $[\mu_x - d\sigma_x, \mu_x + d\sigma_x]$.
Given the measurement matrix $\mathbf{A}$ in Proposition 1, the recovered signal of the optimization problem $\mathcal{P}_3$ is guaranteed to be the unique solution if (i) $\mathbf{A}$ satisfies the power constraint with probability greater than $1 - \frac{1}{d^2}$ and (ii) $\mathbf{A}$ satisfies the S-REC property with probability greater than $1 - 2\exp\Big(\frac{(-2\gamma m\sqrt{n} - m\kappa)^2}{4c\|\mathbf{A}\|_2^2}\Big)$. Details of the parameters are discussed in Appendix A.
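The construction in Proposition 1 can be sketched in a few lines of NumPy. The function name `sample_measurement_matrix`, the values of $m$, $P_T$, and $d$, and the uniformly distributed stand-in data are assumptions for illustration. Only the variance formula is taken from the proposition. The snippet draws $\mathbf{A}$ and then measures how many compressed frames satisfy $\frac{1}{m}\|\mathbf{y}\|_2^2 \leq P_T$.

```python
import numpy as np

def sample_measurement_matrix(m, n, p_t, mu_x, sigma_x, d, rng):
    """Draw A with i.i.d. N(0, sigma_a^2) entries, sigma_a^2 as in Proposition 1."""
    sigma_a = np.sqrt(p_t) / (n * d * (d * sigma_x + mu_x))
    return sigma_a * rng.standard_normal((m, n))

rng = np.random.default_rng(3)
n, m = 204, 48           # n matches one IMU frame; m is an arbitrary choice here
p_t, d = 1.0, 3.0        # illustrative power budget and Chebyshev parameter

# Stand-in for normalized training frames in (-1, 1); this is not the DIP-IMU data.
X = rng.uniform(-1.0, 1.0, size=(1000, n))
mu_x, sigma_x = float(X.mean()), float(X.std())

A = sample_measurement_matrix(m, n, p_t, mu_x, sigma_x, d, rng)
Y = X @ A.T                                   # each row is y = A x
avg_power = np.mean(Y ** 2, axis=1)           # (1/m) ||y||_2^2 per frame
fraction_ok = float(np.mean(avg_power <= p_t))
```

On this synthetic data the average power stays far below $P_T$, which suggests that the variance is a conservative choice: it scales the entries so that even the worst-case sum over $n$ bounded terms remains within the budget.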

As the measurement matrix $\mathbf{A}$ is designed based on Proposition 1, the power constraint in (14) is guaranteed and can therefore be dropped. As a result, the optimization problem in (14) is equivalent to solving the following problem:

$$\mathcal{P}_4: \min_{\mathbf{z}} \|\mathbf{A}G(\mathbf{z}) - \mathbf{\hat{y}}\|_2^2 + \lambda \|G(\mathbf{z})\|_1, \tag{15}$$

where $\lambda$ is the Lagrange multiplier. As $\mathbf{z}$ is differentiable with respect to the generative model’s parameters (e.g., using the reparameterization trick [25]), one can use the loss function based on (15) to train the generative model [21]. Once the generative model is trained to obtain the solution for (14), denoted by $\mathbf{z}^*$, the reconstruction error can be bounded with probability $1 - e^{-\Omega(m)}$ by [23]:

$$\|G(\mathbf{z}^*) - \mathbf{x}^*\|_2 \leq 6 \min_{\mathbf{z} \in \mathbb{R}^k} \|G(\mathbf{z}) - \mathbf{x}^*\|_2 + 3\|\boldsymbol{\eta}\|_2 + 2\epsilon, \tag{16}$$

where $\epsilon$ is an additive error term caused by the use of gradient descent-based optimizers.

The pseudocode of our training algorithm is given in Algorithm 1 and can be described as follows. We first initialize the measurement matrix $\mathbf{A}$ following Proposition 1, together with random parameters for the inference network, i.e., the encoder $q_{\boldsymbol{\phi}}(\mathbf{z|y})$, and the generative model, i.e., the decoder $p_{\boldsymbol{\theta}}(\mathbf{x|z})$ (lines 1-3 of Algorithm 1). In each training loop, a batch of samples $\mathbf{x}^*$ is drawn i.i.d. from the training set (line 5). The input of the VAE’s encoder is $\mathbf{\hat{y}}$, obtained by using (2) (line 6). The latent vector $\mathbf{z}$ is obtained in line 7, and the output of the VAE’s decoder is $\mathbf{\hat{x}}$ in line 8. After that, the training loss $L(\mathbf{z})$ of the VAE is computed as in line 9 and minimized with the Adam optimizer, which solves the problem $\mathcal{P}_4$. After training, the pre-trained VAE can reconstruct the signal $\mathbf{\hat{x}} = G(\mathbf{z}^*)$ ($\mathbf{z}^*$ is fixed during testing with the test set) with the reconstruction error bounded by (16) (line 14).

1 Input: Initialize measurement matrix $\mathbf{A}$ that satisfies Proposition 1.
2 Initialize encoder of the VAE with inference model $q_{\boldsymbol{\phi}}(\mathbf{z|y})$.
3 Initialize decoder of the VAE with generative model $p_{\boldsymbol{\theta}}(\mathbf{x|z})$.
4 for $t = 0, 1, 2, \ldots$ do
5       Sample $\mathbf{x}^*$ from the training set.
6       Obtain $\mathbf{\hat{y}}$ from (2).
7       Obtain latent vector $\mathbf{z} \leftarrow q_{\boldsymbol{\phi}}(\cdot|\mathbf{y})$, with $\mathbf{y} = \mathbf{\hat{y}}$.
8       Obtain $\mathbf{\hat{x}} \leftarrow p_{\boldsymbol{\theta}}(\mathbf{x|z})$ at the output of the generator $G(\mathbf{z})$.
9       Compute the loss based on (15), i.e.,
        $$L(\mathbf{z}) = \|\mathbf{A}G(\mathbf{z}) - \mathbf{\hat{y}}\|_2^2 + \lambda \|G(\mathbf{z})\|_1, \tag{17}$$
10      Update the neural network parameters by using backpropagation with the loss $L(\mathbf{z})$.
11 end for
12 Output: $\mathbf{z} \rightarrow \mathbf{z}^*$.
13 Reconstructed signal $\mathbf{\hat{x}} = G(\mathbf{z}^*)$.
14 Reconstruction error: $\|G(\mathbf{z}^*) - \mathbf{x}^*\|_2$ bounded by (16).
Algorithm 1 CS-VAE: Training VAE to reconstruct signals from noisy compressed measurements
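The loop in Algorithm 1 can be illustrated with a deliberately simplified toy instance. The sketch below is not the implementation used in this paper: it assumes a linear encoder mean and a linear decoder in place of deep networks, a fixed encoder noise level `z_std`, full-batch plain gradient descent in place of Adam, a generic Gaussian matrix in place of the one from Proposition 1, and synthetic low-dimensional signals in place of IMU frames. Gradients of the loss (17) are written out by hand so that the example needs only NumPy.

```python
import numpy as np

rng = np.random.default_rng(5)
n, m, k, num = 20, 10, 4, 256          # signal, measurement, latent dims; sample count
lam, lr, z_std, steps = 0.01, 5e-4, 0.05, 2000

# Synthetic low-dimensional signals standing in for the training set.
X = rng.standard_normal((num, k)) @ rng.standard_normal((k, n)) / np.sqrt(k)
A = rng.standard_normal((m, n)) / np.sqrt(m)     # line 1 (generic Gaussian matrix)
V = 0.1 * rng.standard_normal((k, m))            # line 2: linear encoder mean
W = 0.1 * rng.standard_normal((n, k))            # line 3: linear decoder G(z) = W z

losses = []
for t in range(steps):                           # line 4
    # Lines 5-6: full batch here, y_hat = A x + eta with Gaussian channel noise.
    Y_hat = X @ A.T + 0.05 * rng.standard_normal((num, m))
    eps = rng.standard_normal((num, k))
    Z = Y_hat @ V.T + z_std * eps                # line 7: reparameterized latent
    X_gen = Z @ W.T                              # line 8: x_hat = G(z)
    R = X_gen @ A.T - Y_hat                      # residual A G(z) - y_hat
    loss = (np.sum(R ** 2) + lam * np.sum(np.abs(X_gen))) / num   # line 9, eq. (17)
    losses.append(float(loss))

    # Line 10: manual backpropagation through the decoder and the encoder.
    grad_x = (2.0 * R @ A + lam * np.sign(X_gen)) / num   # dL/dX_gen
    grad_W = grad_x.T @ Z                                  # dL/dW
    grad_Z = grad_x @ W                                    # dL/dZ
    grad_V = grad_Z.T @ Y_hat                              # dL/dV
    W -= lr * grad_W
    V -= lr * grad_V

# Lines 12-14: reconstruct one signal with the encoder mean and report the error.
x_star = X[0]
x_rec = W @ (V @ (A @ x_star))
rec_err = float(np.linalg.norm(x_rec - x_star))
```

Because the loss only compares $\mathbf{A}G(\mathbf{z})$ with $\mathbf{\hat{y}}$, a decreasing training loss does not by itself imply a small `rec_err`. That gap is what the S-REC property and the bound in (16) are meant to control.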

IV Performance Evaluation

IV-A Dataset and Simulation Settings

IV-A1 Dataset and VAE’s Parameters

We use the IMU data from the DIP-IMU dataset [13], designed specifically for capturing 3D human body motion with calibrated IMU sensors. The dataset contains acceleration and orientation information of 17 IMU sensors placed on the participants. The entire dataset consists of 64 data sequences of 10 participants, equivalent to 330,178 frames of motion under various activities. The frames of motion are recorded at a rate of 60 frames per second. The activities can be divided into the following categories: upper body (e.g., arm raises, stretches, swings, and arm crossing), lower body (e.g., leg raises, squats, and lunges), locomotion (e.g., walking, sidesteps, and crossing legs), freestyle (e.g., jumping jacks, kicking, and boxing), and interaction (e.g., interacting with everyday objects such as keyboards and mobile phones, and grabbing objects). More details of the action categories and their corresponding numbers of data frames can be found in Table 6 of reference [13].

The output of the j𝑗jitalic_j-th IMU is a combination of orientation information, denoted by 𝐨t(j)9subscriptsuperscript𝐨𝑗𝑡superscript9\mathbf{o}^{(j)}_{t}\in\mathbb{R}^{9}bold_o start_POSTSUPERSCRIPT ( italic_j ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT 9 end_POSTSUPERSCRIPT, and acceleration information, denoted by 𝐚t(j)3subscriptsuperscript𝐚𝑗𝑡superscript3\mathbf{a}^{(j)}_{t}\in\mathbb{R}^{3}bold_a start_POSTSUPERSCRIPT ( italic_j ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT. One frame in the dataset at time step t𝑡titalic_t is denoted by 𝐱t=[𝐨t(1),𝐚t(1),𝐨t(2),𝐚t(2),,𝐨t(17),𝐚t(17)]subscript𝐱𝑡subscriptsuperscript𝐨1𝑡subscriptsuperscript𝐚1𝑡subscriptsuperscript𝐨2𝑡subscriptsuperscript𝐚2𝑡subscriptsuperscript𝐨17𝑡subscriptsuperscript𝐚17𝑡\mathbf{x}_{t}=\big{[}\mathbf{o}^{(1)}_{t},\mathbf{a}^{(1)}_{t},\mathbf{o}^{(2% )}_{t},\mathbf{a}^{(2)}_{t},\ldots,\mathbf{o}^{(17)}_{t},\mathbf{a}^{(17)}_{t}% \big{]}bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = [ bold_o start_POSTSUPERSCRIPT ( 1 ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_a start_POSTSUPERSCRIPT ( 1 ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_o start_POSTSUPERSCRIPT ( 2 ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_a start_POSTSUPERSCRIPT ( 2 ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , … , bold_o start_POSTSUPERSCRIPT ( 17 ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_a start_POSTSUPERSCRIPT ( 17 ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ]. As a result, one frame 𝐱tsubscript𝐱𝑡\mathbf{x}_{t}bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT in the dataset has 17×9+17×3=20417917320417\times 9+17\times 3=20417 × 9 + 17 × 3 = 204 features. 
We use the data sequences collected from 8 participants as the training set, and the data sequences collected from the other 2 participants as the test set. After removing all the data samples that have missing features (i.e., NaN values), the training set and test set contain 220,076 and 56,990 data samples, respectively. To stabilize the training process, we normalize the data $\mathbf{x}_{t}$ in the training and test sets to the range $(-1,1)$. Details of the 17 positions of the IMU sensors attached to the participants are described in [13]. For the reproducibility of the results in our paper, please refer to our GitHub repository at: https://github.com/hieunq95/compressive-sensing-imu.
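The frame layout and the preprocessing steps described above can be sketched as follows. This is a minimal illustration and not the code from the repository: the helper names (`build_frame`, `drop_nan_frames`, `fit_min_max`, `normalize`) are hypothetical, the data are synthetic, and per-feature min-max scaling fitted on the training set is an assumption, since the text only states that the data are normalized to the range $(-1,1)$.

```python
import numpy as np

NUM_IMUS = 17  # sensors per frame in the DIP-IMU dataset

def build_frame(orientations, accelerations):
    """Stack [o(1), a(1), ..., o(17), a(17)] into one 204-dimensional frame."""
    parts = [np.concatenate([o.reshape(-1), a]) for o, a in zip(orientations, accelerations)]
    return np.concatenate(parts)

def drop_nan_frames(frames):
    """Remove every frame that has at least one missing (NaN) feature."""
    return frames[~np.isnan(frames).any(axis=1)]

def fit_min_max(train_frames):
    """Per-feature minimum and maximum, estimated on the training set only."""
    return train_frames.min(axis=0), train_frames.max(axis=0)

def normalize(frames, lo, hi, eps=1e-8):
    """Affine map of each feature from [lo, hi] to approximately [-1, 1] (assumed scheme)."""
    return 2.0 * (frames - lo) / (hi - lo + eps) - 1.0

rng = np.random.default_rng(0)
# Synthetic stand-ins for one time step: 17 blocks of 3x3 orientation values and 17 accelerations.
frame = build_frame(rng.normal(size=(NUM_IMUS, 3, 3)), rng.normal(size=(NUM_IMUS, 3)))

frames = rng.normal(size=(5, 204))
frames[2, 7] = np.nan              # simulate a sample with a missing feature
clean = drop_nan_frames(frames)    # the frame containing NaN is discarded
lo, hi = fit_min_max(clean)
train_norm = normalize(clean, lo, hi)
```

With this assumed scheme, test frames would be scaled with the training-set `lo` and `hi`, so a test value outside the training range could fall slightly outside $(-1,1)$.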

By following the training process illustrated in Fig. 3, the desired signal $\mathbf{x}^{*}$ is represented as $\mathbf{x}_{t}\in\mathbb{R}^{204}$. Hereafter, we drop the time step $t$ from the notation for the sake of simplicity. The signal $\mathbf{x}^{*}$ is then multiplied by the measurement matrix $\mathbf{A}$ to get the signal $\mathbf{y}$ with fewer features, i.e., $m < 204$. The signal $\mathbf{y}$ is then passed through a simulated channel with Gaussian noise $\boldsymbol{\eta}\in\mathbb{R}^{m}$. The noise vector $\boldsymbol{\eta}$ has $m$ elements, and each element follows a Gaussian distribution $\mathcal{N}(0,\sigma_{N}^{2})$.
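As a rough sketch of this measurement and channel step, the snippet below compresses one frame and adds Gaussian noise. The function name `compress_and_transmit` is hypothetical, the frame is synthetic, and the standard deviation used to draw $\mathbf{A}$ is only a placeholder: the distribution that the framework actually uses for $\mathbf{A}$ is the one given in Proposition 1.

```python
import numpy as np

def compress_and_transmit(x, A, sigma_n, rng):
    """Return the measurement y = A x and the received signal y_hat = y + eta."""
    y = A @ x
    eta = rng.normal(0.0, sigma_n, size=y.shape)  # i.i.d. N(0, sigma_n^2) channel noise
    return y, y + eta

rng = np.random.default_rng(0)
n, m = 204, 96                              # m < n linear measurements
x = rng.uniform(-1.0, 1.0, size=n)          # synthetic stand-in for a normalized IMU frame
A = rng.normal(0.0, 1e-3, size=(m, n))      # placeholder std, NOT the Proposition 1 value
y, y_hat = compress_and_transmit(x, A, sigma_n=1e-3, rng=rng)
```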

The noisy signal $\mathbf{\hat{y}}\in\mathbb{R}^{m}$ is then used as the input of the VAE. The reconstructed signal at the output of the VAE is $\mathbf{\hat{x}}\in\mathbb{R}^{204}$. For this, we design the network architecture of the VAE as follows. The encoder is a fully connected network with an input layer having $m$ neurons, one hidden layer having 64 neurons, and one latent layer having 10 neurons. The decoder is a fully connected network that has two hidden layers, each of which has 64 neurons. Finally, the output layer has 204 neurons, matching the number of features in the original IMU signals. The activation function used for the hidden layers is ReLU and the activation function used for the output layer is Tanh. The ReLU activation function is selected due to its simplicity, computational efficiency, and ability to mitigate the vanishing gradient problem. Tanh is applied at the output layer because the reconstructed signal is bounded within $(-1,1)$. Based on our experimental findings, training the model for 50 epochs with a batch size of 60 leads to consistent and stable outcomes, so we adopt these settings. We then use the trained model to evaluate the performance on the test set with the same batch size. All the parameter settings are described in Table I.
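To make the layer sizes concrete, the following NumPy sketch runs one forward pass through a network with the dimensions listed above. It is only an illustration of the shapes: the weights are random rather than trained, all function names are hypothetical, and splitting the 10-neuron latent layer into a mean head and a log-variance head with the reparameterization trick is the standard VAE construction, assumed here rather than taken from the text.

```python
import numpy as np

def dense(rng, fan_in, fan_out):
    """One fully connected layer: random weights and zero biases (untrained)."""
    weights = rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))
    return weights, np.zeros(fan_out)

def apply(layer, v):
    weights, bias = layer
    return v @ weights + bias

def relu(v):
    return np.maximum(v, 0.0)

def init_vae(rng, m, hidden=64, latent=10, n=204):
    return {
        "enc_hidden": dense(rng, m, hidden),
        "enc_mu": dense(rng, hidden, latent),       # latent mean
        "enc_logvar": dense(rng, hidden, latent),   # latent log-variance (assumed head)
        "dec_hidden1": dense(rng, latent, hidden),
        "dec_hidden2": dense(rng, hidden, hidden),
        "dec_out": dense(rng, hidden, n),
    }

def vae_forward(params, y_hat, rng):
    """Encode noisy measurements, sample a latent vector, decode to 204 features."""
    h = relu(apply(params["enc_hidden"], y_hat))
    mu = apply(params["enc_mu"], h)
    logvar = apply(params["enc_logvar"], h)
    z = mu + np.exp(0.5 * logvar) * rng.normal(size=mu.shape)  # reparameterization
    h = relu(apply(params["dec_hidden1"], z))
    h = relu(apply(params["dec_hidden2"], h))
    x_hat = np.tanh(apply(params["dec_out"], h))               # output bounded by Tanh
    return x_hat, mu, logvar

rng = np.random.default_rng(0)
m, batch_size = 96, 60
params = init_vae(rng, m)
y_hat = rng.normal(0.0, 0.1, size=(batch_size, m))  # synthetic received measurements
x_hat, mu, logvar = vae_forward(params, y_hat, rng)
```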

IV-A2 3D Avatar Model

Based on the reconstructed signals from the proposed CS-VAE model, we further transform the signals into a 3D human avatar. For this, we use the non-commercial Skinned Multi-Person Linear (SMPL) model in [19]. SMPL is a parametrized model of a 3D human body template that takes 72 pose parameters and 10 shape parameters, denoted by $\mathbf{p}\in\mathbb{R}^{72}$ and $\mathbf{s}\in\mathbb{R}^{10}$, respectively, and returns a mesh with 6,890 vertices in a 3D space. By adjusting the pose and shape parameters, i.e., $\mathbf{p}$ and $\mathbf{s}$, we can animate 3D avatars that mimic the physical shapes and movements of human users. Further details of the SMPL model can be found in [19].

To transform the IMU signals $\mathbf{\hat{x}}\in\mathbb{R}^{204}$ into a pose parameter $\mathbf{p}\in\mathbb{R}^{72}$, we use another VAE to learn the mapping function $F(\mathbf{\hat{x}}):\mathbb{R}^{204}\rightarrow\mathbb{R}^{72}$. This mapping function helps us to transform IMU signals into the input of the SMPL model [13]. For example, changes in the acceleration and orientation of the IMU data (i.e., $\mathbf{\hat{x}}$ varies) from the left wrist and left elbow of the user will make the 3D avatar move its left arm (i.e., $\mathbf{p}$ varies). In other words, given the reconstructed signals, we can create a human avatar in a virtual 3D space with a specific pose. Note that we keep the shape parameters $\mathbf{s}$ of the SMPL model constant for the sake of simplicity, as body shape modeling is not our focus. In the following, we first present the performance evaluation of the signal reconstruction with our proposed framework. Secondly, we show that our reconstructed signals are more robust in creating 3D human avatar poses. Lastly, we show a simple but effective method to make smooth animated motions for the avatars without using input signals, thanks to the ability of the generative model.

IV-A3 Baseline Approaches

| Notation | IMU Parameters | Values |
| --- | --- | --- |
| $n$ | Dimension of signals | 204 features |
| $m$ | Number of linear measurements | $[48:192]$ features |
| | Frames per second | 60 fps [32] |
| $P_{T}$ | Transmit power | 0.1 Watt [32, 33] |
| $\sigma_{N}$ | Standard deviation of noise | $[1:500]\times 10^{-4}$ |
| $d$ | Parameter of $\mathbf{A}$ in Proposition 1 | 2 |

| Notation | Algorithmic Parameters | Values |
| --- | --- | --- |
| | Learning rate (Adam optimizer) | $10^{-4}$ |
| | KL divergence weight | $10^{-5}$ |
| $\lambda$ | $l_{1}$ penalty | $10^{-5}$ |
| | Number of training epochs | 50 |

TABLE I: Parameter settings.

We evaluate the performance of our proposed framework, denoted by CS-VAE, and other baseline approaches. The considered baseline approaches are (i) Lasso (Least absolute shrinkage and selection operator) [22] and (ii) DIP (Deep Inertial Poser) [13].

Lasso is a widely adopted algorithm for solving $l_{1}$ penalized optimization problems in compressive sensing [20, 17]. It is a regression analysis technique that incorporates an $l_{1}$ penalty term into the optimization problem. In particular, the solution of Lasso is obtained by solving the optimization problem in (5), which is equivalent to solving the Lagrangian in (6). As observed from equation (6), Lasso solves an optimization problem that involves the $l_{1}$ penalty term $\lambda\|\mathbf{x}\|_{1}$, which is similar to the term $\lambda\|G(\mathbf{z})\|_{1}$ in the loss function $L(\mathbf{z})$ in the proposed Algorithm 1. For a fair comparison, we use the same value for the $l_{1}$ penalty $\lambda=10^{-5}$ for both the Lasso and CS-VAE approaches.

Note that by using the measurement matrix $\mathbf{A}$ that follows Proposition 1, the setting of Lasso in (6) is now constrained by the transmit power in equation (14b). In our later experiments, we empirically show that the power constraint in (14b) makes the optimization problem much more challenging, resulting in Lasso’s failure to reconstruct the original signals. For a more comprehensive evaluation, we also introduce a relaxed version of Lasso, denoted by “Lasso w.o. $P_{T}$” (Lasso without power constraint $P_{T}$). In this relaxed version of Lasso, we remove the power constraint in (14b) and use Lasso to recover the original signals. This can be done by replacing $\mathbf{A}$ in (6) with another unconstrained measurement matrix $\mathbf{B}$ with Gaussian entries. We use a similar measurement matrix as in [23, 21], in which the elements of $\mathbf{B}\in\mathbb{R}^{m\times n}$ are $B_{ij}\sim\mathcal{N}(0,1/m)$. Without the power constraint, “Lasso w.o. $P_{T}$” is expected to produce the best results, which serve as an upper bound on performance in our evaluation. As shown later, our proposed CS-VAE approach achieves results comparable to those obtained by “Lasso w.o. $P_{T}$”. Notably, our approach is also an order of magnitude faster than Lasso in terms of decoding time, i.e., the time to find the solution of the optimization problem given a batch of input data.
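For intuition about the “Lasso w.o. $P_{T}$” baseline, the sketch below draws $B_{ij}\sim\mathcal{N}(0,1/m)$ and runs a few iterations of an $l_{1}$ penalized least-squares solver. Several things here are assumptions: the objective is written as $\tfrac{1}{2}\|\mathbf{Bx}-\mathbf{y}\|_{2}^{2}+\lambda\|\mathbf{x}\|_{1}$, whose scaling may differ from (6); the solver is plain ISTA (iterative soft-thresholding), swapped in for illustration because the solver used in the experiments is not named in this section; and the sparse test signal is synthetic. No claim about recovery accuracy is intended.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def lasso_objective(B, y, x, lam):
    return 0.5 * np.sum((B @ x - y) ** 2) + lam * np.sum(np.abs(x))

def lasso_ista(B, y, lam, num_iters=500):
    """ISTA with step 1/L, where L = ||B||_2^2 is the gradient's Lipschitz constant."""
    step = 1.0 / np.linalg.norm(B, 2) ** 2
    x = np.zeros(B.shape[1])
    for _ in range(num_iters):
        grad = B.T @ (B @ x - y)                      # gradient of the quadratic term
        x = soft_threshold(x - step * grad, step * lam)
    return x

rng = np.random.default_rng(0)
n, m, lam = 204, 96, 1e-5                             # lam as in Table I
B = rng.normal(0.0, np.sqrt(1.0 / m), size=(m, n))    # B_ij ~ N(0, 1/m), no power constraint
x_true = np.zeros(n)
x_true[rng.choice(n, size=8, replace=False)] = rng.uniform(0.5, 1.0, size=8)  # synthetic signal
y = B @ x_true
x_rec = lasso_ista(B, y, lam)
```

Because the solver iterates for every incoming batch, its cost grows with the amount of data to decode, which is the property the decoding-time comparison later in this section examines.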

DIP is a deep learning approach for reconstructing body pose from a fixed set of IMU sensors. The main idea of DIP and other approaches in this line of work, e.g., [13, 14], is to reconstruct the full body pose, e.g., using the SMPL model, from measurements of IMU sensors placed on the important joints of the human body, i.e., head, spine, two wrists, and two knees. Such frameworks commonly use recurrent neural networks that access the entire training set during training and a shorter time window at test time. The number of IMU sensors can be reduced to 3, e.g., sensors placed on the head and two wrists, but such an approach needs an extensive motion database and a physics-based simulation engine [15].

In comparison with our proposed CS-VAE framework, the aforementioned works in [13, 14, 15] can be viewed as an underdetermined system in which the deep learning models try to recover the full body pose given signals from a few IMU sensors. As such, instead of using a matrix multiplication for downsampling the data, we can emulate the training process of DIP by manually selecting a subset of the 17 IMU sensors in the dataset. Note that in the following results, we use the same VAE architecture and training loss as in Fig. 3 for the DIP baseline approach, rather than the recurrent neural network in [13]. The main reason is that the recurrent neural network needs to see the entire training set during the training process, which may give it an unfair advantage and make the comparison with CS-VAE and Lasso inappropriate. Using a similar architecture and training with batches of signals gives a more reasonable comparison of DIP against CS-VAE and Lasso. The details of simulation parameters are described in Table I.

IV-B Simulation Results

IV-B1 Impacts of the number of measurements

Figure 4: Mean square error of reconstructed signals when the number of measurements $m$ increases.

We evaluate the performance of the proposed framework when the number of measurements $m$ increases from 48 to 192, which are approximately 23% and 94% of the total number of IMU orientation and acceleration features, respectively. We select the numbers of measurements in Fig. 4 to be divisible by 12. The only reason for this selection is that it suits the DIP baseline, which works with a set of IMU sensors in which each IMU has 12 features (9 features for the orientation and 3 features for the acceleration). Unlike DIP, the CS-VAE and Lasso approaches work with any number of measurements. The error bars in Figs. 4, 5, and 6 represent half of the standard deviation around the mean values.

As observed from Fig. 4, the Mean Square Error (MSE) values of most approaches decrease when the number of measurements increases, except for Lasso, which fails to recover the signals. This observation about Lasso shows that the power constraint makes the problem much more difficult to solve. When we remove the power constraint, the relaxed optimization problem can be effectively solved with Lasso, as illustrated by the Lasso w.o. $P_{T}$ baseline. Recall that we use Lasso w.o. $P_{T}$ as an upper bound on performance for comparison. The results show that our CS-VAE approach achieves the highest performance among the power-constrained approaches, i.e., low MSE values, and is closest to the performance of the upper bound solution, i.e., Lasso w.o. $P_{T}$.

The reason for the inferior performance of DIP compared with CS-VAE can be explained as follows. The linear projection of DIP from 204 measurements into a lower number of measurements only preserves the completeness of orientation and acceleration features. For example, the linear projection of DIP with $m=120$ measurements is equivalent to keeping the data from 10 IMU sensors, in which each IMU preserves the full 12 features of orientation and acceleration. With the compressive sensing technique in the other approaches, the completeness of these 12 features no longer holds because the linear projection is performed by the measurement matrix $\mathbf{A}$, which may preserve the features with sparse values rather than keeping the full orientation and acceleration values of certain IMU sensors. We observe that our newly proposed design of the measurement matrix $\mathbf{A}$ is the key factor that makes the CS-VAE approach work well with noisy and sparse signals, given the simple neural network architecture of the VAE.
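The difference between the two projections can be illustrated with a small sketch. Selecting whole sensors corresponds to a 0/1 selection matrix that copies 12 features per chosen IMU and ignores the rest. The helper name `sensor_selection_matrix` is hypothetical, and picking the first 10 sensors is an arbitrary choice made for illustration; the sensors actually kept in the DIP baseline are not listed in this section.

```python
import numpy as np

FEATS_PER_IMU = 12   # 9 orientation + 3 acceleration features
NUM_IMUS = 17

def sensor_selection_matrix(sensor_ids):
    """0/1 matrix that keeps all 12 features of each selected IMU and drops the rest."""
    n = NUM_IMUS * FEATS_PER_IMU
    cols = np.concatenate(
        [np.arange(s * FEATS_PER_IMU, (s + 1) * FEATS_PER_IMU) for s in sensor_ids]
    )
    S = np.zeros((len(cols), n))
    S[np.arange(len(cols)), cols] = 1.0
    return S

m = 120
sensor_ids = list(range(m // FEATS_PER_IMU))   # hypothetical choice: the first 10 sensors
S = sensor_selection_matrix(sensor_ids)

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=NUM_IMUS * FEATS_PER_IMU)  # synthetic normalized frame
y_dip = S @ x                                  # exact copy of 120 of the 204 features
unused_features = int((S.sum(axis=0) == 0).sum())
```

Under this selection, 84 of the 204 features never influence `y_dip`, whereas a dense Gaussian measurement matrix mixes every feature into every measurement.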

IV-B2 Impacts of the channel noise

Figure 5: Mean square error of reconstructed signals when channel noise power increases.

Next, we evaluate the performance of the approaches under different channel noise power values. As we consider the Gaussian channel, the noise power is equal to the variance of the Gaussian noise, i.e., $\sigma_{N}^{2}$ in (2) [26]. We fix the power $P_{T}$ and increase the standard deviation of the noise from $10^{-4}$ to $500\times 10^{-4}$ to obtain the results in Fig. 5. As observed from Fig. 5, the proposed CS-VAE approach achieves better performance than the Lasso and DIP approaches in most of the considered scenarios. Similar to the observation in the previous setting, Lasso fails to reconstruct the signals regardless of the noise level. The relaxed version of Lasso, i.e., Lasso w.o. $P_{T}$, achieves the highest performance as it is not constrained by the transmit power.
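To relate the noise range to a more familiar scale, one can compute the largest signal-to-noise ratio that the power budget allows, $P_{T}/\sigma_{N}^{2}$. This quantity is not reported in the paper and is only an upper bound, since the constraint $\frac{1}{m}\|\mathbf{y}\|_{2}^{2}\leq P_{T}$ permits the transmitted signal to use less than the full budget. The function name `max_snr_db` is hypothetical.

```python
import math

P_T = 0.1  # transmit power bound in watts (Table I)

def max_snr_db(sigma_n, p_t=P_T):
    """Upper bound on the SNR in dB, attained only if the signal uses the full power budget."""
    return 10.0 * math.log10(p_t / sigma_n ** 2)

snr_low_noise = max_snr_db(1e-4)      # smallest noise level in Table I
snr_high_noise = max_snr_db(500e-4)   # largest noise level in Table I
```

Under these assumptions the bound spans roughly 70 dB at the lowest noise level down to about 16 dB at the highest.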

We observe that under a high noise power value, i.e., $\sigma_{N}=500\times 10^{-4}$, CS-VAE performs worse than the DIP approach. The reason is that the design of matrix $\mathbf{A}$ can bound the signal $\mathbf{y}=\mathbf{Ax}$ with $\frac{1}{m}\|\mathbf{y}\|_{2}^{2}\leq P_{T}$, but this also restricts the signal $\mathbf{y}$ to a suboptimal power region. With DIP, we adopt the power normalization strategy in [28], in which the value of $\mathbf{y}$ is normalized by its $l_{2}$ norm $\|\mathbf{y}\|_{2}$. As the results suggest, this power normalization can be effective at high noise levels but becomes less robust at low noise levels. This is also the reason we design a new measurement matrix $\mathbf{A}$ rather than following this power normalization scheme, as the latter yields a nonlinear projection from $\mathbf{x}$ to $\mathbf{y}$, which is in contrast to the linear projection idea in compressive sensing. Nevertheless, the CS-VAE achieves more robust and better performance than Lasso and DIP, and its results are closest to Lasso w.o. $P_{T}$.
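The nonlinearity of norm-based power normalization can be checked numerically. In the sketch below, the scaling $\sqrt{m P_{T}}\,\mathbf{y}/\|\mathbf{y}\|_{2}$ is an assumed form of the scheme, chosen so that the average power equals $P_{T}$ exactly; the exact expression in [28] is not restated in this section. The function name `l2_power_normalize`, the matrix `S`, and the test vectors are hypothetical.

```python
import numpy as np

def l2_power_normalize(y, p_t):
    """Scale y so that (1/m) * ||y||_2^2 equals p_t exactly (assumed form of the scheme)."""
    m = y.shape[0]
    return np.sqrt(m * p_t) * y / np.linalg.norm(y)

rng = np.random.default_rng(0)
n, m, p_t = 204, 96, 0.1
S = rng.normal(size=(m, n))                 # any fixed linear projection
x1, x2 = rng.uniform(-1.0, 1.0, size=(2, n))

# A linear map satisfies S(x1 + x2) = S x1 + S x2 up to rounding error.
linear_gap = np.linalg.norm(S @ (x1 + x2) - (S @ x1 + S @ x2))

# After normalization, additivity no longer holds: the overall map is nonlinear.
normalized_gap = np.linalg.norm(
    l2_power_normalize(S @ (x1 + x2), p_t)
    - (l2_power_normalize(S @ x1, p_t) + l2_power_normalize(S @ x2, p_t))
)
power_after = np.mean(l2_power_normalize(S @ x1, p_t) ** 2)
```

The division by $\|\mathbf{y}\|_{2}$ depends on the signal itself, which is why the composite map from $\mathbf{x}$ to the transmitted signal cannot be written as a single fixed matrix.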

IV-B3 Decoding latency with respect to the number of input samples

Figure 6: Decoding time at the receiver when the number of input samples increases.

Next, we investigate the decoding time at the receiver under different sizes of the input samples in Fig. 6. The number of input samples $\mathbf{\hat{y}}$ fed into the VAE is equal to the number of measurements $m$ multiplied by the batch size. Note that in Fig. 6, we use approximate values for the number of input samples on the x-axis for ease of illustration. The exact values are calculated as $m\times b$, where $m$ is the number of measurements and $b$ is the batch size. For example, with $10\times 10^{3}$ input samples in Fig. 6, we use $m=168$ and $b=60$ to produce 10,080 input samples. These 10,080 input samples are actually $14\times 12\times 60$ samples, where 14 is the number of IMU sensors, 12 is the number of features generated by each IMU sensor per frame, and 60 is the number of frames per second. In other words, a sequence of $10\times 10^{3}$ input samples is equivalent to data generated by the IMU sensors in one second. Similarly, a sequence of $70\times 10^{3}$ input samples is equivalent to data generated by the IMU sensors in 7 seconds.
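The sample counts quoted above follow from simple arithmetic, reproduced in the short sketch below. The function names are hypothetical, and the batch size $b=420$ for the 7-second case is inferred from the frame rate of 60 fps rather than stated in the text.

```python
FEATS_PER_IMU = 12   # 9 orientation + 3 acceleration features
FPS = 60             # frames per second (Table I)

def num_input_samples(m, batch_size):
    """Exact number of input samples: measurements per frame times frames per batch."""
    return m * batch_size

def seconds_of_motion(batch_size, fps=FPS):
    """Duration of motion covered by one batch of frames."""
    return batch_size / fps

one_second = num_input_samples(168, 60)       # plotted as 10 x 10^3 in Fig. 6
seven_seconds = num_input_samples(168, 420)   # plotted as 70 x 10^3 (b = 420 is inferred)
equivalent_sensors = 168 // FEATS_PER_IMU     # 168 measurements match 14 full IMUs
```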

The transmission of a batch of data samples corresponds to the case in which we collect a batch of samples over a period of time and then transmit them to the receiver. For Lasso and Lasso w.o. $P_{T}$, this batch of samples can be used directly as the input of the optimization solver, as the Lasso model can handle matrix-like input data. For the VAE-based approaches, the decoding time is measured as the time of a single forward pass from the input to the output of the pre-trained VAE models to obtain the reconstructed signals. With Lasso, the decoding time is measured as the time to find the solution of the optimization problem. We use the same central processing unit (CPU) to compute the decoding time for all approaches.

As observed from Fig. 6, the decoding time values of the CS-VAE and DIP approaches are significantly lower than those of the Lasso approaches. The main reason is that the search space of Lasso’s solver increases as the number of input samples increases. In contrast, the decoding time values of CS-VAE and DIP increase slowly with the number of input samples because the pre-trained VAE models only need to forward the received signals through the deep neural network layers to obtain the reconstructed signals. Notably, the cost of reaching high reconstruction accuracy makes Lasso w.o. $P_{T}$ significantly slower than the deep learning approaches.

It is worth mentioning that the CS-VAE and DIP approaches need to be pre-trained before being applied to get the results in Fig. 6, while the Lasso approach does not have this pre-training process. However, the pre-training process can be greatly accelerated by training on a modern graphics processing unit (GPU). The observation in this experiment also suggests the real-time decoding capability of the CS-VAE approach with lower decoding latency. For example, with $10^{4}$ input samples, which is equivalent to one second of sequence data, the decoding time of the CS-VAE approach is approximately $8\times 10^{-2}$ seconds. We observe that although it shares the same network architecture with the CS-VAE approach, the DIP approach has a slightly higher decoding time. The main reason is that the implementation of the linear projection matrix of DIP requires a few extra matrix transformation and de-transformation steps. However, this difference in the decoding time is not significant.

IV-B4 3D pose estimation from reconstructed signals

Figure 7: Reconstructed 3D poses from $m=168$ noisy compressed measurements with noise $\sigma_{N}=10\times 10^{-4}$.

In Fig. 7, we draw random poses from the reconstructed signals in the test set. The poses are generated from the SMPL model and the pre-trained VAE model (modeled as a function $F(\mathbf{\hat{x}}):\mathbb{R}^{204}\rightarrow\mathbb{R}^{72}$) as described in Section IV-A2. In this experiment, we reconstruct the poses from 168 measurements (i.e., 82% of the total signals) with channel noise standard deviation $\sigma_{N}=10\times 10^{-4}$. As observed from Fig. 7, our reconstructed poses are more accurate than those of the Lasso and DIP approaches. As Lasso w.o. $P_{T}$ is considered the upper bound in all the signal reconstruction scenarios, Fig. 7 clearly shows that the 3D poses reconstructed by Lasso w.o. $P_{T}$ (green poses in the fourth column) are the most similar to the ground truth reference poses (red poses in the first column). The 3D poses obtained by CS-VAE accurately mimic the body poses of the references with slight errors in arm movements, e.g., the pose in the second row. With Lasso, as suggested by the poor signal reconstruction performance, the 3D poses fail to mimic the reference poses. In our experiment, the received signals are down-sampled and contain additive noise, and the models do not have access to the full training set as in [13], resulting in poor performance of the DIP approach.

IV-B5 Pose interpolation without input signals

Figure 8: Interpolation between key poses (red avatars) without using input IMU signals. The gray avatars represent the poses generated by the VAE when the input IMU signals are missing, e.g., due to transmission loss.

We further use the pre-trained latent space and the decoder of the VAE to generate novel synthetic poses to animate the avatar. The ability to generate synthetic data is one of the most important features of generative models, which has not been well demonstrated in the literature on wireless systems. In particular, we consider a simple pose interpolation task as follows. Given two key poses, illustrated by the left and right poses in red in Fig. 8, we aim to create a smooth transition between these two key poses by generating the intermediate poses shown in light gray. Creating the intermediate poses is important because the IMU signals might be lost during transmission over severely lossy wireless channels. In such a case, if the receiver cannot fill in the missing poses or the transmitter does not retransmit the data, the user may experience motion sickness in the virtual 3D environment, which decreases the quality of experience.

Given the above setting, as illustrated in Fig. 8, we obtain the two key poses by reconstructing the IMU signals as in the previous experiments. Let $\mathbf{\hat{y}}_{1}$ and $\mathbf{\hat{y}}_{2}$ denote the received signals corresponding to the two key poses, respectively. The VAE’s encoder maps the signals into two corresponding vectors in the latent space, $\mathbf{z}_{1}=q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{y}=\mathbf{\hat{y}}_{1})$ and $\mathbf{z}_{2}=q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{y}=\mathbf{\hat{y}}_{2})$. As a result, the reconstructed signals corresponding to the two key poses are $\mathbf{\hat{x}}_{1}=p_{\boldsymbol{\theta}}(\mathbf{x}|\mathbf{z}=\mathbf{z}_{1})$ and $\mathbf{\hat{x}}_{2}=p_{\boldsymbol{\theta}}(\mathbf{x}|\mathbf{z}=\mathbf{z}_{2})$. To interpolate between the two latent vectors $\mathbf{z}_{1}$ and $\mathbf{z}_{2}$, we use the intermediate latent vectors $\mathbf{z}_{j}=\varrho\mathbf{z}_{2}+(1-\varrho)\mathbf{z}_{1}$, where $\varrho\in[0,1]$ is the interpolation parameter. Intuitively, $\varrho=1$ makes the intermediate latent vector $\mathbf{z}_{j}=\mathbf{z}_{2}$, resulting in a pose similar to the key pose on the right of Fig. 8. Similarly, $\varrho=0$ gives $\mathbf{z}_{j}=\mathbf{z}_{1}$. Any value of $\varrho\in[0,1]$ creates a latent vector that is a convex combination of the two vectors $\mathbf{z}_{1}$ and $\mathbf{z}_{2}$. As a result, we can make a smooth transition of the avatar poses by simply increasing the value of $\varrho$, as shown in Fig. 8. As described, there is no need for the input signals when interpolation takes place in the learned latent space $\mathbf{z}$. Similar linear interpolation techniques have been explored to create synthesized data in various domains [34]. Our experiment shows the potential extension of the proposed framework to future VR/XR applications in conjunction with generative modeling, where the synthesized data can be utilized.
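The interpolation rule $\mathbf{z}_{j}=\varrho\mathbf{z}_{2}+(1-\varrho)\mathbf{z}_{1}$ can be sketched in a few lines. The function name `interpolate_latents` and the number of steps are hypothetical, and the latent vectors below are random stand-ins for encoder outputs. Each row of the result would then be passed through the trained decoder, the mapping $F$, and SMPL to obtain one intermediate pose; those stages are not reproduced here.

```python
import numpy as np

def interpolate_latents(z1, z2, num_steps):
    """Latent vectors z_j = rho * z2 + (1 - rho) * z1 for rho evenly spaced in [0, 1]."""
    rhos = np.linspace(0.0, 1.0, num_steps)
    return np.stack([rho * z2 + (1.0 - rho) * z1 for rho in rhos])

rng = np.random.default_rng(0)
z1, z2 = rng.normal(size=(2, 10))               # stand-ins for the two 10-dim key latents
path = interpolate_latents(z1, z2, num_steps=6)  # first row is z1, last row is z2
```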

V Conclusion

In this paper, we have developed a novel framework for 3D human pose estimation from IMU sensors with generative model-based compressive sensing. The proposed framework helps the IMU sensors reduce the amount of information exchanged with the receiver, thus further enhancing channel utilization and encoding-decoding efficiency. At the receiver’s side, we have employed a deep generative model, i.e., a VAE, that can recover the original signals from noisy compressed samples. With the generative model at the receiver, decoding is an order of magnitude faster than with Lasso. We have further demonstrated that the proposed framework can learn a latent representation space and generate synthetic data samples, making it possible to fill in missing data (e.g., due to lossy transmissions) without using input data from the IMU sensors. Our findings suggest that the proposed generative model-based compressive sensing framework can achieve state-of-the-art performance in challenging scenarios with severe noise and fewer measurements, compared with other optimization and deep learning approaches. One potential research direction from this work is extending the designed measurement matrix to other, more complicated channel models, such as additive noise fading channel, Rayleigh fading channel, and Rician fading channel. Multiple-access channels with interference can also be further considered with additional constraints on the design of the measurement matrix.

Appendix A Proof of Proposition 1

The purpose of Proposition 1 is to construct a measurement matrix that satisfies the S-REC property in Definition 4 and the power constraint in (14), so that the recovered signal of the generative model is unique and accurate with high probability. In the following, we prove that if each element $A_{ij}$ (the $j$-th element of the $i$-th row of matrix $\mathbf{A}$) follows a normal distribution $A_{ij}\sim\mathcal{N}\Big(0,\frac{P_{T}}{n^{2}d^{2}(d\sigma_{x}+\mu_{x})^{2}}\Big)$, then the matrix $\mathbf{A}\in\mathbb{R}^{m\times n}$ satisfies the S-REC property and the power constraint. The first part of this section proves the guarantee of the power constraint. The second part proves the S-REC property of the measurement matrix.

A-A Proof of $\mathbf{y}=\mathbf{Ax}$ guaranteeing the power constraint of a Gaussian channel

First, we need to prove that the matrix $\mathbf{A}$ satisfies the power constraint

$$\frac{1}{m}\|\mathbf{y}\|_{2}^{2}\leq P_{T}. \tag{18}$$

By using the $l_{2}$ norm definition in (4), the power constraint can be rewritten as [26, Equation 9.2, Chapter 9]:

$$\frac{1}{m}\sum_{i=1}^{m}|y_{i}|^{2}\leq P_{T}. \tag{19}$$

Recall that we have an $n$-dimensional vector $\mathbf{x}=[x_{1},x_{2},\ldots,x_{j},\ldots,x_{n}]^{\top}$ and the $m$-dimensional vector $\mathbf{y}=[y_{1},y_{2},\ldots,y_{i},\ldots,y_{m}]^{\top}$, where the superscript $\top$ denotes the transpose. The measurement matrix $\mathbf{A}\in\mathbb{R}^{m\times n}$ is defined by:

$$\mathbf{A}=\left[\begin{array}{cccc}A_{11}&A_{12}&\cdots&A_{1n}\\ A_{21}&A_{22}&\cdots&A_{2n}\\ \vdots&\vdots&\ddots&\vdots\\ A_{m1}&A_{m2}&\cdots&A_{mn}\end{array}\right].$$

Using matrix calculation, we have $y_{i}=\sum_{j=1}^{n}A_{ij}x_{j}$ with $i=1,2,\ldots,m$. Following the definition of the $l_{p}$ norm in (4), we have

\begin{align}
\frac{1}{m}\|\mathbf{y}\|_{2}^{2} &= \frac{1}{m}\sum_{i=1}^{m}|y_{i}|^{2} \tag{20a}\\
&= \frac{1}{m}\sum_{i=1}^{m}\Big(\sum_{j=1}^{n}A_{ij}x_{j}\Big)^{2} \tag{20b}\\
&\leq \frac{1}{m}\sum_{i=1}^{m}\Big[\big(\sum_{j=1}^{n}A_{ij}^{2}\big)\big(\sum_{j=1}^{n}x_{j}^{2}\big)\Big], \tag{20c}
\end{align}

where (20c) directly applies the Cauchy–Schwarz inequality, i.e., $\Big(\sum_{j=1}^{n}A_{ij}x_{j}\Big)^{2}\leq\big(\sum_{j=1}^{n}A_{ij}^{2}\big)\big(\sum_{j=1}^{n}x_{j}^{2}\big)$. In the following, we derive the bounds for the two terms inside (20c), which are $\sum_{j=1}^{n}A_{ij}^{2}$ and $\sum_{j=1}^{n}x_{j}^{2}$. For this purpose, we use Chebyshev’s inequality [26, Chapter 3, Equation 3.32], which can be stated as follows.

Definition 5 (Chebyshev’s inequality).

Let $X$ be a random variable with mean $\mu$ and variance $\sigma^{2}$. For any $\varepsilon>0$,

$$\mathbb{P}\big(|X-\mu|>\varepsilon\big)\leq\frac{\sigma^{2}}{\varepsilon^{2}}.$$

In our setting, we are interested in how far the random variables $x_{j}$ and $A_{ij}$ can deviate from their mean values. Letting $\varepsilon=d\sigma$ for a real number $d>0$, Chebyshev’s inequality can be rewritten as

$$\mathbb{P}\big(|X-\mu|>d\sigma\big)\leq\frac{1}{d^{2}}. \tag{21}$$

Equivalently, we have $|X-\mu|\leq d\sigma$ with probability at least $1-\frac{1}{d^{2}}$, i.e., $\mathbb{P}\big(|X-\mu|\leq d\sigma\big)\geq 1-\frac{1}{d^{2}}$. By choosing the value of $d$, we can bound the value of $X$ within a certain distance from its mean with known probability. For sufficiently large $d$, we have the following inequalities

$$\mu-d\sigma\leq X\leq\mu+d\sigma, \tag{22}$$

with high probability.
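As a quick numerical illustration of (21), the sketch below draws samples and compares the observed tail fraction with the bound $1/d^{2}$. It assumes Gaussian samples with illustrative values of $\mu$ and $\sigma$ (not the dataset statistics), and the helper name `tail_fraction` is hypothetical.

```python
import random


def tail_fraction(samples, mu, sigma, d):
    """Fraction of samples farther than d standard deviations from the mean."""
    return sum(abs(x - mu) > d * sigma for x in samples) / len(samples)


rng = random.Random(0)          # fixed seed so the run is repeatable
mu, sigma = 0.5, 2.0            # illustrative values, not from the dataset
samples = [rng.gauss(mu, sigma) for _ in range(100_000)]

for d in (2, 3, 4):
    observed = tail_fraction(samples, mu, sigma, d)
    print(f"d={d}: observed tail {observed:.5f}, Chebyshev bound {1 / d**2:.5f}")
```

Chebyshev’s inequality holds for any distribution with finite variance, so for Gaussian samples the observed tail is much smaller than the bound; the bound is conservative rather than tight.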

Now, by using the inequalities in (22) for random variables $x_{j}$ associated with mean $\mu_{x}$ and variance $\sigma_{x}^{2}$, and $A_{ij}$ associated with zero mean and variance $\sigma_{a}^{2}$, we have the following inequalities

\begin{align}
-d\sigma_{a} &\leq A_{ij}\leq d\sigma_{a}, \tag{23a}\\
\mu_{x}-d\sigma_{x} &\leq x_{j}\leq\mu_{x}+d\sigma_{x}, \tag{23b}
\end{align}

with high probabilities. By taking the square of $A_{ij}$ and $x_{j}$, we have

\begin{align}
A_{ij}^{2} &\leq d^{2}\sigma_{a}^{2}, \tag{24a}\\
x_{j}^{2} &\leq \max\big[(d\sigma_{x}+\mu_{x})^{2},(-d\sigma_{x}+\mu_{x})^{2}\big], \tag{24b}
\end{align}

with high probabilities. Taking the sum over $n$ samples, we have

$$\sum_{j=1}^{n}A_{ij}^{2}\leq nd^{2}\sigma_{a}^{2}, \tag{25}$$

and

$$\sum_{j=1}^{n}x_{j}^{2}\leq\max\big[n(d\sigma_{x}+\mu_{x})^{2},n(-d\sigma_{x}+\mu_{x})^{2}\big], \tag{26}$$

with high probabilities. As we empirically observe from the dataset that $\mu_{x}>0$, the bound in (26) can be simplified as

$$\sum_{j=1}^{n}x_{j}^{2}\leq n(d\sigma_{x}+\mu_{x})^{2}. \tag{27}$$

Substituting (25) and (27) into (20c), we finally have

\begin{align}
\frac{1}{m}\|\mathbf{y}\|_{2}^{2} &\leq \frac{1}{m}\sum_{i=1}^{m}nd^{2}\sigma_{a}^{2}\,n(d\sigma_{x}+\mu_{x})^{2} \tag{28a}\\
&= n^{2}d^{2}(d\sigma_{x}+\mu_{x})^{2}\sigma_{a}^{2}. \tag{28b}
\end{align}

Since we require the power constraint $\frac{1}{m}\|\mathbf{y}\|_{2}^{2}\leq P_{T}$, choosing $n^{2}d^{2}(d\sigma_{x}+\mu_{x})^{2}\sigma_{a}^{2}=P_{T}$ gives the variance of $A_{ij}$ as

$$\sigma_{a}^{2}=\frac{P_{T}}{n^{2}d^{2}(d\sigma_{x}+\mu_{x})^{2}}. \tag{29}$$

The derivation above for $\sigma_{a}$ proves the power-constraint part of Proposition 1. In other words, by generating the measurement matrix $\mathbf{A}$ from the normal distribution with zero mean and variance $\sigma_{a}^{2}$, the transmit power constraint $\frac{1}{m}\|\mathbf{y}\|_{2}^{2}\leq P_{T}$ can be achieved with high probability.
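The construction can be checked numerically with a minimal sketch: draw the entries of $\mathbf{A}$ with the variance in (29), form $\mathbf{y}=\mathbf{Ax}$, and compare $\frac{1}{m}\|\mathbf{y}\|_{2}^{2}$ with $P_{T}$. All numerical values below (dimensions, $d$, $P_{T}$, $\mu_{x}$, $\sigma_{x}$) are illustrative assumptions rather than the experimental settings, the signal is assumed Gaussian for simplicity, and the helper names `sigma_a` and `average_power` are hypothetical.

```python
import math
import random


def sigma_a(p_t, n, d, sigma_x, mu_x):
    """Standard deviation of A_ij from (29): sqrt(P_T) / (n d (d sigma_x + mu_x))."""
    return math.sqrt(p_t) / (n * d * (d * sigma_x + mu_x))


def average_power(A, x):
    """(1/m) * ||A x||_2^2 for a matrix given as a list of rows."""
    y = [sum(a * xj for a, xj in zip(row, x)) for row in A]
    return sum(yi * yi for yi in y) / len(y)


rng = random.Random(1)
n, m = 64, 16                  # signal and measurement dimensions (illustrative)
d, p_t = 3.0, 1.0              # Chebyshev multiplier and power budget (illustrative)
mu_x, sigma_x = 0.2, 0.3       # assumed signal statistics with mu_x > 0

x = [rng.gauss(mu_x, sigma_x) for _ in range(n)]
s = sigma_a(p_t, n, d, sigma_x, mu_x)
A = [[rng.gauss(0.0, s) for _ in range(n)] for _ in range(m)]

power = average_power(A, x)
print(f"sigma_a={s:.3e}, average power={power:.3e}, budget={p_t}")
```

Because the derivation chains the Cauchy–Schwarz inequality with worst-case element-wise bounds, the measured power in such a run is expected to sit well below the budget $P_{T}$.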

A-B Proof of $\mathbf{Ax}$ satisfying S-REC property

Next, we prove that a measurement matrix $\mathbf{A}$ that follows Proposition 1 satisfies the S-REC property in Definition 4. In particular, the matrix $\mathbf{A}$ is said to satisfy $\text{S-REC}(S_{G},\gamma,\kappa)$, i.e., the set-restricted eigenvalue condition on the set $S_{G}$ (defined in (13)) with the parameters $\gamma>0$ and $\kappa\geq 0$, if $\forall\mathbf{x}_{1},\mathbf{x}_{2}\in S_{G}$,

$$\|\mathbf{A}\big(G(\mathbf{z}_{1})-G(\mathbf{z}_{2})\big)\|_{2}\geq\gamma\|G(\mathbf{z}_{1})-G(\mathbf{z}_{2})\|_{2}-\kappa. \tag{30}$$

The range of the generator $G(\mathbf{z})$ can be easily bounded by the output layer of the deep neural network. In our experiments, we use the Tanh activation function as the output of the neural network. Therefore, we have a simple bound $-1\leq G(\mathbf{z})\leq 1$, which is similar to the bound of the processed signal vectors $\mathbf{x}^{*}$.
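A small sketch of what the Tanh bound implies: if two output vectors have entries in $[-1,1]$, their difference has entries in $[-2,2]$, and hence an $l_{2}$ norm of at most $2\sqrt{n}$. The vectors below are stand-ins produced by applying `tanh` to arbitrary random pre-activations; they are an assumption for illustration and are not outputs of the trained model.

```python
import math
import random

rng = random.Random(2)
n = 32  # output dimension (illustrative)

# Stand-ins for two generator outputs: tanh of arbitrary pre-activations.
g1 = [math.tanh(rng.gauss(0.0, 3.0)) for _ in range(n)]
g2 = [math.tanh(rng.gauss(0.0, 3.0)) for _ in range(n)]

# Element-wise difference and its l2 norm.
v = [a - b for a, b in zip(g1, g2)]
l2_norm_v = math.sqrt(sum(c * c for c in v))
bound = 2.0 * math.sqrt(n)

print(f"max |v_j| = {max(abs(c) for c in v):.3f} (at most 2)")
print(f"||v||_2 = {l2_norm_v:.3f}, 2*sqrt(n) = {bound:.3f}")
```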

Let us define a vector $\mathbf{v}=G(\mathbf{z}_{1})-G(\mathbf{z}_{2})$, where each element $v_{j}$ ($j=1,2,\ldots,n$) of $\mathbf{v}$ is bounded by $-2\leq v_{j}\leq 2$. Then (30) can be rewritten as

$$\|\mathbf{Av}\|_{2}\geq\gamma\|\mathbf{v}\|_{2}-\kappa. \tag{31}$$

By using the definition of the $l_{p}$ norm in (4), we have

\begin{align}
\gamma\|\mathbf{v}\|_{2} &= \gamma\sqrt{\sum_{j=1}^{n}v_{j}^{2}} \tag{32a}\\
&\leq \gamma\sqrt{4n} \tag{32b}\\
&= 2\gamma\sqrt{n}, \tag{32c}
\end{align}

where the inequality in the second line is obtained by the bound $-2\leq v_{j}\leq 2$. As a result, we have the following inequality for the term on the right-hand side of (30):

$$\gamma\|\mathbf{v}\|_{2}-\kappa\leq 2\gamma\sqrt{n}-\kappa. \tag{33}$$

To find a possible lower bound for $\|\mathbf{Av}\|_{2}$, we use the inequalities between the $l_{1}$ and $l_{2}$ norms, and then find the probabilistic lower bound of the $l_{1}$ norm based on the Bernstein inequality. In particular, we apply the following inequality (see equations (A.3) and (A.4) in Definition A.2 of [20]):

$$\|\mathbf{Av}\|_{2}\geq\frac{1}{\sqrt{m}}\|\mathbf{Av}\|_{1}. \tag{34}$$
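The norm inequality in (34) holds for any $m$-dimensional vector and can be checked numerically. In the sketch below, `u` is a random stand-in for $\mathbf{Av}$, and the helper names `l1_norm` and `l2_norm` are hypothetical.

```python
import math
import random


def l1_norm(u):
    """Sum of absolute values."""
    return sum(abs(c) for c in u)


def l2_norm(u):
    """Euclidean norm."""
    return math.sqrt(sum(c * c for c in u))


rng = random.Random(3)
m = 16
u = [rng.gauss(0.0, 1.0) for _ in range(m)]   # stand-in for the vector A v

lhs = l2_norm(u)
rhs = l1_norm(u) / math.sqrt(m)
print(f"||u||_2 = {lhs:.4f}, ||u||_1 / sqrt(m) = {rhs:.4f}")

# Equality holds when all entries share the same magnitude.
flat = [1.0, -1.0, 1.0, -1.0]
print(l2_norm(flat), l1_norm(flat) / math.sqrt(len(flat)))
```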

Using the definition of the $l_{1}$ norm in (4), (34) can be rewritten as

\begin{align}
\|\mathbf{Av}\|_{2} &\geq \frac{1}{\sqrt{m}}\sum_{j=1}^{n}\big|A_{ij}v_{j}\big| \tag{35a}\\
&\geq \frac{1}{\sqrt{m}}\Big|\sum_{j=1}^{n}A_{ij}v_{j}\Big|, \tag{35b}
\end{align}

where the inequality in the second line is obtained by using the generalized triangle inequality. Next, we use a probabilistic lower bound for (35b) that applies the Bernstein inequality (see Theorem 7.27, Chapter 7 of [20]), i.e., given the measurement matrix $\mathbf{A}$ whose elements $A_{ij}$ are zero-mean subgaussian random variables, we have the following probabilistic lower bound:

$$\mathbb{P}\Big(\big|\sum_{j=1}^{n}A_{ij}v_{j}\big|\geq t\Big)\leq 2\exp\Big(\frac{-t^{2}}{4c\|\mathbf{A}\|_{2}^{2}}\Big), \tag{36}$$

for all $t>0$, where $c$ is a subgaussian parameter. Letting $t=t_{0}\sqrt{m}$ with $t_{0}>0$, (36) can be rewritten as

\begin{align}
&\mathbb{P}\Big(\big|\sum_{j=1}^{n}A_{ij}v_{j}\big|\geq\sqrt{m}\,t_{0}\Big)\leq 2\exp\Big(\frac{-t_{0}^{2}m^{2}}{4c\|\mathbf{A}\|_{2}^{2}}\Big) \tag{37a}\\
\Rightarrow\ &\mathbb{P}\Big(\frac{1}{\sqrt{m}}\big|\sum_{j=1}^{n}A_{ij}v_{j}\big|\geq t_{0}\Big)\leq 2\exp\Big(\frac{-t_{0}^{2}m^{2}}{4c\|\mathbf{A}\|_{2}^{2}}\Big). \tag{37b}
\end{align}

Using the inequality in (35), (37b) becomes

$$\mathbb{P}\Big(\|\mathbf{Av}\|_{2}\geq t_{0}\Big)\leq 2\exp\Big(\frac{-t_{0}^{2}m^{2}}{4c\|\mathbf{A}\|_{2}^{2}}\Big). \tag{38}$$

By choosing $t_{0}=2\gamma\sqrt{n}-\kappa$, (31) can be written as

$$\|\mathbf{Av}\|_{2}\geq 2\gamma\sqrt{n}-\kappa, \tag{39}$$

with probability 12exp((2γmnmκ)24c𝐀22)12superscript2𝛾𝑚𝑛𝑚𝜅24𝑐superscriptsubscriptnorm𝐀221-2\exp\Big{(}\frac{(-2\gamma m\sqrt{n}-m\kappa)^{2}}{4c\|\mathbf{A}\|_{2}^{2}% }\Big{)}1 - 2 roman_exp ( divide start_ARG ( - 2 italic_γ italic_m square-root start_ARG italic_n end_ARG - italic_m italic_κ ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG 4 italic_c ∥ bold_A ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG ). Applying inequality in (33), i.e., 2γnκγ𝐯2κ2𝛾𝑛𝜅𝛾subscriptnorm𝐯2𝜅2\gamma\sqrt{n}-\kappa\geq\gamma\|\mathbf{v}\|_{2}-\kappa2 italic_γ square-root start_ARG italic_n end_ARG - italic_κ ≥ italic_γ ∥ bold_v ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT - italic_κ, we have

$$\|\mathbf{Av}\|_{2}\geq\gamma\|\mathbf{v}\|_{2}-\kappa \tag{40}$$

with probability at least $1-2\exp\Big(\frac{(-2\gamma m\sqrt{n}-m\kappa)^{2}}{4c\|\mathbf{A}\|_{2}^{2}}\Big)$. As $\mathbf{v}$ is defined by $\mathbf{v}=G(\mathbf{z}_{1})-G(\mathbf{z}_{2})$, we finally have

$$\|\mathbf{A}\big(G(\mathbf{z}_{1})-G(\mathbf{z}_{2})\big)\|_{2}\geq\gamma\|G(\mathbf{z}_{1})-G(\mathbf{z}_{2})\|_{2}-\kappa \tag{41}$$

with probability at least $1-2\exp\Big(\frac{(-2\gamma m\sqrt{n}-m\kappa)^{2}}{4c\|\mathbf{A}\|_{2}^{2}}\Big)$. This proves the S-REC($S_{G},\gamma,\kappa$) property in (30).

Combining (41) and (29) completes the proof.
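The S-REC inequality in (41) can also be probed empirically by sampling pairs of latent vectors and counting how often the inequality fails. The sketch below is illustrative only and makes several assumptions that are not taken from the paper: the generator $G$ is replaced by a hypothetical random two-layer ReLU network, $\mathbf{A}$ is drawn with i.i.d. Gaussian entries of variance $1/m$, and the latent vectors are standard Gaussian. Sampling can only give evidence for a given $(\gamma,\kappa)$; it does not verify the property over all of $S_{G}$.

```python
import numpy as np

def make_toy_generator(k, n, hidden, rng):
    """Hypothetical stand-in for G: a random two-layer ReLU network R^k -> R^n."""
    W1 = rng.normal(size=(hidden, k)) / np.sqrt(k)
    W2 = rng.normal(size=(n, hidden)) / np.sqrt(hidden)
    return lambda z: W2 @ np.maximum(W1 @ z, 0.0)

def srec_violation_rate(A, G, gamma, kappa, k, num_pairs, rng):
    """Fraction of sampled pairs (z1, z2) that violate
    ||A (G(z1) - G(z2))||_2 >= gamma * ||G(z1) - G(z2)||_2 - kappa."""
    violations = 0
    for _ in range(num_pairs):
        z1, z2 = rng.normal(size=k), rng.normal(size=k)
        v = G(z1) - G(z2)
        if np.linalg.norm(A @ v) < gamma * np.linalg.norm(v) - kappa:
            violations += 1
    return violations / num_pairs

rng = np.random.default_rng(0)
k, n, m = 8, 128, 32  # latent, signal, and measurement dimensions (illustrative)
G = make_toy_generator(k, n, hidden=64, rng=rng)
# Assumption: Gaussian measurement matrix; the paper's own design of A may differ.
A = rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n))

for gamma in (0.25, 0.5, 1.0):
    rate = srec_violation_rate(A, G, gamma, kappa=0.0, k=k, num_pairs=2000, rng=rng)
    print(f"gamma={gamma}: empirical violation rate {rate:.3f}")
```

Increasing $\kappa$ or decreasing $\gamma$ relaxes the inequality, so the empirical violation rate can only decrease, which matches the role of the two parameters in the S-REC definition.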

References

  • [1] W. Liu, Q. Bao, Y. Sun, and T. Mei, “Recent advances of monocular 2d and 3d human pose estimation: A deep learning perspective,” ACM Computing Surveys, vol. 55, no. 4, pp. 1–41, Nov. 2022.
  • [2] Vicon, “Vicon—award winning motion capture systems,” accessed: Sep. 2023. [Online]. Available: https://www.vicon.com/
  • [3] Y. Siriwardhana, P. Porambage, M. Liyanage, and M. Ylianttila, “A survey on mobile augmented reality with 5g mobile edge computing: Architectures, applications, and technical aspects,” IEEE Communications Surveys & Tutorials, vol. 23, no. 2, pp. 1160–1192, 2021.
  • [4] B. Sébire and H. Xu, “Extended reality for nr,” accessed: Sep. 2023. [Online]. Available: https://www.3gpp.org/technologies/xr-nr
  • [5] A. Behravan et al., “Positioning and sensing in 6g: Gaps, challenges, and opportunities,” IEEE Vehicular Technology Magazine, Dec. 2022.
  • [6] T. Von Marcard, B. Rosenhahn, M. J. Black, and G. Pons-Moll, “Sparse inertial poser: Automatic 3d human pose estimation from sparse imus,” in Computer Graphics Forum, vol. 36, no. 2, May 2017, pp. 349–360.
  • [7] Movella, “Superior xsens motion capture technology, optimized for human movement,” accessed: Sep. 2023. [Online]. Available: https://www.movella.com/products/wearables/xsens-mtw-awinda
  • [8] N. Q. Hieu, D. N. Nguyen, D. T. Hoang, and E. Dutkiewicz, “When virtual reality meets rate splitting multiple access: A joint communication and computation approach,” IEEE Journal on Selected Areas in Communications, vol. 41, no. 5, pp. 1536–1548, Jan. 2023.
  • [9] Y. Zhang, P. Zhao, K. Bian, Y. Liu, L. Song, and X. Li, “Drl360: 360-degree video streaming with deep reinforcement learning,” in IEEE Conference on Computer Communications, Apr. 2019, pp. 1252–1260.
  • [10] Y. Jiang, Z. Li, and J. Wang, “Ptrack: Enhancing the applicability of pedestrian tracking with wearables,” IEEE Transactions on Mobile Computing, vol. 18, no. 2, pp. 431–443, May 2018.
  • [11] V. Guzov, A. Mir, T. Sattler, and G. Pons-Moll, “Human poseitioning system (hps): 3d human pose estimation and self-localization in large scenes from body-mounted sensors,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 4318–4329.
  • [12] D. Roetenberg et al., “Xsens mvn: Full 6dof human motion tracking using miniature inertial sensors,” Xsens Motion Technologies BV, Tech. Rep, vol. 1, pp. 1–7, Apr. 2009.
  • [13] Y. Huang, M. Kaufmann, E. Aksan, M. J. Black, O. Hilliges, and G. Pons-Moll, “Deep inertial poser: Learning to reconstruct human pose from sparse inertial measurements in real time,” ACM Transactions on Graphics, vol. 37, no. 6, pp. 1–15, Dec. 2018.
  • [14] X. Yi et al., “Physical inertial poser (pip): Physics-aware real-time human motion tracking from sparse inertial sensors,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 13 167–13 178.
  • [15] A. Winkler, J. Won, and Y. Ye, “Questsim: Human motion tracking from sparse sensors with simulated avatars,” in SIGGRAPH Asia 2022 Conference Papers, 2022, pp. 1–8.
  • [16] L. Kong, D. Zhang, Z. He, Q. Xiang, J. Wan, and M. Tao, “Embracing big data with compressive sensing: A green approach in industrial wireless networks,” IEEE Communications Magazine, vol. 54, no. 10, pp. 53–59, Oct. 2016.
  • [17] J. W. Choi, B. Shim, Y. Ding, B. Rao, and D. I. Kim, “Compressed sensing for wireless communications: Useful tips and tricks,” IEEE Communications Surveys & Tutorials, vol. 19, no. 3, pp. 1527–1550, Feb. 2017.
  • [18] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” in International Conference on Learning Representations, Apr. 2014, pp. 1–14.
  • [19] M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black, “Smpl: A skinned multi-person linear model,” in Seminal Graphics Papers: Pushing the Boundaries, Volume 2, 2023, pp. 851–866.
  • [20] S. Foucart and H. Rauhut, A Mathematical Introduction to Compressive Sensing.   Springer, 2013.
  • [21] M. Dhar, A. Grover, and S. Ermon, “Modeling sparse deviations for compressed sensing using generative models,” in International Conference on Machine Learning, Jul. 2018, pp. 1214–1223.
  • [22] R. Tibshirani, “Regression shrinkage and selection via the lasso,” Journal of the Royal Statistical Society Series B: Statistical Methodology, vol. 58, no. 1, pp. 267–288, Jan. 1996.
  • [23] A. Bora, A. Jalal, E. Price, and A. G. Dimakis, “Compressed sensing using generative models,” in International Conference on Machine Learning, Jul. 2017, pp. 537–546.
  • [24] I. Goodfellow et al., “Generative adversarial nets,” Advances in Neural Information Processing Systems, vol. 27, 2014.
  • [25] D. P. Kingma and M. Welling, “An introduction to variational autoencoders,” Foundations and Trends in Machine Learning, vol. 12, no. 4, pp. 307–392, Nov. 2019.
  • [26] T. M. Cover and J. A. Thomas, Elements of Information Theory 2nd Edition (Wiley Series in Telecommunications and Signal Processing).   Wiley-Interscience, Jul. 2006.
  • [27] S. Feizi, M. Médard, and M. Effros, “Compressive sensing over networks,” in 48th Annual Allerton Conference on Communication, Control, and Computing, Sep. 2010, pp. 1129–1136.
  • [28] E. Bourtsoulatze, D. B. Kurka, and D. Gündüz, “Deep joint source-channel coding for wireless image transmission,” IEEE Transactions on Cognitive Communications and Networking, vol. 5, no. 3, pp. 567–579, May 2019.
  • [29] Y. M. Saidutta, A. Abdi, and F. Fekri, “Joint source-channel coding over additive noise analog channels using mixture of variational autoencoders,” IEEE Journal on Selected Areas in Communications, vol. 39, no. 7, pp. 2000–2013, May 2021.
  • [30] A. Jalal, M. Arvinte, G. Daras, E. Price, A. G. Dimakis, and J. Tamir, “Robust compressed sensing mri with deep generative priors,” Advances in Neural Information Processing Systems, vol. 34, pp. 14 938–14 954, Dec. 2021.
  • [31] M. Mardani et al., “Deep generative adversarial neural networks for compressive sensing mri,” IEEE Transactions on Medical Imaging, vol. 38, no. 1, pp. 167–179, Jul. 2018.
  • [32] U. Myn, M. Link, and M. Awinda, “Xsens mvn user manual,” Movella, 2021, accessed: Sep. 2023. [Online]. Available: https://www.xsens.com/hubfs/Downloads/usermanual/MVN_User_Manual.pdf
  • [33] “Cc2591 2.4-ghz rf front end,” Texas Instruments, 2014, accessed: Sep. 2023. [Online]. Available: https://www.ti.com/lit/ds/symlink/cc2591.pdf?ts=1694948601537&ref_url=https%253A%252F%252Fwww.google.com%252F
  • [34] D. Berthelot, C. Raffel, A. Roy, and I. Goodfellow, “Understanding and improving interpolation in autoencoders via an adversarial regularizer,” in International Conference on Learning Representations, May 2019.
Nguyen Quang Hieu received the B.E. degree from Hanoi University of Science and Technology, Vietnam, in 2018. He is currently a Ph.D. student at the School of Electrical and Data Engineering, University of Technology Sydney (UTS), Australia. Before joining UTS, he was a research assistant at the School of Computer Science and Engineering, Nanyang Technological University, Singapore. His research interests include wireless communications and machine learning.
Dinh Thai Hoang (M’16, SM’22) is currently a faculty member at the School of Electrical and Data Engineering, University of Technology Sydney, Australia. He received his Ph.D. in Computer Science and Engineering from Nanyang Technological University, Singapore, in 2016. His research interests include emerging wireless communications and networking topics, especially machine learning applications in networking, edge computing, and cybersecurity. He has received several prestigious awards, including the Australian Research Council Discovery Early Career Researcher Award, IEEE TCSC Award for Excellence in Scalable Computing for Contributions on “Intelligent Mobile Edge Computing Systems” (Early Career Researcher), IEEE Asia-Pacific Board (APB) Outstanding Paper Award 2022, and IEEE Communications Society Best Survey Paper Award 2023. He is currently an Editor of IEEE TMC, IEEE TWC, IEEE TCCN, IEEE TVT, and IEEE COMST.
Diep N. Nguyen (Senior Member, IEEE) received the M.E. degree in electrical and computer engineering from the University of California at San Diego (UCSD), La Jolla, CA, USA, in 2008, and the Ph.D. degree in electrical and computer engineering from The University of Arizona (UA), Tucson, AZ, USA, in 2013. He is currently the Head of the 5G/6G Wireless Communications and Networking Lab and Director of the Agile Communications and Computing group, Faculty of Engineering and Information Technology, University of Technology Sydney (UTS), Sydney, NSW, Australia. Before joining UTS, he was a DECRA Research Fellow with Macquarie University, Macquarie Park, NSW, Australia, and a Member of the Technical Staff with Broadcom Corporation, CA, USA, and ARCON Corporation, Boston, MA, USA. He has also consulted for the Federal Administration of Aviation, Washington, DC, USA, on turning detection of UAVs and aircraft, and for the U.S. Air Force Research Laboratory, USA, on anti-jamming. His research interests include computer networking, wireless communications, and machine learning applications, with emphasis on system performance and security/privacy. Dr. Nguyen received several awards from LG Electronics, UCSD, UA, the U.S. National Science Foundation, and the Australian Research Council. He has served on the Editorial Boards of the IEEE Transactions on Mobile Computing, IEEE Communications Surveys & Tutorials (COMST), IEEE Open Journal of the Communications Society, and Scientific Reports (Nature’s).
Mohammad Abu Alsheikh (Senior Member, IEEE) received the B.Eng. degree in computer systems from Birzeit University, Palestine. He worked as a Software Engineer at a digital advertising start-up and Cisco. Previously, he was a Postdoctoral Researcher with the Massachusetts Institute of Technology (MIT), USA. He is currently an Associate Professor and an ARC DECRA Fellow with the University of Canberra (UC), ACT, Australia. He designs and creates novel privacy-preserving Internet of Things systems that leverage both machine learning and convex optimization with applications in people-centric sensing, human activity recognition, and smart cities. His Ph.D. research at Nanyang Technological University (NTU), Singapore, focused on optimizing data collection in wireless sensor networks.