CASTER: A Computer-Vision-Assisted Wireless Channel Simulator for Gesture Recognition

Zhenyu Ren, Guoliang Li, Chenqing Ji, Shuai Wang, Chao Yu, Rui Wang

Abstract

In this paper, a computer-vision-assisted simulation method is proposed to address the issue of training dataset acquisition for wireless hand gesture recognition. In the existing literature, in order to classify gestures via the wireless channel estimation, massive training samples should be measured in a consistent environment, consuming significant efforts. In the proposed CASTER simulator, however, the training dataset can be simulated via existing videos. Particularly, in the channel simulation, a gesture is represented by a sequence of snapshots, and the channel impulse response of each snapshot is calculated via tracing the rays scattered off a primitive-based hand model. Moreover, CASTER simulator relies on the existing video clips to extract the motion data of gestures. Thus, the massive measurements of wireless channel can be eliminated. The experiments first demonstrate an $83.0\%$ average recognition accuracy of simulation-to-reality inference in recognizing $5$ categories of gestures. Moreover, this accuracy can be boosted to $96.5\%$ via the method of transfer learning.

Index Terms:

Wireless hand gesture recognition, channel model, simulation-to-reality inference.

Figure 1: Illustration of primitive-based hand model and channel simulation scenario.

I Introduction

Sensing is becoming one of the core services of the next-generation wireless systems. There have been a significant number of works on wireless sensing, particularly the machine-learning-based human motion recognition (HMR), via channel state information (CSI)[1, 2, 3, 4] or passive architecture[5, 6, 7]. In most of these works, a significant number of labeled wireless signals should be collected and processed for the training of motion recognition models, which might be infeasible in many applications. In this paper, we would like to show that it is possible to generate the above training dataset for hand gesture recognition via channel simulation, instead of real measurement.

In fact, there have been a number of works on the extension of sophisticated communication channel models, such that the effects of sensing target on the channel impulse response are incorporated. Hence, the channel simulation based on these models might be used for motion recognition. For instance, the Data-Driven Hybrid Channel (DAHC) model of IEEE 802.11bf specification [8, 9] divided wireless channel into two parts: the target-unrelated components and the target-related components. The existing methods of communication channel modeling can be applied on the former; whereas the primitive-based human body model [10] was utilized to compute the latter. A similar channel model was also used in [11] for the optimization of communication and sensing performance. The WiGig Tools [12], developed by National Institute of Standards and Technology (NIST), enriched existing quasi-deterministic channel ray-tracers with supplementary target-related rays (T-Rays), such that the consistent effects of human motion could be included. Moreover, the methods for simulating radar echo signals off the human body were proposed in [13, 14]. All the above works relied on the primitive-based human body model [10, 15], where the hand was modeled as a single ellipsoid. Thus, these methods cannot model fine-grained hand gestures.

In order to facilitate the machine-learning-based HMR with the above channel models, diversified motion data are required to drive the primitive-based human body model in the channel simulation. Depth cameras and wearable sensors were used in [13, 14] to obtain sufficient body motion data for channel simulation. Nevertheless, to the best of our knowledge, there is no study on the capture of hand gestures for channel simulation. Moreover, it is unknown if conventional monocular cameras, instead of depth cameras, could obtain the motion data with adequate accuracy in the applications of wireless HMR. Note that the monocular cameras are of lower cost, and it is much more convenient to obtain hand gesture video clips of monocular cameras from online sources.

In this paper, we would like to shed some light on the above issues by proposing a Computer-vision-Assisted wireless channel SimulaTor for gEsture Recognition, namely CASTER. The proposed CASTER simulator is composed of channel generator and video gesture catcher. In the channel generator, the target hand is modeled with $21$ primitives, and the channel impulse response is calculated by tracing the rays scattered off all the primitives. Based on the hand model, a gesture is represented by a sequence of snapshots, and the channel impulse responses for all the snapshots can be obtained respectively. In the video gesture catcher, trajectories of $21$ primitives in one gesture can be captured from videos of a conventional monocular camera. Thus, the catcher provides an efficient way to retrieve motion data for the channel generator. In order to demonstrate the high fidelity of the proposed CASTER simulator, we use the simulated dataset of channel impulse responses to train a gesture recognition model and use a passive sensing system[7] to measure the real channel for model testing. It is shown that an $83.0\%$ average recognition accuracy of simulation-to-reality inference can be achieved by recognizing $5$ categories of gestures. Moreover, this accuracy can be boosted to $96.5\%$ via the method of transfer learning[16], where the gesture recognition model trained via a simulated dataset is further fine-tuned with a small amount of unlabeled real measurements according to the adversarial discriminative domain adaptation (ADDA) method in [17]. The main advantages of the proposed CASTER simulator are summarized below:

• Conventional measurement of the training dataset for wireless HMR is replaced by channel simulation and gesture video recognition, saving the significant cost of real experiments.

• In the proposed CASTER simulator, the locations of the signal transmitter, sensing receiver, target hand, and scattering clusters can be adjusted freely to adapt to heterogeneous sensing scenarios.

As a result, the proposed CASTER simulator has the potential to customize the gesture recognition models for heterogeneous scenarios without real measurements.

The remainder of this paper is organized as follows. The simulator framework is elaborated in Section II. The channel generator is presented in Section III, and the video gesture catcher is presented in Section IV. The performance of the CASTER simulator is evaluated in Section V. Finally, the conclusion is drawn in Section VI.

In this paper, we use the following notations: non-bold letters are used to denote scalar values, bold lowercase letters (e.g., $\mathbf{a}$ ) are used to denote column vectors, bold uppercase letters (e.g., $\mathbf{A}$ ) are used to denote matrices, $|{\mathbf{a}}|$ and $\mathbf{a}^{T}$ denote the L2-norm and transpose of vector $\mathbf{a}$ .

II Simulator Framework

The proposed CASTER simulator is developed with the primitive-based hand model. In order to extract high-fidelity channel impulse responses from existing videos, the CASTER simulator is composed of the channel generator and video gesture catcher. The former generates a sequence of channel impulse response snapshots given arbitrary hand gestures and arbitrary locations of the transmitter and receiver. The latter captures the parameters of real hand motions from existing videos as the former’s input. As a result, the CASTER simulator is able to provide datasets for the training of the hand gesture recognition model without real channel measurement.

As depicted in Fig. 1, the locations of the transmitter, receiver and the target hand can be arbitrary in the channel generator. A gesture is represented as a sequence of snapshots, with an interval of $\Delta t_{\text{s}}$ seconds. In each snapshot, the channel is assumed to be quasi-static, and the channel impulse response is calculated via the primitive-based method [10]. Particularly, the hand is modeled via $21$ keypoints (joints) and $21$ ellipsoids (primitives) connecting the keypoints. The non-line-of-sight (NLoS) channel components via the hand can be approximated by the $21$ rays respectively scattered off the centers of all primitives. Hence, the channel impulse response of one snapshot can be obtained by aggregating all the rays from the transmitter to the receiver, including the line-of-sight (LoS) ray, the NLoS ones scattered off the target hand, and the others scattered at the environment.

As a remark, the $21$-keypoint hand model is widely recognized in the fields of computer vision and biomedical engineering [18]. Renowned hand models, such as OpenPose [19], Mediapipe [20], and MANO [21], are all based on this $21$-keypoint representation. It provides the same degrees of freedom in describing complex hand and finger motions as explained in [18]: a human hand consists of $21$ joints, yielding $27$ degrees of freedom, which is the same number as that of the $21$-keypoint hand model.

Moreover, the proposed video gesture catcher first extracts the 3-dimensional (3D) coordinates of hand keypoints from each video frame in a local hand world coordinate system via a machine learning technique, then converts the trajectories of the keypoints from the local hand world coordinate system to a global camera coordinate system, and finally eliminates the fake hops and jitters of the trajectories via low-pass filtering. Since the interval between two video frames, denoted as $\Delta t_{\text{v}}$, is usually much larger than $\Delta t_{\text{s}}$, an interpolation is then necessary to fill a sufficient number of snapshots between two video frames. As a remark, the video clips for the gesture catcher can be recorded in an arbitrary environment as long as the desired hand gestures can be identified by the gesture catcher. Hence, they could be obtained from massive online sources.

III Channel Generator

Without loss of generality, the generation of channel impulse response for the $t$ -th snapshot ( $\forall t$ ) is elaborated in this section. As shown in Fig. 1, the rays from the transmitter to the receiver can be categorized into two parts: target-unrelated components and target-related components. The former refers to the LoS ray and the NLoS rays scattered at the static environment, and the latter refers to the NLoS rays scattered off the target hand. Particularly, let $h(\tau,t)$ and $u(\tau,t)$ be the overall channel impulse response and target-related channel impulse response of the $t$ -th snapshot, $v(\tau)$ be time-invariant target-unrelated channel impulse response. Following the channel model in [9], we have

\displaystyle h(\tau,t)=u(\tau,t)+v(\tau),

(1)

where the generation of $u(\tau,t)$ and $v(\tau)$ is elaborated in the following parts respectively.

III-A Target-Related Channel Components

Let $\mathbf{p}_{\text{t}}$ and $\mathbf{p}_{\text{r}}$ be the coordinates of the transmitter and the receiver respectively, and let $\mathbf{p}_{i}(t)$ and $\mathbf{p}_{j}(t)$ be the coordinates of the two joints associated with the $n$-th primitive in the $t$-th snapshot ($\forall n,t$). Hence, the center of the $n$-th primitive is $\mathbf{p}^{\text{c}}_{n}(t)=[\mathbf{p}_{i}(t)+\mathbf{p}_{j}(t)]/2$. As previously mentioned, each primitive is modeled as an ellipsoid. The length of the axis connecting the two joints is denoted as $2l_{n}(t)$, where

l_{n}(t)=|\mathbf{p}_{i}(t)-\mathbf{p}_{j}(t)|/2.

(2)

Moreover, the lengths of the other two axes are identical, denoted as $2r_{n}(t)$. Usually, $r_{n}(t)<l_{n}(t)$, and we choose $r_{n}(t)=l_{n}(t)/2$. Hence, we shall refer to the axis connecting the two joints as the long axis of the ellipsoid. As a remark, note that the primitive size ($r_{n}$ and $l_{n}$) varies slightly over time due to the non-rigid nature of human motion.

Let $R_{\text{t}}^{n}(t)=|\mathbf{p}_{\text{t}}-\mathbf{p}^{\text{c}}_{n}(t)|$ be the distance between the transmitter and the $n$ -th primitive center, $R_{\text{r}}^{n}(t)=|\mathbf{p}_{\text{r}}-\mathbf{p}^{\text{c}}_{n}(t)|$ be the distance between the receiver and the $n$ -th primitive center, $G_{\text{t}}^{n}(t)$ and $G_{\text{r}}^{n}(t)$ be the transmit and receive antenna gains at the directions of incident ray $\mathbf{p}_{\text{t}}-\mathbf{p}^{\text{c}}_{n}(t)$ and scattered ray $\mathbf{p}^{\text{c}}_{n}(t)-\mathbf{p}_{\text{r}}$ , $\sigma_{n}(t)$ be the bistatic radar cross section (RCS) of the $n$ -th primitive, $c$ be the speed of light, $f_{\text{c}}$ and $\lambda$ be the carrier frequency and wavelength respectively. The response of the path scattered off the $n$ -th primitive can be expressed as

\displaystyle u_{n}(\tau,t)=\lambda\sqrt{\frac{\sigma_{n}(t)G^{n}_{\text{t}}(t)G^{n}_{\text{r}}(t)}{(4\pi)^{3}(R_{\text{t}}^{n}(t)R_{\text{r}}^{n}(t))^{2}}}e^{-\mathrm{j}\phi_{n}(t)}\delta(\tau-\tau_{n}(t)),

(3)

where $\delta(a)$ is the impulse function, whose value is $1$ when $a=0$ and $0$ otherwise, while $\tau_{n}(t)=\left[R_{\text{t}}^{n}(t)+R_{\text{r}}^{n}(t)\right]/c$ and $\phi_{n}(t)=2\pi f_{\text{c}}\tau_{n}(t)$ measure the delay and phase shift.
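As an illustration of (3), the following minimal Python sketch evaluates the complex coefficient and delay of the ray scattered off one primitive center. The function name `scattered_path`, the default isotropic antenna gains, and the example coordinates are assumptions made for illustration only; they are not part of the CASTER implementation.

```python
import cmath
import math

C = 299_792_458.0  # speed of light in m/s


def scattered_path(p_t, p_r, p_c, sigma, f_c, g_t=1.0, g_r=1.0):
    """Return (complex gain, delay) of one ray scattered at p_c, following (3)."""
    r_t = math.dist(p_t, p_c)          # transmitter-to-primitive distance
    r_r = math.dist(p_r, p_c)          # receiver-to-primitive distance
    lam = C / f_c                      # wavelength
    tau = (r_t + r_r) / C              # propagation delay of the scattered ray
    phi = 2.0 * math.pi * f_c * tau    # phase shift accumulated over the delay
    amp = lam * math.sqrt(
        sigma * g_t * g_r / ((4.0 * math.pi) ** 3 * (r_t * r_r) ** 2)
    )
    return amp * cmath.exp(-1j * phi), tau


# Hypothetical geometry: RCS of 1e-4 m^2 at a 60.48 GHz carrier.
gain, delay = scattered_path(
    (0.0, 0.0, -1.5), (0.2, -0.1, 0.1), (0.0, 0.0, 0.5), 1e-4, 60.48e9
)
```

Summing such terms over the $21$ primitives, each with its own RCS and center, gives the target-related response of one snapshot.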

Moreover, the calculation of the bistatic RCS $\sigma_{n}(t)$ follows the method in [22, 23]. As depicted in Fig. 2, let $\theta^{n}_{\text{t}}(t)$ and $\theta^{n}_{\text{r}}(t)$ represent the incident and scattered elevation angles respectively, $\phi^{n}_{\text{t}}(t)$ and $\phi^{n}_{\text{r}}(t)$ represent the incident and scattered azimuth angles respectively, $\boldsymbol{v}_{n}(t)=[\mathbf{p}_{i}(t)-\mathbf{p}_{j}(t)]/(2l_{n}(t))$ represent the normalized vector along the long axis, we have

\displaystyle\theta_{\text{t}}^{n}(t)=\arccos\left((\mathbf{p}^{\text{c}}_{n}(t)-\mathbf{p}_{\text{t}})^{T}\boldsymbol{v}_{n}(t)/R^{n}_{\text{t}}(t)\right),

(4)

\displaystyle\theta_{\text{r}}^{n}(t)=\arccos\left((\mathbf{p}^{\text{c}}_{n}(t)-\mathbf{p}_{\text{r}})^{T}\boldsymbol{v}_{n}(t)/R^{n}_{\text{r}}(t)\right),

(5)

and

\displaystyle|\phi^{n}_{\text{r}}(t)-\phi^{n}_{\text{t}}(t)|=\arccos\left(\frac{(\mathbf{p}^{\text{c}}_{n}(t)-\tilde{\mathbf{p}}_{\text{t}}(t))^{T}(\mathbf{p}^{\text{c}}_{n}(t)-\tilde{\mathbf{p}}_{\text{r}}(t))}{|\mathbf{p}^{\text{c}}_{n}(t)-\tilde{\mathbf{p}}_{\text{t}}(t)||\mathbf{p}^{\text{c}}_{n}(t)-\tilde{\mathbf{p}}_{\text{r}}(t)|}\right),

(6)

where

\tilde{\mathbf{p}}_{\text{t}}(t)=\mathbf{p}_{\text{t}}-\boldsymbol{v}_{n}(t)(\mathbf{p}_{\text{t}}-\mathbf{p}^{\text{c}}_{n}(t))^{T}\boldsymbol{v}_{n}(t)

and

\tilde{\mathbf{p}}_{\text{r}}(t)=\mathbf{p}_{\text{r}}-\boldsymbol{v}_{n}(t)(\mathbf{p}_{\text{r}}-\mathbf{p}^{\text{c}}_{n}(t))^{T}\boldsymbol{v}_{n}(t)

denote the projections of the transmitter’s and receiver’s locations onto the plane containing the center of the $n$-th ellipsoid and perpendicular to its long axis in the $t$-th snapshot. As a result, the bistatic RCS $\sigma_{n}(t)$ of the $n$-th ellipsoid in the $t$-th snapshot is given by (7).

\sigma_{n}(t)=\frac{4\pi r_{n}^{4}(t)l_{n}^{2}(t)[(1+\cos\theta^{n}_{\text{t}}(t)\cos\theta^{n}_{\text{r}}(t))\cos(\phi^{n}_{\text{r}}(t)-\phi^{n}_{\text{t}}(t))+\sin\theta^{n}_{\text{t}}(t)\sin\theta^{n}_{\text{r}}(t)]^{2}}{[r_{n}^{2}(t)(\sin^{2}\theta^{n}_{\text{t}}(t)+\sin^{2}\theta^{n}_{\text{r}}(t)+2\sin\theta^{n}_{\text{t}}(t)\sin\theta^{n}_{\text{r}}(t)\cos(\phi^{n}_{\text{r}}(t)-\phi^{n}_{\text{t}}(t)))+l_{n}^{2}(t)(\cos\theta^{n}_{\text{t}}(t)+\cos\theta^{n}_{\text{r}}(t))^{2}]^{2}}.

(7)
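The chain from joint coordinates to the bistatic RCS in (2) and (4)-(7) can be sketched as follows. This is a minimal sketch under stated assumptions: the function and helper names are hypothetical, $r_{n}=l_{n}/2$ is hard-coded as chosen in the text, and degenerate geometries (a transmitter or receiver lying exactly on the long axis, or a vanishing denominator in (7)) are not handled.

```python
import math


def _sub(a, b):
    return [x - y for x, y in zip(a, b)]


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _angle(a, b):
    """Angle between vectors a and b, with the cosine clipped against rounding."""
    cos = _dot(a, b) / (math.hypot(*a) * math.hypot(*b))
    return math.acos(max(-1.0, min(1.0, cos)))


def ellipsoid_rcs(p_i, p_j, p_t, p_r):
    """Bistatic RCS of the primitive between joints p_i and p_j, per (2)-(7)."""
    axis = _sub(p_i, p_j)
    l = math.hypot(*axis) / 2.0                      # (2)
    r = l / 2.0                                      # choice r_n = l_n / 2
    v = [a / (2.0 * l) for a in axis]                # unit vector, long axis
    c = [(a + b) / 2.0 for a, b in zip(p_i, p_j)]    # primitive center
    th_t = _angle(_sub(c, p_t), v)                   # (4)
    th_r = _angle(_sub(c, p_r), v)                   # (5)
    # Project p_t and p_r onto the plane through c perpendicular to v.
    q_t = _sub(p_t, [_dot(_sub(p_t, c), v) * a for a in v])
    q_r = _sub(p_r, [_dot(_sub(p_r, c), v) * a for a in v])
    dphi = _angle(_sub(c, q_t), _sub(c, q_r))        # (6)
    st, sr, ct, cr = math.sin(th_t), math.sin(th_r), math.cos(th_t), math.cos(th_r)
    num = 4.0 * math.pi * r**4 * l**2 * ((1.0 + ct * cr) * math.cos(dphi) + st * sr) ** 2
    den = (r**2 * (st**2 + sr**2 + 2.0 * st * sr * math.cos(dphi))
           + l**2 * (ct + cr) ** 2) ** 2
    return num / den                                 # (7)
```

As a sanity check, for a co-located transmitter and receiver looking at the primitive broadside, (7) reduces to $\pi l_{n}^{2}$, which the sketch reproduces.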

Aggregating the NLoS rays scattered off all the primitives, the target-related channel impulse response can be written as

\displaystyle u(\tau,t)=\sum_{n=1}^{21}u_{n}(\tau,t).

(8)

III-B Target-Unrelated Channel Components

The CASTER simulator models the environment by $K$ static scatterers. Let the RCS, the transmit and receive antenna gains, and the distances from the transmitter and the receiver to the scatterer of the $k$-th NLoS ray be $\sigma_{k}$, $G_{\text{t}}^{k}$, $G_{\text{r}}^{k}$, $R_{\text{t}}^{k}$, and $R_{\text{r}}^{k}$, respectively. The NLoS components of the target-unrelated channel impulse response can be written as

\displaystyle v_{\text{NLoS}}(\tau)=\sum_{k=1}^{K}\lambda\sqrt{\frac{\sigma_{k}G_{\text{t}}^{k}G_{\text{r}}^{k}}{(4\pi)^{3}(R_{\text{t}}^{k}R_{\text{r}}^{k})^{2}}}e^{-\mathrm{j}\phi_{k}}\delta(\tau-\tau_{k}),

(9)

where $\tau_{k}=\left(R^{k}_{\text{t}}+R^{k}_{\text{r}}\right)/c$ and $\phi_{k}=2\pi f_{\text{c}}\tau_{k}$ .

Moreover, let the transmit and receive antenna gains in the direction of the LoS path be $G_{\text{t, LoS}}$ and $G_{\text{r, LoS}}$, and the distance between the transmitter and receiver be $R_{\text{LoS}}$. The LoS component of the target-unrelated channel is modeled via the following free-space model:

\displaystyle v_{\text{LoS}}(\tau)=\frac{\lambda\sqrt{G_{\text{t, LoS}}G_{\text{r, LoS}}}}{4\pi R_{\text{LoS}}}e^{-\mathrm{j}\phi_{\text{LoS}}}\delta(\tau-\tau_{\text{LoS}}),

(10)

where $\tau_{\text{LoS}}=R_{\text{LoS}}/c$ and $\phi_{\text{LoS}}=2\pi f_{\text{c}}\tau_{\text{LoS}}$ . As a result, according to [9], the target-unrelated channel impulse response can be written as

\displaystyle v(\tau)=v_{\text{LoS}}(\tau)+v_{\text{NLoS}}(\tau).

(11)
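A minimal sketch of how the taps of $v(\tau)$ in (9)-(11) could be assembled is given below. The function name `static_taps`, the representation of each scatterer as a (position, RCS) pair, and the unit (isotropic) antenna gains are assumptions for illustration, not details confirmed by the text.

```python
import cmath
import math

C = 299_792_458.0  # speed of light in m/s


def static_taps(p_t, p_r, scatterers, f_c):
    """Return the (complex gain, delay) taps of v(tau), with unit antenna gains."""
    lam = C / f_c
    r_los = math.dist(p_t, p_r)
    tau = r_los / C
    # LoS tap, following the free-space model (10).
    taps = [(lam / (4.0 * math.pi * r_los) * cmath.exp(-2j * math.pi * f_c * tau), tau)]
    # One NLoS tap per static scatterer, following (9).
    for pos, sigma in scatterers:
        r_t, r_r = math.dist(p_t, pos), math.dist(p_r, pos)
        tau = (r_t + r_r) / C
        amp = lam * math.sqrt(sigma / ((4.0 * math.pi) ** 3 * (r_t * r_r) ** 2))
        taps.append((amp * cmath.exp(-2j * math.pi * f_c * tau), tau))
    return taps


# Hypothetical scene: one scatterer with an RCS of 0.005 m^2.
taps = static_taps((0.0, 0.0, 0.0), (0.0, 0.0, 2.0), [((1.0, 0.0, 1.0), 0.005)], 60.48e9)
```

Because these taps do not depend on the snapshot index, they only need to be computed once per simulated scene.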

IV Video Gesture Catcher

As mentioned in the previous section, the motion of the target hand is characterized by the trajectories of the $21$ keypoints in a sequence of snapshots, denoted as $\mathbf{p}_{i}(t),i=1,2,...,21$. We leverage Mediapipe [20] to extract the keypoint trajectories from videos of monocular cameras, and two issues in this conversion are addressed in this section. First, Mediapipe localizes the keypoints in each video frame in a coordinate system with the origin at the hand center, namely the hand world coordinate system. However, it is difficult to calculate the Doppler frequency in such a coordinate system, as the hand center is moving. Hence, we first transfer the coordinates to a unified coordinate system by solving the Perspective-n-Point (PnP) problem [24], after which the fake hops on the trajectories are smoothed. Second, there are usually only $30$ video frames per second, which is not sufficient for estimating the Doppler frequencies of gestures. For example, the typical Doppler frequencies of gestures on $60$ GHz signals are around $800$ Hz (assuming a maximum radial velocity of $2$ meters per second), which requires at least $1600$ snapshots per second. Hence, interpolation is introduced such that the channel impulse response can be generated with a shorter interval.
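As a rough check of the orders of magnitude involved, the sketch below evaluates an upper bound on the Doppler shift and the corresponding Nyquist snapshot rate. The bound $2vf_{\text{c}}/c$ is an assumption made here: it corresponds to both the transmitter-to-hand and hand-to-receiver path lengths changing at the radial speed $v$, which follows from the phase $\phi_{n}(t)=2\pi f_{\text{c}}\tau_{n}(t)$ in (3). The function names are hypothetical.

```python
C = 299_792_458.0  # speed of light in m/s


def max_doppler_hz(radial_speed, f_c):
    """Upper bound on the Doppler shift when both path segments change at radial_speed."""
    return 2.0 * radial_speed * f_c / C


def min_snapshot_rate(radial_speed, f_c):
    """Nyquist rate (snapshots per second) needed to observe shifts up to the bound."""
    return 2.0 * max_doppler_hz(radial_speed, f_c)


f_d = max_doppler_hz(2.0, 60.48e9)       # about 807 Hz at 2 m/s and 60.48 GHz
rate = min_snapshot_rate(2.0, 60.48e9)   # about 1614 snapshots per second
```

Under this assumption, a radial speed of $2$ meters per second at the $60$ GHz band gives a shift of roughly $800$ Hz, far beyond what a $30$ fps video can resolve without interpolation.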

IV-A Conversion of Coordinate Systems

For convenience of elaboration, we first introduce the following three coordinate systems. The two-dimensional (2D) pixel coordinate system in the unit of pixels is used to identify the positions of hand keypoints in each video frame. The origin of the pixel coordinate system is usually at the upper left corner of each frame, as shown in Fig. 3. The three-dimensional (3D) hand world coordinate system in the unit of meters measures the positions of hand keypoints in the real world with respect to the hand center. Moreover, the 3D camera coordinate system in the unit of meters measures the positions of hand keypoints with respect to the static camera lens, which captures the videos. Mediapipe is able to identify the $21$ keypoints and localize them in the first two coordinate systems. Because the hand center is usually in motion and the camera is static, the trajectories in the camera coordinate system, instead of those in the hand world coordinate system, should be used to calculate the Doppler frequencies. Thus, the coordinates of the hand keypoints $\mathbf{p}_{i}(t)$, $i=1,2,...,21$, the transmitter $\mathbf{p}_{\text{t}}$, and the receiver $\mathbf{p}_{\text{r}}$, defined in the previous section, should be measured in the camera coordinate system. The above three coordinate systems are illustrated in Fig. 3.

Define the coordinates of the $i$ -th keypoint ( $i=1,2,...,21$ ) in the pixel, hand world and camera coordinate systems as $(u_{i},v_{i})$ , $(x_{i}^{\text{w}},y_{i}^{\text{w}},z_{i}^{\text{w}})$ , and $(x_{i},y_{i},z_{i})$ , respectively, where the snapshot index $t$ is ignored in this section for the simplicity of elaboration. Let $f$ be the focal length in the unit of pixels, $(c_{x},c_{y})$ be the coordinates of image center in the pixel coordinate system, we define the camera intrinsic matrix $\mathbf{A}$ as

\displaystyle\mathbf{A}=\begin{bmatrix}f&0&c_{x}\\ 0&f&c_{y}\\ 0&0&1\end{bmatrix}.

(12)

Hence, the relation between the 2D pixel and 3D camera coordinate systems can be expressed as

\displaystyle z_{i}[u_{i}\ v_{i}\ 1]^{T}=\mathbf{A}[x_{i}\ y_{i}\ z_{i}]^{T}.

(13)

Let $\mathbf{R}\in\mathbb{R}^{3\times 3}$ and $\mathbf{t}$ be the rotation matrix and translation vector from hand world coordinate system to camera coordinate system, we define the camera extrinsic matrix $\mathbf{T}$ and perspective projection matrix $\mathbf{\Pi}$ as follows:

\displaystyle\mathbf{T}=\begin{bmatrix}\mathbf{R}&\mathbf{t}\\ \mathbf{0}_{1\times 3}&1\end{bmatrix},

(14)

\displaystyle\mathbf{\Pi}=\begin{bmatrix}\mathbf{I}_{3\times 3}&\mathbf{0}_{3\times 1}\end{bmatrix},

(15)

where $\mathbf{I}_{3\times 3}$ denotes a $3\times 3$ identity matrix, and $\mathbf{0}_{1\times 3}$ and $\mathbf{0}_{3\times 1}$ are the three-dimensional all-zero row and column vectors respectively. According to [24], the relation between the hand world and camera coordinate systems is given by

\displaystyle[x_{i}\ y_{i}\ z_{i}\ 1]^{T}=\mathbf{T}[x_{i}^{\text{w}}\ y_{i}^{\text{w}}\ z_{i}^{\text{w}}\ 1]^{T}.

(16)

As a result, the relation between the hand world and the pixel coordinate system could be described as

\displaystyle z_{i}[u_{i}\ v_{i}\ 1]^{T}=\mathbf{A}\mathbf{\Pi}\mathbf{T}[x_{i}^{\text{w}}\ y_{i}^{\text{w}}\ z_{i}^{\text{w}}\ 1]^{T}.

(17)

For the elaboration convenience, we denote the projection from the hand world coordinate system to the pixel coordinate system as the following function $\mathcal{P}$ :

\displaystyle[u_{i}\ v_{i}]^{T}=\mathcal{P}([x_{i}^{\text{w}}\ y_{i}^{\text{w}}\ z_{i}^{\text{w}}]^{T},\mathbf{R},\mathbf{t},\mathbf{A})=\frac{1}{z_{i}}[\mathbf{I}_{2\times 2}\ \mathbf{0}_{2\times 1}]\mathbf{A}\underbrace{(\mathbf{R}[x_{i}^{\text{w}}\ y_{i}^{\text{w}}\ z_{i}^{\text{w}}]^{T}+\mathbf{t})}_{=[x_{i}\ y_{i}\ z_{i}]^{T}}.

(18)

The Mediapipe could provide the coordinates $(u_{i},v_{i})$ and $(x_{i}^{\text{w}},y_{i}^{\text{w}},z_{i}^{\text{w}})$ of all the keypoints ( $i=1,2,...,21$ ) in each video frame. Hence, their coordinates in the camera coordinate system can be calculated with the knowledge of the rotation matrix $\mathbf{R}$ and translation vector $\mathbf{t}$ .
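The projection $\mathcal{P}$ from the hand world coordinate system to the pixel coordinate system can be sketched in a few lines of Python. The function name `project` and the numerical values in the example are hypothetical; the rotation matrix is passed as a nested list, and the intrinsic matrix $\mathbf{A}$ is represented by its entries $f$, $c_{x}$ and $c_{y}$.

```python
def project(p_w, R, t, f, c_x, c_y):
    """Map a hand-world point to pixel coordinates; also return camera coordinates."""
    # Rigid transform into the camera coordinate system: [x, y, z]^T = R p_w + t.
    x, y, z = (sum(R[r][k] * p_w[k] for k in range(3)) + t[r] for r in range(3))
    # Pinhole projection with A = [[f, 0, c_x], [0, f, c_y], [0, 0, 1]],
    # followed by the division by the depth z.
    return (f * x / z + c_x, f * y / z + c_y), (x, y, z)


# Hypothetical example: hand facing the camera at a depth of 2 m.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
pixel, camera = project((0.1, -0.05, 0.0), identity, (0.0, 0.0, 2.0), 1000.0, 640.0, 360.0)
```

Once $\mathbf{R}$ and $\mathbf{t}$ are known, the second return value directly gives the keypoint position in the camera coordinate system.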

In fact, the parameters in the camera intrinsic matrix $\mathbf{A}$ can be measured in advance, and the rotation matrix $\mathbf{R}$ and translation vector $\mathbf{t}$ can be estimated via (18) for $i=1,2,...,21$. Particularly, given the coordinates of the $21$ keypoints in the pixel and hand world coordinate systems, the estimation of the rotation matrix $\mathbf{R}$ and translation vector $\mathbf{t}$ can be formulated as follows.

\displaystyle\mathop{\mathrm{min}}_{\mathbf{R},\mathbf{t}}\quad\sum_{i=1}^{21}\|(u_{i},v_{i})-\mathcal{P}([x_{i}^{\text{w}}\ y_{i}^{\text{w}}\ z_{i}^{\text{w}}]^{T},\mathbf{R},\mathbf{t},\mathbf{A})\|^{2},\quad\mathrm{s.t.}\quad\mathbf{R}(\mathbf{R})^{T}=\mathbf{I}_{3\times 3},\ \mathrm{det}(\mathbf{R})=1,

(19)

where $\mathrm{det}(.)$ represents the determinant of a matrix.

The above problem is referred to as the Perspective-n-Point (PnP) problem [24]. It can be solved via the cv2.solvePnP function from the popular computer vision library OpenCV [25], where the Levenberg-Marquardt optimization method [26] is adopted.

IV-B Motion Smoothing and Snapshot Interpolation

Because of keypoint detection errors in Mediapipe, there might be fake hops or jitters in the detected keypoint trajectories that do not actually exist. These lead to false alarms of high Doppler frequencies (as depicted in Fig. 4). In order to generate a high-fidelity dataset for gesture recognition model training, a low-pass filter, namely the one-euro filter [27], is adopted to smooth both trajectories and velocities, followed by snapshot interpolation between neighboring video frames.

Let $\mathbf{q}_{i,k}=[x_{i,k}\ y_{i,k}\ z_{i,k}]^{T}$ and $\hat{\mathbf{q}}_{i,k}=[{\hat{x}}_{i,k}\ \hat{y}_{i,k}\ \hat{z}_{i,k}]^{T}$ be the positions of the $i$-th keypoint in the $k$-th frame before and after the low-pass filtering respectively, $\dot{\mathbf{q}}_{i,k}=[\dot{x}_{i,k}\ \dot{y}_{i,k}\ \dot{z}_{i,k}]^{T}$ and $\hat{\dot{\mathbf{q}}}_{i,k}=[\hat{\dot{x}}_{i,k}\ \hat{\dot{y}}_{i,k}\ \hat{\dot{z}}_{i,k}]^{T}$ be the estimated velocities of the $i$-th keypoint in the $k$-th frame before and after the low-pass filtering respectively. Initializing $\hat{\mathbf{q}}_{i,1}$ with $\mathbf{q}_{i,1}$, the trajectory smoothing for the $i$-th keypoint in the $k$-th frame is given by

\displaystyle\hat{o}_{i,k}=\alpha_{i,k}o_{i,k}+(1-\alpha_{i,k})\hat{o}_{i,k-1},\quad\forall i,k\geq 2

(20)

where the notation $o$ represents the dimensions of $x$ , $y$ and $z$ , respectively, and

\displaystyle\alpha_{i,k}=\frac{1}{1+\frac{1}{2\pi{\Delta t}_{\text{v}}(f_{\text{c}_{\text{min}}}+\beta|\hat{\dot{o}}_{i,k}|)}}

is the smoothing factor, ${\Delta t}_{\text{v}}$ is the video frame interval, $f_{\text{c}_{\text{min}}}$ is the minimum cutoff frequency, $\beta$ is the speed coefficient of update. Moreover, the velocity in the above equation can be calculated as

\displaystyle\hat{\dot{o}}_{i,k}=\gamma\dot{o}_{i,k}+(1-\gamma)\hat{\dot{o}}_{i,k-1},\quad\forall i,k\geq 2

(21)

where $\dot{o}_{i,k}=(o_{i,k}-\hat{o}_{i,k-1})/{\Delta t}_{\text{v}}$ , $\hat{\dot{o}}_{i,1}$ is initialized with $0$ , $\gamma$ is the fixed smoothing factor.

Algorithm 1 One-euro low-pass filter for keypoint trajectory smoothing.

1: Input:

• $\{\mathbf{q}_{i,k}=[x_{i,k}\ y_{i,k}\ z_{i,k}]^{T}|i\in\{1,\ldots,21\},k\in\{1,\ldots,K\}\}$, where $\mathbf{q}_{i,k}$ denotes the location of the $i$-th keypoint in the $k$-th frame.

• $f_{\text{c}_{\text{min}}}$: Minimum cutoff frequency for position.

• $\beta$: Speed coefficient.

• $\gamma$: Smoothing factor for velocity.

• $\Delta t_{\text{v}}$: Video frame interval.

2: Output:

• $\{\mathbf{\hat{q}}_{i,k}=[\hat{x}_{i,k}\ \hat{y}_{i,k}\ \hat{z}_{i,k}]^{T}|i\in\{1,\ldots,21\},k\in\{1,\ldots,K\}\}$, where $\mathbf{\hat{q}}_{i,k}$ denotes the location of the $i$-th keypoint in the $k$-th frame after smoothing.

3: for $k\leftarrow 2$ to $K$ do $\triangleright$ Iteration over frames.

4: for $i\leftarrow 1$ to $21$ do $\triangleright$ Iteration over keypoints.

5: for $o$ representing the dimensions of $x$, $y$, and $z$, respectively, do

6: $\hat{o}_{i,1}\leftarrow o_{i,1}$, $\hat{\dot{o}}_{i,1}\leftarrow 0$

7: $\dot{o}_{i,k}=(o_{i,k}-\hat{o}_{i,k-1})/{\Delta t}_{\text{v}}$

8: $\hat{\dot{o}}_{i,k}=\gamma\dot{o}_{i,k}+(1-\gamma)\hat{\dot{o}}_{i,k-1}$ $\triangleright$ Equation (21): smooth velocity.

9: $\alpha_{i,k}=\frac{1}{1+\frac{1}{2\pi{\Delta t}_{\text{v}}(f_{\text{c}_{\text{min}}}+\beta|\hat{\dot{o}}_{i,k}|)}}$ $\triangleright$ Update smoothing factor for position.

10: $\hat{o}_{i,k}=\alpha_{i,k}o_{i,k}+(1-\alpha_{i,k})\hat{o}_{i,k-1}$ $\triangleright$ Equation (20): smooth position.

11: end for

12: $\mathbf{\hat{q}}_{i,k}=[\hat{x}_{i,k}\ \hat{y}_{i,k}\ \hat{z}_{i,k}]^{T}$

13: end for

14: end for

15: return $\{\mathbf{\hat{q}}_{i,k}=[\hat{x}_{i,k}\ \hat{y}_{i,k}\ \hat{z}_{i,k}]^{T}|i\in\{1,\ldots,21\},k\in\{1,\ldots,K\}\}$

The overall smoothing procedure via the one-euro filter is illustrated in Alg. 1. In fact, the smoothing of the $i$-th keypoint’s velocity $\hat{\dot{o}}_{i,k}$ and trajectory $\hat{o}_{i,k}$ in the $k$-th frame is conducted by applying two first-order low-pass filters, (21) and (20), to the velocity and position of the $i$-th keypoint, respectively. This procedure effectively eliminates false hops or jitters in the detected keypoint trajectories while preserving the motion features. An example of the smoothing result is shown in Fig. 4.
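For one coordinate of one keypoint, the recursion in (20) and (21) can be sketched as below. The function name `one_euro_filter` and the parameter values in the example are hypothetical and are not the settings used in the experiments; the sketch assumes a strictly positive minimum cutoff frequency.

```python
import math


def one_euro_filter(samples, dt, f_c_min, beta, gamma):
    """Smooth one coordinate of one keypoint following (20) and (21)."""
    smoothed = [samples[0]]  # the first smoothed position equals the first sample
    vel_hat = 0.0            # the smoothed velocity is initialized with 0
    for o in samples[1:]:
        vel = (o - smoothed[-1]) / dt                    # raw velocity estimate
        vel_hat = gamma * vel + (1.0 - gamma) * vel_hat  # (21): smooth velocity
        cutoff = f_c_min + beta * abs(vel_hat)           # speed-adaptive cutoff
        alpha = 1.0 / (1.0 + 1.0 / (2.0 * math.pi * dt * cutoff))
        smoothed.append(alpha * o + (1.0 - alpha) * smoothed[-1])  # (20)
    return smoothed


# Hypothetical example: a single-frame hop in an otherwise static coordinate.
noisy = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0]
clean = one_euro_filter(noisy, dt=1.0 / 30.0, f_c_min=1.0, beta=0.01, gamma=0.5)
```

A larger $\beta$ raises the cutoff frequency during fast motion, so genuine quick movements are attenuated less than isolated hops at low speed.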

Finally, we adopt the cubic spline interpolation method [28] to insert $\Delta t_{\text{v}}/\Delta t_{\text{s}}-1$ positions of the $i$ -th keypoint ( $\forall i$ ) between every two neighboring frames (say $\hat{\mathbf{q}}_{i,k}$ and $\hat{\mathbf{q}}_{i,k+1}$ , $\forall k$ ), and denote the position of the $i$ -th keypoint in the $t$ -th snapshot as $\mathbf{p}_{i}(t)$ .
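A minimal sketch of this upsampling step for one coordinate is given below. It assumes uniformly spaced frames, natural boundary conditions (zero second derivative at both ends), and an integer ratio `factor` between $\Delta t_{\text{v}}$ and $\Delta t_{\text{s}}$; the text does not specify the boundary conditions, and the function names are hypothetical.

```python
def spline_second_derivatives(y, h):
    """Second derivatives of the natural cubic spline through uniformly spaced y."""
    n = len(y)
    m = [0.0] * n
    if n < 3:
        return m
    # Interior equations: m[k-1] + 4 m[k] + m[k+1] = rhs, solved by the Thomas algorithm.
    diag = [4.0] * (n - 2)
    rhs = [6.0 * (y[k - 1] - 2.0 * y[k] + y[k + 1]) / h**2 for k in range(1, n - 1)]
    for k in range(1, n - 2):          # forward elimination
        w = 1.0 / diag[k - 1]
        diag[k] -= w
        rhs[k] -= w * rhs[k - 1]
    for k in range(n - 3, -1, -1):     # back substitution, with m[n-1] = 0
        m[k + 1] = (rhs[k] - m[k + 2]) / diag[k]
    return m


def upsample(y, factor, h=1.0):
    """Insert factor - 1 spline-interpolated samples between neighboring frames."""
    m = spline_second_derivatives(y, h)
    out = []
    for k in range(len(y) - 1):
        for s in range(factor):
            b = s / factor
            a = 1.0 - b
            out.append(a * y[k] + b * y[k + 1]
                       + ((a**3 - a) * m[k] + (b**3 - b) * m[k + 1]) * h**2 / 6.0)
    out.append(y[-1])
    return out
```

The interpolated sequence passes exactly through the smoothed frame positions, so the filtered trajectory is preserved at the original frame instants.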

V Evaluation of CASTER Simulator

In this section, the high fidelity of the CASTER simulator in the applications of gesture recognition is demonstrated. Specifically, the generation of gesture datasets via CASTER simulator and real measurement is first elaborated. Then, the recognition performance via the above two datasets is discussed.

V-A Simulation and Experimental Datasets

In order to verify the quality of the dataset generated by CASTER simulator, $500$ clips of videos on $5$ gestures, including “Pushing and Pulling”, “Beckoning”, “Rubbing Fingers”, “Plugging” (slicing forward with all fingers together), and “Scaling” (spreading thumb, index finger, middle finger) were recorded using a normal monocular camera at a rate of $30$ frames per second (fps). The motion data for hand model is then extracted via the video gesture catcher.

On the other hand, in the channel generator, the locations of transmitter, receiver and target hand center are $[0m,-0.1m,-1.5m]$ , $[0.2m,-0.1m,0.1m]$ , and $[0m,0m,0.4\sim 0.8m]$ , respectively. Moreover, in order to model the target-unrelated channel, $K$ static RCSs are randomly generated from a normal distribution with a mean value of $0.005\,m^{2}$ and a standard deviation of $0.001\,m^{2}$ . These RCSs are associated with scatterers that are randomly located within a $2\,m\times 2\,m\times 2\,m$ cuboid centered at the receiver. The positions of these scatterers are used to calculate the associated parameters $G_{\text{t}}^{k}$ , $G_{\text{r}}^{k}$ , $R_{\text{t}}^{k}$ , and $R_{\text{r}}^{k}$ .

Thus, $100$ sequences of channel impulse responses for each gesture are obtained via the proposed CASTER simulator with a sampling rate of $2000$ snapshots per second. Then, one spectrogram, illustrating the Doppler frequency versus time, is calculated for each video clip (each sequence of channel impulse responses) by applying the short-time Fourier transform (STFT) with a window of $0.125$ seconds ( $250$ snapshots). As a result, a simulated dataset of $500$ spectrograms for the recognition of 5 gestures is obtained as illustrated in Fig. 5.
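To illustrate how a Doppler spectrogram can be obtained from a sequence of channel coefficients, the sketch below applies an STFT with the window length stated above ($250$ snapshots at $2000$ snapshots per second). The rectangular window, the hop size, and the synthetic single-path test signal are assumptions made for illustration; the text does not specify them.

```python
import cmath
import math


def stft_power(x, win_len, hop):
    """Power spectrogram of complex samples x using a rectangular window."""
    frames = []
    for start in range(0, len(x) - win_len + 1, hop):
        seg = x[start:start + win_len]
        frames.append([
            abs(sum(s * cmath.exp(-2j * math.pi * k * n / win_len)
                    for n, s in enumerate(seg))) ** 2
            for k in range(win_len)
        ])
    return frames


fs = 2000       # snapshots per second, as in the text
win_len = 250   # 0.125 s window, as in the text
# Hypothetical test signal: one path with a constant 400 Hz Doppler shift.
x = [cmath.exp(2j * math.pi * 400 * n / fs) for n in range(500)]
spec = stft_power(x, win_len, hop=125)
peak_bin = max(range(win_len), key=lambda k: spec[0][k])
peak_hz = peak_bin * fs / win_len
```

With these settings the frequency resolution is $2000/250=8$ Hz per bin, and the synthetic path appears as a ridge at $400$ Hz in every frame.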

In order to measure the real Doppler spectrum of gestures, an integrated passive sensing and communication system working on millimeter wave (mmWave) band is developed as in our previous work [7]. As illustrated in Fig. 6, at the transmitter, an NI USRP-2954R [29] is utilized to generate an intermediate frequency (IF) signal at $500$ MHz, which is subsequently up-converted to 60 GHz and transmitted using a Sivers $60$ GHz phased array[30]. At the receiver, two phased arrays are connected to a single USRP device to receive signals from the reference and surveillance channels, respectively. The transmit mmWave signal is modulated via orthogonal frequency-division multiplexing (OFDM). The carrier frequency is $60.48$ GHz and the signal bandwidth is $5$ MHz.

In the experiment, the locations of the transmitter and receiver are consistent with those in the simulator. $100$ trials are measured for each gesture via the passive sensing system. Following the signal processing in [7], the spectrogram of hand gestures can be computed through the cross-ambiguity function (CAF). As a result, an experimental dataset with $100$ spectrograms per gesture is obtained, as illustrated in Fig. 5.

V-B Performance of Gesture Recognition

First of all, it can be observed from Fig. 7 that the spectrograms from the real experiment and the CASTER simulator exhibit similar time-Doppler patterns. To further demonstrate the fidelity of the proposed simulator in gesture recognition, the following six training and testing schemes are adopted, all with the same image recognition model, ResNet18 [31]:

• Scheme $1$: The training set consists of $60$ simulated spectrograms for each gesture, and the test set consists of $40$ measured ones for each gesture;

• Scheme $2$: The training set consists of $50$ simulated spectrograms and $10$ measured ones for each gesture, and the test set consists of $40$ measured ones for each gesture;

• Scheme $3$: The training set consists of $40$ simulated spectrograms and $20$ measured ones for each gesture, and the test set consists of $40$ measured ones for each gesture;

• Scheme $4$: The training set consists of $30$ simulated spectrograms and $30$ measured ones for each gesture, and the test set consists of $40$ measured ones for each gesture;

• Scheme $5$: The training set consists of $60$ measured spectrograms for each gesture, and the test set consists of $40$ measured ones for each gesture;

• Scheme $6$: The training set consists of $60$ simulated spectrograms for each gesture, and the test set consists of $40$ simulated ones for each gesture.
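The six schemes listed above can be expressed compactly as a per-gesture split configuration. The sketch below is illustrative only: the random selection of samples, the assumption that the $40$ test samples are held out before the training samples are drawn, and the names `SCHEMES` and `split_one_gesture` are not taken from the experiments.

```python
import random

# Per gesture: (simulated training samples, measured training samples, test source)
SCHEMES = {
    1: (60, 0, "measured"),
    2: (50, 10, "measured"),
    3: (40, 20, "measured"),
    4: (30, 30, "measured"),
    5: (0, 60, "measured"),
    6: (60, 0, "simulated"),
}
TEST_PER_GESTURE = 40


def split_one_gesture(scheme, n_available=100, seed=0):
    """Return (train, test) lists of (source, sample_index) for one gesture."""
    n_sim, n_meas, test_source = SCHEMES[scheme]
    rng = random.Random(seed)
    pools = {}
    for source in ("simulated", "measured"):
        ids = list(range(n_available))  # 100 spectrograms per gesture and source
        rng.shuffle(ids)
        pools[source] = [(source, i) for i in ids]
    # Hold out the test samples first so they can never leak into training.
    test = pools[test_source][:TEST_PER_GESTURE]
    pools[test_source] = pools[test_source][TEST_PER_GESTURE:]
    train = pools["simulated"][:n_sim] + pools["measured"][:n_meas]
    return train, test


train, test = split_one_gesture(scheme=3)
```

In every scheme the training set has $60$ samples per gesture, so differences in accuracy reflect the composition of the training data rather than its size.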

The overall results of the gesture recognition are shown in Fig. 8, and the confusion charts of the $6$ schemes are shown in Fig. 9. It can be observed that an accuracy of $83.0\%$ (Scheme $1$) is achieved when the simulated dataset is used for training and the experimental dataset is used for testing. On the other hand, there are still performance losses of roughly $16.0\%$ and $17.0\%$ compared with Schemes $5$ and $6$, respectively, indicating that the difference between the simulated and experimental datasets is not negligible. One method to mitigate this difference is to mix some experimental samples into the simulated dataset. It can be observed from the results of Schemes $2$, $3$, and $4$ that mixing in some experimental samples significantly improves the testing accuracy. Moreover, the enhanced recognition accuracy converges to $98.5\%$ for Schemes $3$ and $4$. However, this is still $0.5\%$ lower than the accuracy achieved with Scheme $5$, which indicates inherent feature distinctions between the simulated and experimental datasets.

Furthermore, it is apparent from Fig. 9(a) that the recognition of “beckoning” and “plugging” is not sufficiently accurate: the recognition accuracies are $70\%$ and $75\%$, respectively. Moreover, a $22.5\%$ confusion probability exists between the gestures of “beckoning” and “pushing and pulling”, indicating that some of the simulated samples of “beckoning” are similar to the experimental samples of “pushing and pulling”.

To qualitatively support the aforementioned observations, we applied dimensionality reduction techniques to the extracted features (the network output before the fully-connected classifier layer) of the entire simulated and experimental datasets, using the ResNet18 model trained by Scheme $1$. Specifically, t-distributed Stochastic Neighbor Embedding (t-SNE) [32] and Principal Component Analysis (PCA) [33] were employed to visualize and analyze the high-dimensional features ($512$ dimensions for ResNet18) of the datasets, as illustrated in Fig. 10. It can be observed that, although an $83.0\%$ gesture recognition accuracy is achieved, the feature distributions of different gestures are not sufficiently separated. Moreover, the simulated and experimental features for the gesture “Scaling” are not well aligned, indicating inherent feature distinctions between the simulated and experimental datasets. These could be regarded as limitations of the proposed simulator, since the real-world channel is more complex than the simulated one due to the impacts of multipath and non-ideal hardware.
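The PCA step of this analysis can be sketched as below, using an SVD of the centered feature matrix. The features here are synthetic placeholders standing in for the $512$-dimensional ResNet18 outputs, and the artificial offset between the two groups, as well as the function name `pca_project`, are assumptions introduced purely to make the domain gap visible in the example.

```python
import numpy as np


def pca_project(features, n_components=2):
    """Project row-wise features onto their leading principal components."""
    centered = features - features.mean(axis=0)
    # Rows of vt are the principal directions, ordered by singular value.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    explained = (s ** 2) / np.sum(s ** 2)
    return centered @ vt[:n_components].T, explained[:n_components]


rng = np.random.default_rng(0)
# Placeholder 512-dimensional features (NOT real network outputs).
sim_feats = rng.normal(0.0, 1.0, size=(100, 512))
exp_feats = rng.normal(0.0, 1.0, size=(100, 512))
exp_feats[:, 0] += 20.0  # artificial domain shift along one feature axis

proj, explained = pca_project(np.vstack([sim_feats, exp_feats]))
# Distance between the two domain centroids along the first component.
gap = abs(proj[:100, 0].mean() - proj[100:, 0].mean())
```

A large centroid gap along the leading components is one simple indicator that two feature sets are misaligned, which is the kind of effect that the scatter plots are used to reveal.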

V-C Improvement via Transfer Learning

The transfer learning technique [16] is applied in this part to relieve the above issue of feature distinction. In this context, the simulated dataset is referred to as the source domain, and the experimental dataset as the target domain. The adversarial discriminative domain adaptation (ADDA) [17] is adopted to align the feature distributions of the source and target domains. The ResNet18 model trained by Scheme $1$ in the previous part serves as the source domain gesture recognition model, and the target domain gesture recognition model is initialized with the same architecture and parameters. Then, an additional $50$ unlabeled experimental samples are added to the simulated dataset of Scheme $1$. The combined data are used to alternately train a domain discriminator, which distinguishes the source and target domain features, and fine-tune the feature extractor of the target model, such that the target features mimic the source feature representation. The details of the ADDA method can be found in [17]. The confusion chart of the testing result and the t-SNE visualization of the feature spaces of the simulated and experimental datasets after ADDA are shown in Fig. 11. The recognition accuracy is boosted to $96.5\%$, and the feature spaces of the simulated and experimental datasets are well aligned. This result indicates that the feature distinctions between simulated and experimental datasets can be mitigated significantly by transfer learning.
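For reference, the alternating optimization in ADDA can be summarized by the two adversarial objectives formulated in [17]. The notation below follows [17] rather than this paper: $M_{s}$ and $M_{t}$ denote the source and target feature extractors, $D$ the domain discriminator, and $\mathbf{X}_{s}$, $\mathbf{X}_{t}$ the source (simulated) and target (experimental) sample distributions.

```latex
% Discriminator step: M_s and M_t fixed, D learns to tell the domains apart
\min_{D} \; \mathcal{L}_{\mathrm{adv}_D} =
  -\,\mathbb{E}_{x_s \sim \mathbf{X}_s}\!\left[\log D\big(M_s(x_s)\big)\right]
  -\,\mathbb{E}_{x_t \sim \mathbf{X}_t}\!\left[\log\big(1 - D\big(M_t(x_t)\big)\big)\right]

% Target extractor step: D fixed, M_t is updated so that target features
% are classified as source features (inverted-label GAN loss)
\min_{M_t} \; \mathcal{L}_{\mathrm{adv}_M} =
  -\,\mathbb{E}_{x_t \sim \mathbf{X}_t}\!\left[\log D\big(M_t(x_t)\big)\right]
```

Since neither objective involves target labels, the additional experimental samples can remain unlabeled, and the classifier trained on the source domain is reused on top of $M_{t}$ at test time.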

VI Conclusion

In this paper, a computer-vision-assisted wireless channel simulator, namely the CASTER simulator, is proposed to generate high-fidelity datasets for hand gesture recognition. In the simulator, the target hand is modeled by $21$ ellipsoid primitives, and the ray-tracing method is adopted to calculate the channel impulse responses. Moreover, a video gesture catcher is proposed to capture real motion data of gestures. In the experiments with $5$ different gestures, both a real dataset via experiment and a simulated dataset via the CASTER simulator are obtained. An accuracy of $83.0\%$ can be achieved in simulation-to-reality inference, i.e., using the simulated and experimental datasets in model training and inference, respectively. Moreover, this accuracy can be boosted to $96.5\%$ by transfer learning, i.e., fine-tuning the gesture recognition model with a small amount of unlabeled real data.

References

[1] Y. Ma, G. Zhou, and S. Wang, “WiFi sensing with channel state information: A survey,” ACM Computing Surveys (CSUR), vol. 52, no. 3, pp. 1–36, Jun 2019.
[2] J. Liu, H. Liu, Y. Chen, Y. Wang, and C. Wang, “Wireless sensing for human activity: A survey,” IEEE Communications Surveys & Tutorials, vol. 22, no. 3, pp. 1629–1645, thirdquarter 2020.
[3] Y. Zhang, Y. Zheng, K. Qian, G. Zhang, Y. Liu, C. Wu, and Z. Yang, “Widar3.0: Zero-effort cross-domain gesture recognition with Wi-Fi,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 11, pp. 8671–8688, Nov 2021.
[4] J. W. Smith, S. Thiagarajan, R. Willis, Y. Makris, and M. Torlak, “Improved static hand gesture classification on deep convolutional neural networks using novel sterile training technique,” IEEE Access, vol. 9, pp. 10 893–10 902, Jan 2021.
[5] W. Li, R. J. Piechocki, K. Woodbridge, C. Tang, and K. Chetty, “Passive wifi radar for human sensing using a stand-alone access point,” IEEE Transactions on Geoscience and Remote Sensing, vol. 59, no. 3, pp. 1986–1998, March 2020.
[6] H. Sun, L. G. Chia, and S. G. Razul, “Through-wall human sensing with wifi passive radar,” IEEE Transactions on Aerospace and Electronic Systems, vol. 57, no. 4, pp. 2135–2148, Aug 2021.
[7] J. Li, C. Yu, Y. Luo, Y. Sun, and R. Wang, “Passive motion detection via mmWave communication system,” in 2022 IEEE 95th Vehicular Technology Conference (VTC2022-Spring). IEEE, 2022, pp. 1–6.
[8] R. Du, H. Hua, H. Xie, X. Song, Z. Lyu, M. Hu, Y. Xin, S. McCann, M. Montemurro, T. X. Han et al., “An overview on IEEE 802.11bf: WLAN sensing,” arXiv preprint arXiv:2310.17661, 2023.
[9] M. Zhang et al., “Channel models for WLAN sensing systems,” IEEE 802.11 Documents, Sep 2021. [Online]. Available: https://mentor.ieee.org/802.11/documents?isdcn=Meihong
[10] G. Li, S. Wang, J. Li, R. Wang, X. Peng, and T. X. Han, “Wireless sensing with deep spectrogram network and primitive based autoregressive hybrid channel model,” in 2021 IEEE 22nd International Workshop on Signal Processing Advances in Wireless Communications (SPAWC). IEEE, 2021, pp. 481–485.
[11] G. Li, S. Wang, J. Li, R. Wang, F. Liu, X. Peng, T. X. Han, and C. Xu, “Integrated sensing and communication from learning perspective: An sdp3 approach,” IEEE Internet of Things Journal, Feb 2023.
[12] “Wigig tools.” [Online]. Available: https://github.com/wigig-tools
[13] S. Vishwakarma, W. Li, C. Tang, K. Woodbridge, R. Adve, and K. Chetty, “Simhumalator: An open-source end-to-end radar simulator for human activity recognition,” IEEE Aerospace and Electronic Systems Magazine, vol. 37, no. 3, pp. 6–22, March 2021.
[14] B. Erol, C. Karabacak, S. Z. Gürbüz, and A. C. Gürbüz, “Simulation of human micro-doppler signatures with kinect sensor,” in 2014 IEEE Radar Conference, 2014, pp. 0863–0868.
[15] R. Boulic, N. M. Thalmann, and D. Thalmann, “A global human walking model with real-time kinematic personification,” The visual computer, vol. 6, pp. 344–358, Nov 1990.
[16] F. Zhuang, Z. Qi, K. Duan, D. Xi, Y. Zhu, H. Zhu, H. Xiong, and Q. He, “A comprehensive survey on transfer learning,” Proceedings of the IEEE, vol. 109, no. 1, pp. 43–76, Jan 2021.
[17] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell, “Adversarial discriminative domain adaptation,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 2962–2971.
[18] N. Wheatland, Y. Wang, H. Song, M. Neff, V. Zordan, and S. Jörg, “State of the art in hand and finger modeling and animation,” in Computer Graphics Forum, vol. 34, no. 2. Wiley Online Library, 2015, pp. 735–760.
[19] Z. Cao, G. Hidalgo, T. Simon, S.-E. Wei, and Y. Sheikh, “Openpose: Realtime multi-person 2d pose estimation using part affinity fields,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 1, pp. 172–186, Jan 2021.
[20] F. Zhang, V. Bazarevsky, A. Vakunov, A. Tkachenka, G. Sung, C.-L. Chang, and M. Grundmann, “Mediapipe hands: On-device real-time hand tracking,” arXiv preprint arXiv:2006.10214, 2020.
[21] J. Romero, D. Tzionas, and M. J. Black, “Embodied hands: Modeling and capturing hands and bodies together,” ACM Transactions on Graphics, (Proc. SIGGRAPH Asia), vol. 36, no. 6, Nov. 2017.
[22] E. F. Knott, J. F. Schaeffer, and M. T. Tulley, Radar cross section. SciTech Publishing, 2004.
[23] K. D. Trott, “Stationary phase derivation for rcs of an ellipsoid,” IEEE Antennas Wireless Propag. Lett., vol. 6, pp. 240–243, Jun 2007.
[24] E. Marchand, H. Uchiyama, and F. Spindler, “Pose estimation for augmented reality: A hands-on survey,” IEEE Transactions on Visualization and Computer Graphics, vol. 22, no. 12, pp. 2633–2651, Dec 2016.
[25] “Opencv: Perspective-n-point (pnp) pose computation.” [Online]. Available: https://docs.opencv.org/4.x/d5/d1f/calib3d_solvePnP.html
[26] K. Levenberg, “A method for the solution of certain non-linear problems in least squares,” Quarterly of applied mathematics, vol. 2, no. 2, pp. 164–168, 1944.
[27] G. Casiez, N. Roussel, and D. Vogel, “1€ filter: a simple speed-based low-pass filter for noisy input in interactive systems,” in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 2012, pp. 2527–2530.
[28] C. De Boor, A practical guide to splines. Springer-Verlag New York, 1978, vol. 27.
[29] National Instruments. Usrp-2954. [Online]. Available: https://www.ni.com/en-us/shop/model/usrp-2954.html
[30] Sivers IMA. Evk 06002/00. [Online]. Available: https://www.siversima.com/product/evk-06002-00/
[31] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
[32] L. Van der Maaten and G. Hinton, “Visualizing data using t-sne.” Journal of machine learning research, vol. 9, no. 11, Nov 2008.
[33] C. M. Bishop, Pattern Recognition and Machine Learning. Springer, 2006.