CASTER: A Computer-Vision-Assisted Wireless Channel Simulator for Gesture Recognition

Zhenyu Ren, Guoliang Li, Chenqing Ji, Shuai Wang, Chao Yu, Rui Wang
Abstract

In this paper, a computer-vision-assisted simulation method is proposed to address the issue of training dataset acquisition for wireless hand gesture recognition. In the existing literature, classifying gestures via wireless channel estimation requires massive training samples to be measured in a consistent environment, which consumes significant effort. In the proposed CASTER simulator, however, the training dataset can be simulated from existing videos. Particularly, in the channel simulation, a gesture is represented by a sequence of snapshots, and the channel impulse response of each snapshot is calculated by tracing the rays scattered off a primitive-based hand model. Moreover, the CASTER simulator relies on existing video clips to extract the motion data of gestures. Thus, the massive measurements of the wireless channel can be eliminated. The experiments first demonstrate an 83.0% average recognition accuracy of simulation-to-reality inference in recognizing 5 categories of gestures. Moreover, this accuracy can be boosted to 96.5% via transfer learning.

Index Terms:
Wireless hand gesture recognition, channel model, simulation-to-reality inference.
Figure 1: Illustration of primitive-based hand model and channel simulation scenario.

I Introduction

Sensing is becoming one of the core services of the next-generation wireless systems. There have been a significant number of works on wireless sensing, particularly the machine-learning-based human motion recognition (HMR), via channel state information (CSI)[1, 2, 3, 4] or passive architecture[5, 6, 7]. In most of these works, a significant number of labeled wireless signals should be collected and processed for the training of motion recognition models, which might be infeasible in many applications. In this paper, we would like to show that it is possible to generate the above training dataset for hand gesture recognition via channel simulation, instead of real measurement.

In fact, there have been a number of works on the extension of sophisticated communication channel models, such that the effects of sensing target on the channel impulse response are incorporated. Hence, the channel simulation based on these models might be used for motion recognition. For instance, the Data-Driven Hybrid Channel (DAHC) model of IEEE 802.11bf specification [8, 9] divided wireless channel into two parts: the target-unrelated components and the target-related components. The existing methods of communication channel modeling can be applied on the former; whereas the primitive-based human body model [10] was utilized to compute the latter. A similar channel model was also used in [11] for the optimization of communication and sensing performance. The WiGig Tools [12], developed by National Institute of Standards and Technology (NIST), enriched existing quasi-deterministic channel ray-tracers with supplementary target-related rays (T-Rays), such that the consistent effects of human motion could be included. Moreover, the methods for simulating radar echo signals off the human body were proposed in [13, 14]. All the above works relied on the primitive-based human body model [10, 15], where the hand was modeled as a single ellipsoid. Thus, these methods cannot model fine-grained hand gestures.

In order to facilitate the machine-learning-based HMR with the above channel models, diversified motion data are required to drive the primitive-based human body model in the channel simulation. Depth cameras and wearable sensors were used in [13, 14] to obtain sufficient body motion data for channel simulation. Nevertheless, to the best of our knowledge, there is no study on the capture of hand gestures for channel simulation. Moreover, it is unknown if conventional monocular cameras, instead of depth cameras, could obtain the motion data with adequate accuracy in the applications of wireless HMR. Note that the monocular cameras are of lower cost, and it is much more convenient to obtain hand gesture video clips of monocular cameras from online sources.

In this paper, we would like to shed some light on the above issues by proposing a Computer-vision-Assisted wireless channel SimulaTor for gEsture Recognition, namely CASTER. The proposed CASTER simulator is composed of a channel generator and a video gesture catcher. In the channel generator, the target hand is modeled with 21 primitives, and the channel impulse response is calculated by tracing the rays scattered off all the primitives. Based on the hand model, a gesture is represented by a sequence of snapshots, and the channel impulse response of each snapshot is obtained separately. In the video gesture catcher, the trajectories of the 21 primitives in one gesture are captured from videos of a conventional monocular camera. Thus, the catcher provides an efficient way to retrieve motion data for the channel generator. In order to demonstrate the high fidelity of the proposed CASTER simulator, we use the simulated dataset of channel impulse responses to train a gesture recognition model and use a passive sensing system [7] to measure the real channel for model testing. It is shown that an 83.0% average recognition accuracy of simulation-to-reality inference can be achieved in recognizing 5 categories of gestures. Moreover, this accuracy can be boosted to 96.5% via transfer learning [16], where the gesture recognition model trained on the simulated dataset is further fine-tuned with a small amount of unlabeled real measurements according to the adversarial discriminative domain adaptation (ADDA) method in [17]. The main advantages of the proposed CASTER simulator are summarized below:

  • Conventional measurement of training dataset for wireless HMR is replaced by channel simulation and gesture video recognition, saving the significant cost of real experiments.

  • In the proposed CASTER simulator, the locations of the signal transmitter, sensing receiver, target hand, and scattering clusters can be adjusted freely to adapt to heterogeneous sensing scenarios.

As a result, the proposed CASTER simulator has the potential to customize the gesture recognition models for heterogeneous scenarios without real measurements.

The remainder of this paper is organized as follows. The simulator framework is elaborated in Section II. The channel generator is presented in Section III, and the video gesture catcher is presented in Section IV. The performance of the CASTER simulator is evaluated in Section V. Finally, the conclusion is drawn in Section VI.

In this paper, we use the following notations: non-bold letters are used to denote scalar values, bold lowercase letters (e.g., $\mathbf{a}$) are used to denote column vectors, bold uppercase letters (e.g., $\mathbf{A}$) are used to denote matrices, and $|\mathbf{a}|$ and $\mathbf{a}^{T}$ denote the L2-norm and transpose of vector $\mathbf{a}$, respectively.

II Simulator Framework

The proposed CASTER simulator is developed with the primitive-based hand model. In order to extract high-fidelity channel impulse responses from existing videos, the CASTER simulator is composed of the channel generator and video gesture catcher. The former generates a sequence of channel impulse response snapshots given arbitrary hand gestures and arbitrary locations of the transmitter and receiver. The latter captures the parameters of real hand motions from existing videos as the former’s input. As a result, the CASTER simulator is able to provide datasets for the training of the hand gesture recognition model without real channel measurement.

As depicted in Fig. 1, the locations of the transmitter, receiver and the target hand can be arbitrary in the channel generator. A gesture is represented as a sequence of snapshots, with an interval of $\Delta t_{\text{s}}$ seconds. In each snapshot, the channel is assumed to be quasi-static, and the channel impulse response is calculated via the primitive-based method [10]. Particularly, the hand is modeled via 21 keypoints (joints) and 21 ellipsoids (primitives) connecting the keypoints. The non-line-of-sight (NLoS) channel components via the hand can be approximated by the 21 rays respectively scattered off the centers of all primitives. Hence, the channel impulse response of one snapshot can be obtained by aggregating all the rays from the transmitter to the receiver, including the line-of-sight (LoS) ray, the NLoS ones scattered off the target hand, and the others scattered by the environment.

As a remark, the 21-keypoint hand model is widely recognized in the fields of computer vision and biomedical engineering [18]. Renowned hand models, such as OpenPose [19], MediaPipe [20], and MANO [21], are all based on this 21-keypoint representation. It provides the same degrees of freedom in describing complex hand and finger motions as explained in [18]: a human hand consists of 21 joints, yielding 27 degrees of freedom, which are the same as those of the 21-keypoint hand model.

Moreover, the proposed video gesture catcher first extracts the 3-dimensional (3D) coordinates of the hand keypoints from each video frame in a local hand world coordinate system via a machine learning technique, then converts the trajectories of the keypoints from the local hand world coordinate system to a global camera coordinate system, and finally eliminates the spurious hops and jitter of the trajectories via low-pass filtering. Since the interval between two video frames, denoted as $\Delta t_{\text{v}}$, is usually much larger than $\Delta t_{\text{s}}$, an interpolation is necessary to fill a sufficient number of snapshots between two video frames. As a remark, the video clips for the gesture catcher can be recorded in an arbitrary environment as long as the desired hand gestures can be identified by the gesture catcher. Hence, they could be obtained from massive online sources.
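The smoothing and interpolation steps of the catcher can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the moving-average filter, the linear interpolation, the function names `smooth_trajectory` and `interpolate_snapshots`, and the array layout are all assumptions, since the text only states that low-pass filtering and interpolation are applied.

```python
import numpy as np

def smooth_trajectory(kp, window=5):
    """Moving-average low-pass filter along the frame axis.

    kp: array of shape (frames, 21, 3) holding keypoint coordinates.
    The moving average is an assumed filter choice; window must be odd.
    """
    pad = window // 2
    # Repeat the edge frames so the output keeps the same number of frames.
    padded = np.pad(kp, ((pad, pad), (0, 0), (0, 0)), mode="edge")
    kernel = np.ones(window) / window
    return np.apply_along_axis(
        lambda x: np.convolve(x, kernel, mode="valid"), 0, padded
    )

def interpolate_snapshots(kp, dt_v, dt_s):
    """Linearly interpolate from frame interval dt_v to snapshot interval dt_s."""
    frames = kp.shape[0]
    t_video = np.arange(frames) * dt_v
    # Number of snapshots that fit in the clip (small epsilon for rounding).
    n_snap = int(np.floor(t_video[-1] / dt_s + 1e-9)) + 1
    t_snap = np.arange(n_snap) * dt_s
    flat = kp.reshape(frames, -1)
    cols = [np.interp(t_snap, t_video, flat[:, c]) for c in range(flat.shape[1])]
    return t_snap, np.stack(cols, axis=1).reshape(n_snap, *kp.shape[1:])
```

In this sketch, a 25 frames-per-second clip (`dt_v = 0.04`) resampled with `dt_s = 0.01` yields four snapshots per frame interval; a higher-order interpolator could be substituted without changing the interface.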

III Channel Generator

Without loss of generality, the generation of the channel impulse response for the $t$-th snapshot ($\forall t$) is elaborated in this section. As shown in Fig. 1, the rays from the transmitter to the receiver can be categorized into two parts: target-unrelated components and target-related components. The former refers to the LoS ray and the NLoS rays scattered by the static environment, and the latter refers to the NLoS rays scattered off the target hand. Particularly, let $h(\tau,t)$ and $u(\tau,t)$ be the overall channel impulse response and the target-related channel impulse response of the $t$-th snapshot, and $v(\tau)$ be the time-invariant target-unrelated channel impulse response. Following the channel model in [9], we have

$$h(\tau,t)=u(\tau,t)+v(\tau), \tag{1}$$

where the generation of $u(\tau,t)$ and $v(\tau)$ is elaborated in the following parts respectively.

III-A Target-Related Channel Components

Let $\mathbf{p}_{\text{t}}$ and $\mathbf{p}_{\text{r}}$ be the coordinates of the transmitter and the receiver respectively, and $\mathbf{p}_{i}(t)$ and $\mathbf{p}_{j}(t)$ be the coordinates of the two joints associated with the $n$-th primitive in the $t$-th snapshot ($\forall n,t$). Hence, the center of the $n$-th primitive is $\mathbf{p}^{\text{c}}_{n}(t)=[\mathbf{p}_{i}(t)+\mathbf{p}_{j}(t)]/2$. As previously mentioned, each primitive is modeled as an ellipsoid. The length of the axis connecting the two joints is denoted as $2l_{n}(t)$, where

$$l_{n}(t)=|\mathbf{p}_{i}(t)-\mathbf{p}_{j}(t)|/2. \tag{2}$$

Moreover, the lengths of the other two axes are identical, denoted as $2r_{n}(t)$. Usually, $r_{n}(t)<l_{n}(t)$, and we choose $r_{n}(t)=l_{n}(t)/2$. Hence, we shall refer to the axis connecting the two joints as the long axis of the ellipsoid. As a remark, the primitive size ($r_{n}$ and $l_{n}$) varies slightly over time due to the non-rigid nature of human motion.
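The primitive geometry defined above (center, Eq. (2), and the choice $r_n(t)=l_n(t)/2$) can be computed per snapshot as in the following sketch. The function name `primitive_geometry` and the `pairs` argument are hypothetical: the text does not list which two joints each primitive connects, so the connectivity is left as an input.

```python
import numpy as np

def primitive_geometry(keypoints, pairs):
    """Return centers, half-lengths l_n and radii r_n for one snapshot.

    keypoints: (21, 3) array of joint coordinates.
    pairs: list of (i, j) joint indices, one per primitive (hypothetical
           input; the joint connectivity is not specified in the text).
    """
    kp = np.asarray(keypoints, dtype=float)
    i, j = np.asarray(pairs).T
    centers = (kp[i] + kp[j]) / 2.0                   # primitive centers
    l = np.linalg.norm(kp[i] - kp[j], axis=1) / 2.0   # Eq. (2)
    r = l / 2.0                                       # choice r_n = l_n / 2
    return centers, l, r
```

Because $l_n(t)$ is recomputed from the joint coordinates of every snapshot, the slight variation of the primitive size over time is captured automatically.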

Figure 2: Bistatic RCS estimation for the $n$-th primitive.

Let $R_{\text{t}}^{n}(t)=|\mathbf{p}_{\text{t}}-\mathbf{p}^{\text{c}}_{n}(t)|$ be the distance between the transmitter and the $n$-th primitive center, $R_{\text{r}}^{n}(t)=|\mathbf{p}_{\text{r}}-\mathbf{p}^{\text{c}}_{n}(t)|$ be the distance between the receiver and the $n$-th primitive center, $G_{\text{t}}^{n}(t)$ and $G_{\text{r}}^{n}(t)$ be the transmit and receive antenna gains in the directions of the incident ray $\mathbf{p}_{\text{t}}-\mathbf{p}^{\text{c}}_{n}(t)$ and the scattered ray $\mathbf{p}^{\text{c}}_{n}(t)-\mathbf{p}_{\text{r}}$, $\sigma_{n}(t)$ be the bistatic radar cross section (RCS) of the $n$-th primitive, $c$ be the speed of light, and $f_{\text{c}}$ and $\lambda$ be the carrier frequency and wavelength respectively. The response of the path scattered off the $n$-th primitive can be expressed as

$$u_{n}(\tau,t)=\lambda\sqrt{\frac{\sigma_{n}(t)G^{n}_{\text{t}}(t)G^{n}_{\text{r}}(t)}{(4\pi)^{3}\left(R_{\text{t}}^{n}(t)R_{\text{r}}^{n}(t)\right)^{2}}}\,e^{-\mathrm{j}\phi_{n}(t)}\delta(\tau-\tau_{n}(t)), \tag{3}$$

where $\delta(a)$ is the impulse function, whose value is 1 when $a=0$ and 0 otherwise, while $\tau_{n}(t)=\left[R_{\text{t}}^{n}(t)+R_{\text{r}}^{n}(t)\right]/c$ and $\phi_{n}(t)=2\pi f_{\text{c}}\tau_{n}(t)$ measure the delay and phase shift.
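For a single primitive, Eq. (3) amounts to one complex gain and one delay. The sketch below evaluates both; the function name `scattered_path`, the scalar antenna gains defaulting to 1 (isotropic antennas), and the example carrier frequency are assumptions made for illustration only.

```python
import numpy as np

C = 299_792_458.0  # speed of light in m/s

def scattered_path(p_t, p_r, p_c, sigma, f_c, g_t=1.0, g_r=1.0):
    """Complex gain and delay of the ray scattered at p_c, following Eq. (3).

    g_t and g_r are scalar antenna gains; the defaults of 1.0 are an
    assumption (isotropic antennas), not a value taken from the text.
    """
    p_t, p_r, p_c = (np.asarray(p, dtype=float) for p in (p_t, p_r, p_c))
    r_t = np.linalg.norm(p_t - p_c)      # transmitter-to-primitive distance
    r_r = np.linalg.norm(p_r - p_c)      # receiver-to-primitive distance
    lam = C / f_c                        # wavelength
    tau = (r_t + r_r) / C                # propagation delay
    phi = 2.0 * np.pi * f_c * tau        # phase shift
    amp = lam * np.sqrt(
        sigma * g_t * g_r / ((4.0 * np.pi) ** 3 * (r_t * r_r) ** 2)
    )
    return amp * np.exp(-1j * phi), tau
```

The squared magnitude of the returned gain equals the received-to-transmitted power ratio of the bistatic radar range equation, so doubling both distances reduces the amplitude by a factor of four.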

Moreover, the calculation of the bistatic RCS $\sigma_{n}(t)$ follows the method in [22, 23]. As depicted in Fig. 2, let $\theta^{n}_{\text{t}}(t)$ and $\theta^{n}_{\text{r}}(t)$ represent the incident and scattered elevation angles respectively, $\phi^{n}_{\text{t}}(t)$ and $\phi^{n}_{\text{r}}(t)$ represent the incident and scattered azimuth angles respectively, and $\boldsymbol{v}_{n}(t)=[\mathbf{p}_{i}(t)-\mathbf{p}_{j}(t)]/(2l_{n}(t))$ represent the normalized vector along the long axis. Then we have

$$\theta_{\text{t}}^{n}(t)=\arccos\left((\mathbf{p}^{\text{c}}_{n}(t)-\mathbf{p}_{\text{t}})^{T}\boldsymbol{v}_{n}(t)/R^{n}_{\text{t}}(t)\right), \tag{4}$$
$$\theta_{\text{r}}^{n}(t)=\arccos\left((\mathbf{p}^{\text{c}}_{n}(t)-\mathbf{p}_{\text{r}})^{T}\boldsymbol{v}_{n}(t)/R^{n}_{\text{r}}(t)\right), \tag{5}$$

and

$$|\phi^{n}_{\text{r}}(t)-\phi^{n}_{\text{t}}(t)|=\arccos\left(\frac{(\mathbf{p}^{\text{c}}_{n}(t)-\tilde{\mathbf{p}}_{\text{t}}(t))^{T}(\mathbf{p}^{\text{c}}_{n}(t)-\tilde{\mathbf{p}}_{\text{r}}(t))}{|\mathbf{p}^{\text{c}}_{n}(t)-\tilde{\mathbf{p}}_{\text{t}}(t)|\,|\mathbf{p}^{\text{c}}_{n}(t)-\tilde{\mathbf{p}}_{\text{r}}(t)|}\right), \tag{6}$$

where

$$\tilde{\mathbf{p}}_{\text{t}}(t)=\mathbf{p}_{\text{t}}-\boldsymbol{v}_{n}(t)(\mathbf{p}_{\text{t}}-\mathbf{p}^{\text{c}}_{n}(t))^{T}\boldsymbol{v}_{n}(t)$$

and

$$\tilde{\mathbf{p}}_{\text{r}}(t)=\mathbf{p}_{\text{r}}-\boldsymbol{v}_{n}(t)(\mathbf{p}_{\text{r}}-\mathbf{p}^{\text{c}}_{n}(t))^{T}\boldsymbol{v}_{n}(t)$$

denote the projections of the transmitter's and receiver's locations onto the plane that contains the center of the $n$-th ellipsoid and is perpendicular to its long axis in the $t$-th snapshot. As a result, the bistatic RCS $\sigma_{n}(t)$ of the $n$-th ellipsoid in the $t$-th snapshot is given by (7).

$$\sigma_{n}(t)=\frac{4\pi r_{n}^{4}(t)l_{n}^{2}(t)\left[(1+\cos\theta^{n}_{\text{t}}(t)\cos\theta^{n}_{\text{r}}(t))\cos(\phi^{n}_{\text{r}}(t)-\phi^{n}_{\text{t}}(t))+\sin\theta^{n}_{\text{t}}(t)\sin\theta^{n}_{\text{r}}(t)\right]^{2}}{\left[r_{n}^{2}(t)\left(\sin^{2}\theta^{n}_{\text{t}}(t)+\sin^{2}\theta^{n}_{\text{r}}(t)+2\sin\theta^{n}_{\text{t}}(t)\sin\theta^{n}_{\text{r}}(t)\cos(\phi^{n}_{\text{r}}(t)-\phi^{n}_{\text{t}}(t))\right)+l_{n}^{2}(t)\left(\cos\theta^{n}_{\text{t}}(t)+\cos\theta^{n}_{\text{r}}(t)\right)^{2}\right]^{2}}. \tag{7}$$
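Equations (4) to (7) can be chained into a single routine, sketched below under stated assumptions. The function name `bistatic_rcs` and the optional `r` argument are hypothetical; `r` defaults to the choice $r_n = l_n/2$ made earlier and is exposed only so that other aspect ratios can be checked. The sketch does not guard against degenerate geometries, such as a transmitter or receiver lying exactly on the long axis, where the projected vectors in Eq. (6) vanish.

```python
import numpy as np

def bistatic_rcs(p_t, p_r, p_i, p_j, r=None):
    """Bistatic RCS of one ellipsoid primitive, following Eqs. (4)-(7)."""
    p_t, p_r, p_i, p_j = (
        np.asarray(p, dtype=float) for p in (p_t, p_r, p_i, p_j)
    )
    p_c = (p_i + p_j) / 2.0                    # primitive center
    l = np.linalg.norm(p_i - p_j) / 2.0        # Eq. (2)
    r = l / 2.0 if r is None else r            # default: r_n = l_n / 2
    v = (p_i - p_j) / (2.0 * l)                # unit vector along long axis

    def elevation(p):
        # Eqs. (4)-(5): angle between the ray direction and the long axis.
        d = p_c - p
        return np.arccos(np.clip(d @ v / np.linalg.norm(d), -1.0, 1.0))

    def project(p):
        # Projection onto the plane through p_c perpendicular to v.
        return p - v * ((p - p_c) @ v)

    th_t, th_r = elevation(p_t), elevation(p_r)
    a, b = p_c - project(p_t), p_c - project(p_r)
    # Eq. (6): only the cosine of the azimuth difference enters Eq. (7).
    cos_dphi = np.clip(
        a @ b / (np.linalg.norm(a) * np.linalg.norm(b)), -1.0, 1.0
    )

    st, sr = np.sin(th_t), np.sin(th_r)
    ct, cr = np.cos(th_t), np.cos(th_r)
    num = 4.0 * np.pi * r**4 * l**2 * ((1.0 + ct * cr) * cos_dphi + st * sr) ** 2
    den = (
        r**2 * (st**2 + sr**2 + 2.0 * st * sr * cos_dphi)
        + l**2 * (ct + cr) ** 2
    ) ** 2
    return num / den  # Eq. (7)
```

As a sanity check on the reconstruction, setting `r` equal to the half-length turns the ellipsoid into a sphere, and for a co-located transmitter and receiver Eq. (7) then reduces to $\pi r_n^2$, the well-known backscatter cross section of a large sphere.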

 

Aggregating the NLoS rays scattered off all the primitives, the target-related channel impulse response can be written as

$$u(\tau,t)=\sum_{n=1}^{21}u_{n}(\tau,t). \tag{8}$$

III-B Target-Unrelated Channel Components

The CASTER simulator models the environment by $K$ static scatterers. For the $k$-th NLoS ray, let $\sigma_{k}$ be the RCS of the scatterer, $G_{\text{t}}^{k}$ and $G_{\text{r}}^{k}$ be the transmit and receive antenna gains, and $R_{\text{t}}^{k}$ and $R_{\text{r}}^{k}$ be the distances from the scatterer to the transmitter and the receiver, respectively. The NLoS components of the target-unrelated channel impulse response can be written as

$$v_{\text{NLoS}}(\tau)=\sum_{k=1}^{K}\lambda\sqrt{\frac{\sigma_{k}G_{\text{t}}^{k}G_{\text{r}}^{k}}{(4\pi)^{3}(R_{\text{t}}^{k}R_{\text{r}}^{k})^{2}}}\,e^{-\mathrm{j}\phi_{k}}\delta(\tau-\tau_{k}), \tag{9}$$

where τk=(Rtk+Rrk)/csubscript𝜏𝑘subscriptsuperscript𝑅𝑘tsubscriptsuperscript𝑅𝑘r𝑐\tau_{k}=\left(R^{k}_{\text{t}}+R^{k}_{\text{r}}\right)/citalic_τ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT = ( italic_R start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT start_POSTSUBSCRIPT t end_POSTSUBSCRIPT + italic_R start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT start_POSTSUBSCRIPT r end_POSTSUBSCRIPT ) / italic_c and ϕk=2πfcτksubscriptitalic-ϕ𝑘2𝜋subscript𝑓csubscript𝜏𝑘\phi_{k}=2\pi f_{\text{c}}\tau_{k}italic_ϕ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT = 2 italic_π italic_f start_POSTSUBSCRIPT c end_POSTSUBSCRIPT italic_τ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT.

Moreover, let the transmit and receive antenna gains in the direction of the LoS path be $G_{\text{t, LoS}}$ and $G_{\text{r, LoS}}$, and let the distance between the transmitter and receiver be $R_{\text{LoS}}$. The LoS component of the target-unrelated channel is modeled via the following free-space model:

$$v_{\text{LoS}}(\tau)=\frac{\lambda\sqrt{G_{\text{t, LoS}}G_{\text{r, LoS}}}}{4\pi R_{\text{LoS}}}\,e^{-\mathrm{j}\phi_{\text{LoS}}}\delta(\tau-\tau_{\text{LoS}}), \tag{10}$$

where $\tau_{\text{LoS}}=R_{\text{LoS}}/c$ and $\phi_{\text{LoS}}=2\pi f_{\text{c}}\tau_{\text{LoS}}$. As a result, according to [9], the target-unrelated channel impulse response can be written as

$$v(\tau)=v_{\text{LoS}}(\tau)+v_{\text{NLoS}}(\tau). \tag{11}$$
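As a rough numerical illustration of (9)-(11), the following Python sketch evaluates the delay and complex gain of each static path. The function names, the scatterer parameters, and the 60 GHz carrier used in the example are illustrative assumptions, not values taken from the CASTER implementation.

```python
import cmath
import math

C = 299_792_458.0  # speed of light in m/s


def nlos_tap(sigma, g_t, g_r, r_t, r_r, f_c):
    """Delay and complex gain of one scattered ray, following (9)."""
    lam = C / f_c
    tau = (r_t + r_r) / C
    amp = lam * math.sqrt(sigma * g_t * g_r / ((4 * math.pi) ** 3 * (r_t * r_r) ** 2))
    return tau, amp * cmath.exp(-1j * 2 * math.pi * f_c * tau)


def los_tap(g_t, g_r, r_los, f_c):
    """Delay and complex gain of the LoS path, following (10)."""
    lam = C / f_c
    tau = r_los / C
    amp = lam * math.sqrt(g_t * g_r) / (4 * math.pi * r_los)
    return tau, amp * cmath.exp(-1j * 2 * math.pi * f_c * tau)


def static_channel(los, scatterers, f_c):
    """Return v(tau) in (11) as a list of (delay, complex gain) taps."""
    taps = [los_tap(*los, f_c)]
    taps += [nlos_tap(*s, f_c) for s in scatterers]
    return taps


# Hypothetical scene: 60 GHz carrier, 1 m LoS link, two static scatterers.
taps = static_channel(
    los=(1.0, 1.0, 1.0),                     # (G_t, G_r, R_LoS)
    scatterers=[(0.5, 1.0, 1.0, 1.2, 1.5),   # (sigma, G_t, G_r, R_t, R_r)
                (0.1, 1.0, 1.0, 2.0, 2.5)],
    f_c=60e9,
)
for delay, gain in taps:
    print(f"delay = {delay * 1e9:.3f} ns, |gain| = {abs(gain):.3e}")
```

Each Dirac term in (9) and (10) is represented here by one (delay, gain) pair, so the sum in (11) is simply the concatenated list of taps.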

IV Video Gesture Catcher

As mentioned in the previous section, the motion of the target hand is characterized by the trajectories of the $21$ keypoints in a sequence of snapshots, denoted as $\mathbf{p}_{i}(t)$, $i=1,2,\ldots,21$. We leverage Mediapipe [20] to extract the keypoint trajectories from videos captured by monocular cameras, and two issues in the conversion are addressed in this section. Mediapipe localizes the keypoints in each video frame, with the positions represented in a coordinate system whose origin is at the hand center, namely the hand world coordinate system. However, it is difficult to calculate the Doppler frequency in such a coordinate system, as the hand center is moving. Hence, we first transform the coordinates to a unified coordinate system by solving the Perspective-n-Point (PnP) problem [24], after which the fake hops on the trajectories are smoothed. Moreover, the typical video rate of $30$ frames per second is not sufficient for estimating the Doppler frequencies of gestures. For example, the typical Doppler frequencies of gestures on $60$ GHz signals are around $800$ Hz (assuming a maximum radial velocity of $4$ meters per second), which requires at least $1600$ snapshots per second. Hence, interpolation is introduced such that the channel impulse response can be generated at a shorter interval.
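The snapshot-rate requirement quoted above can be checked with a few lines of arithmetic. The sketch below assumes the one-way Doppler relation $f_{\text{D}} = v f_{\text{c}}/c$, chosen here only because it reproduces the $800$ Hz figure; the helper names are hypothetical.

```python
import math

C = 3.0e8  # speed of light in m/s (rounded)


def doppler_shift(radial_velocity, carrier_freq):
    """Doppler shift f_D = v * f_c / c in Hz (assumed one-way relation)."""
    return radial_velocity * carrier_freq / C


def min_snapshot_rate(max_doppler):
    """Nyquist rate needed to resolve Doppler shifts up to max_doppler."""
    return 2.0 * max_doppler


f_d = doppler_shift(4.0, 60e9)        # 800 Hz
rate = min_snapshot_rate(f_d)         # 1600 snapshots per second
upsampling = math.ceil(rate / 30.0)   # snapshots needed per 30 fps frame interval
print(f_d, rate, upsampling)
```

Under these assumptions, a 30 fps video must be upsampled by a factor of at least 54, which is why interpolation between frames is needed.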

IV-A Conversion of Coordinate Systems

Figure 3: Illustration of three coordinate systems. (a) Pixel coordinate system; (b) hand world coordinate system; (c) camera coordinate system.

For ease of exposition, we first introduce the following three coordinate systems. The two-dimensional (2D) pixel coordinate system, in the unit of pixels, is used to identify the positions of hand keypoints in each video frame. Its origin is usually at the upper left corner of each frame, as shown in Fig. 3. The three-dimensional (3D) hand world coordinate system, in the unit of meters, measures the positions of hand keypoints in the real world with respect to the hand center. Moreover, the 3D camera coordinate system, in the unit of meters, measures the positions of hand keypoints with respect to the static lens of the camera that captures the videos. Mediapipe is able to identify the $21$ keypoints and localize them in the first two coordinate systems. Because the hand center is usually in motion while the camera is static, the trajectories in the camera coordinate system, rather than those in the hand world coordinate system, should be used to calculate the Doppler frequencies. Thus, the coordinates of the hand keypoints $\mathbf{p}_{i}(t)$, $i=1,2,\ldots,21$, the transmitter $\mathbf{p}_{\text{t}}$, and the receiver $\mathbf{p}_{\text{r}}$ defined in the previous section should be measured in the camera coordinate system. The above three coordinate systems are illustrated in Fig. 3.

Define the coordinates of the $i$-th keypoint ($i=1,2,\ldots,21$) in the pixel, hand world, and camera coordinate systems as $(u_{i},v_{i})$, $(x_{i}^{\text{w}},y_{i}^{\text{w}},z_{i}^{\text{w}})$, and $(x_{i},y_{i},z_{i})$, respectively, where the snapshot index $t$ is omitted in this section for simplicity. Let $f$ be the focal length in the unit of pixels and $(c_{x},c_{y})$ be the coordinates of the image center in the pixel coordinate system. We define the camera intrinsic matrix $\mathbf{A}$ as

$$\mathbf{A}=\begin{bmatrix}f&0&c_{x}\\ 0&f&c_{y}\\ 0&0&1\end{bmatrix}. \tag{12}$$

Hence, the relation between the 2D pixel and 3D camera coordinate systems can be expressed as

$$z_{i}[u_{i}\ v_{i}\ 1]^{T}=\mathbf{A}[x_{i}\ y_{i}\ z_{i}]^{T}. \tag{13}$$

Let $\mathbf{R}\in\mathbb{R}^{3\times 3}$ and $\mathbf{t}$ be the rotation matrix and translation vector from the hand world coordinate system to the camera coordinate system, respectively. We define the camera extrinsic matrix $\mathbf{T}$ and perspective projection matrix $\mathbf{\Pi}$ as follows:

$$\mathbf{T}=\begin{bmatrix}\mathbf{R}&\mathbf{t}\\ \mathbf{0}_{1\times 3}&1\end{bmatrix}, \tag{14}$$
$$\mathbf{\Pi}=\begin{bmatrix}\mathbf{I}_{3\times 3}&\mathbf{0}_{3\times 1}\end{bmatrix}, \tag{15}$$

where $\mathbf{I}_{3\times 3}$ denotes the $3\times 3$ identity matrix, and $\mathbf{0}_{1\times 3}$ and $\mathbf{0}_{3\times 1}$ are the three-dimensional all-zero row and column vectors, respectively. According to [24], the relation between the hand world and camera coordinate systems is given by

$$[x_{i}\ y_{i}\ z_{i}\ 1]^{T}=\mathbf{T}[x_{i}^{\text{w}}\ y_{i}^{\text{w}}\ z_{i}^{\text{w}}\ 1]^{T}. \tag{16}$$

As a result, the relation between the hand world and pixel coordinate systems can be described as

$$z_{i}[u_{i}\ v_{i}\ 1]^{T}=\mathbf{A}\mathbf{\Pi}\mathbf{T}[x_{i}^{\text{w}}\ y_{i}^{\text{w}}\ z_{i}^{\text{w}}\ 1]^{T}. \tag{17}$$

For ease of exposition, we denote the projection from the hand world coordinate system to the pixel coordinate system as the following function $\mathcal{P}$:

$$\begin{aligned}\left[u_{i}\ v_{i}\right]^{T}&=\mathcal{P}([x_{i}^{\text{w}}\ y_{i}^{\text{w}}\ z_{i}^{\text{w}}]^{T},\mathbf{R},\mathbf{t},\mathbf{A})\\ &=\frac{1}{z_{i}}[\mathbf{I}_{2\times 2}\ \mathbf{0}_{2\times 1}]\,\mathbf{A}\underbrace{(\mathbf{R}[x_{i}^{\text{w}}\ y_{i}^{\text{w}}\ z_{i}^{\text{w}}]^{T}+\mathbf{t})}_{=[x_{i}\ y_{i}\ z_{i}]^{T}}.\end{aligned} \tag{18}$$

Mediapipe provides the coordinates $(u_{i},v_{i})$ and $(x_{i}^{\text{w}},y_{i}^{\text{w}},z_{i}^{\text{w}})$ of all the keypoints ($i=1,2,\ldots,21$) in each video frame. Hence, their coordinates in the camera coordinate system can be calculated once the rotation matrix $\mathbf{R}$ and translation vector $\mathbf{t}$ are known.
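To make the projection in (18) concrete, here is a minimal NumPy sketch. The intrinsic parameters, the pose, and the keypoint position are made-up values for illustration, and `project` is a hypothetical helper name.

```python
import numpy as np


def project(p_world, R, t, A):
    """Map a 3D hand-world point to pixel coordinates, following (18)."""
    p_cam = R @ p_world + t    # camera coordinates [x_i, y_i, z_i]
    uvw = A @ p_cam            # equals z_i * [u_i, v_i, 1], as in (13)
    return uvw[:2] / p_cam[2], p_cam


# Hypothetical camera: 800-pixel focal length, 640x480 image.
A = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)                    # hand axes aligned with the camera axes
t = np.array([0.0, 0.0, 0.5])    # hand center 0.5 m in front of the lens

uv, p_cam = project(np.array([0.05, -0.02, 0.0]), R, t, A)
print(uv)      # pixel coordinates of the keypoint
print(p_cam)   # the same keypoint in the camera coordinate system
```

The function returns both the pixel coordinates and the intermediate camera coordinates, since the latter are the quantities needed for the Doppler calculation.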

In fact, the parameters in the camera intrinsic matrix $\mathbf{A}$ can be measured in advance, and the rotation matrix $\mathbf{R}$ and translation vector $\mathbf{t}$ can then be estimated via (18) for $i=1,2,\ldots,21$. Particularly, given the coordinates of the $21$ keypoints in the pixel and hand world coordinate systems, the estimation of the rotation matrix $\mathbf{R}$ and translation vector $\mathbf{t}$ can be formulated as follows:

$$\begin{aligned}\min_{\mathbf{R},\mathbf{t}}\quad&\sum_{i=1}^{21}\left|(u_{i},v_{i})-\mathcal{P}([x_{i}^{\text{w}}\ y_{i}^{\text{w}}\ z_{i}^{\text{w}}]^{T},\mathbf{R},\mathbf{t},\mathbf{A})\right|^{2},\\ \mathrm{s.t.}\quad&\mathbf{R}\mathbf{R}^{T}=\mathbf{I}_{3\times 3},\ \det(\mathbf{R})=1,\end{aligned} \tag{19}$$

where $\det(\cdot)$ represents the determinant of a matrix.

The above problem is referred to as the Perspective-n-Point (PnP) problem [24]. It can be solved via the cv2.solvePnP function from the popular computer vision library OpenCV [25], where the Levenberg-Marquardt optimization method [26] is adopted.

IV-B Motion Smoothing and Snapshot Interpolation

Figure 4: Comparison of simulated spectrograms via CASTER. (a) before one-euro filter smoothing; (b) after one-euro filter smoothing.

Because of keypoint detection errors in Mediapipe, the detected keypoint trajectories may contain fake hops or jitters that do not actually exist. These lead to false alarms of high Doppler frequencies (as depicted in Fig. 4). In order to generate a high-fidelity dataset for gesture recognition model training, a low-pass filter, namely the one-euro filter [27], is adopted to smooth both trajectories and velocities, followed by snapshot interpolation between neighboring video frames.

Let $\mathbf{q}_{i,k}=[x_{i,k}\ y_{i,k}\ z_{i,k}]^{T}$ and $\hat{\mathbf{q}}_{i,k}=[\hat{x}_{i,k}\ \hat{y}_{i,k}\ \hat{z}_{i,k}]^{T}$ be the positions of the $i$-th keypoint in the $k$-th frame before and after the low-pass filtering, respectively, and let $\dot{\mathbf{q}}_{i,k}=[\dot{x}_{i,k}\ \dot{y}_{i,k}\ \dot{z}_{i,k}]^{T}$ and $\hat{\dot{\mathbf{q}}}_{i,k}=[\hat{\dot{x}}_{i,k}\ \hat{\dot{y}}_{i,k}\ \hat{\dot{z}}_{i,k}]^{T}$ be the estimated velocities of the $i$-th keypoint in the $k$-th frame before and after the low-pass filtering, respectively. Initializing $\hat{\mathbf{q}}_{i,1}$ with $\mathbf{q}_{i,1}$, the trajectory smoothing for the $i$-th keypoint in the $k$-th frame is given by

$$\hat{o}_{i,k}=\alpha_{i,k}o_{i,k}+(1-\alpha_{i,k})\hat{o}_{i,k-1},\quad\forall i,\ k\geq 2, \tag{20}$$

where $o$ stands for each of the coordinates $x$, $y$, and $z$, and

$$\alpha_{i,k}=\frac{1}{1+\frac{1}{2\pi\Delta t_{\text{v}}\left(f_{\text{c}_{\text{min}}}+\beta|\hat{\dot{o}}_{i,k}|\right)}}$$

is the smoothing factor, $\Delta t_{\text{v}}$ is the video frame interval, $f_{\text{c}_{\text{min}}}$ is the minimum cutoff frequency, and $\beta$ is the speed coefficient of the update. Moreover, the velocity in the above equation can be calculated as

$$\hat{\dot{o}}_{i,k}=\gamma\dot{o}_{i,k}+(1-\gamma)\hat{\dot{o}}_{i,k-1},\quad\forall i,\ k\geq 2, \tag{21}$$

where $\dot{o}_{i,k}=(o_{i,k}-\hat{o}_{i,k-1})/\Delta t_{\text{v}}$, $\hat{\dot{o}}_{i,1}$ is initialized with $0$, and $\gamma$ is the fixed smoothing factor.

Algorithm 1 One-euro low-pass filter for keypoint trajectory smoothing.
1: Input:
  • $\{\mathbf{q}_{i,k}=[x_{i,k}\ y_{i,k}\ z_{i,k}]^{T} \mid i\in\{1,\ldots,21\},\ k\in\{1,\ldots,K\}\}$, where $\mathbf{q}_{i,k}$ denotes the location of the $i$-th keypoint in the $k$-th frame.
  • $f_{\text{c}_{\text{min}}}$: Minimum cutoff frequency for position.
  • $\beta$: Speed coefficient.
  • $\gamma$: Smoothing factor for velocity.
  • $\Delta t_{\text{v}}$: Video frame interval.
2: Output:
  • $\{\hat{\mathbf{q}}_{i,k}=[\hat{x}_{i,k}\ \hat{y}_{i,k}\ \hat{z}_{i,k}]^{T} \mid i\in\{1,\ldots,21\},\ k\in\{1,\ldots,K\}\}$, where $\hat{\mathbf{q}}_{i,k}$ denotes the location of the $i$-th keypoint in the $k$-th frame after smoothing.
3: for $k\leftarrow 2$ to $K$ do ▷ Iteration over frames.
4:     for $i\leftarrow 1$ to $21$ do ▷ Iteration over keypoints.
5:         for each coordinate $o\in\{x,y,z\}$ do
6:              $\hat{o}_{i,1}\leftarrow o_{i,1}$, $\hat{\dot{o}}_{i,1}\leftarrow 0$ ▷ Initialization with the first frame.
7:              $\dot{o}_{i,k}=(o_{i,k}-\hat{o}_{i,k-1})/\Delta t_{\text{v}}$
8:              $\hat{\dot{o}}_{i,k}=\gamma\dot{o}_{i,k}+(1-\gamma)\hat{\dot{o}}_{i,k-1}$ ▷ Equation (21): smooth velocity.
9:              $\alpha_{i,k}=\frac{1}{1+\frac{1}{2\pi\Delta t_{\text{v}}(f_{\text{c}_{\text{min}}}+\beta|\hat{\dot{o}}_{i,k}|)}}$ ▷ Update smoothing factor for position.
10:              $\hat{o}_{i,k}=\alpha_{i,k}o_{i,k}+(1-\alpha_{i,k})\hat{o}_{i,k-1}$ ▷ Equation (20): smooth position.
11:         end for
12:         $\hat{\mathbf{q}}_{i,k}=[\hat{x}_{i,k}\ \hat{y}_{i,k}\ \hat{z}_{i,k}]^{T}$
13:     end for
14: end for
15: return $\{\hat{\mathbf{q}}_{i,k}=[\hat{x}_{i,k}\ \hat{y}_{i,k}\ \hat{z}_{i,k}]^{T} \mid i\in\{1,\ldots,21\},\ k\in\{1,\ldots,K\}\}$

The overall smoothing procedure via the one-euro filter is illustrated in Alg. 1. Specifically, the smoothed velocity $\hat{\dot{o}}_{i,k}$ and the smoothed trajectory $\hat{o}_{i,k}$ of the $i$-th keypoint in the $k$-th frame are obtained by repeatedly applying the two first-order low-pass filters (21) and (20) to the velocity and position of the $i$-th keypoint, respectively. This procedure effectively eliminates false hops and jitters in the detected keypoint trajectories while preserving the motion features. An example of the smoothing result is shown in Fig. 4.
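The one-euro filter of [27] can be sketched for a single coordinate of a single keypoint as follows. This is a minimal illustration, not the exact implementation behind Alg. 1: the function names, the default values of `min_cutoff`, `beta`, and `d_cutoff`, and the use of consecutive raw samples to estimate the velocity are all assumptions made here.

```python
import math


def smoothing_factor(cutoff_hz, dt):
    """Smoothing factor of a first-order low-pass filter with the given cutoff."""
    tau = 1.0 / (2.0 * math.pi * cutoff_hz)
    return 1.0 / (1.0 + tau / dt)


def one_euro_filter(samples, dt, min_cutoff=1.0, beta=0.01, d_cutoff=1.0):
    """Smooth one coordinate of one keypoint over all frames.

    samples: raw coordinate values, one per video frame.
    dt: frame interval in seconds (e.g. 1/30 for a 30 fps video).
    min_cutoff, beta, d_cutoff: hypothetical tuning values, not from the paper.
    """
    smoothed = [samples[0]]
    velocity = 0.0
    for k in range(1, len(samples)):
        # Low-pass filter the finite-difference velocity (assumed form).
        raw_velocity = (samples[k] - samples[k - 1]) / dt
        a_d = smoothing_factor(d_cutoff, dt)
        velocity = a_d * raw_velocity + (1.0 - a_d) * velocity
        # Faster motion raises the cutoff, so less smoothing (and lag) is applied.
        alpha = smoothing_factor(min_cutoff + beta * abs(velocity), dt)
        # Position update with the same structure as Equation (20).
        smoothed.append(alpha * samples[k] + (1.0 - alpha) * smoothed[-1])
    return smoothed
```

Applying the function independently to the $x$, $y$, and $z$ coordinates of each of the 21 keypoints would mirror the nested loops of Alg. 1.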

Finally, we adopt the cubic spline interpolation method [28] to insert $\Delta t_{\text{v}}/\Delta t_{\text{s}}-1$ positions of the $i$-th keypoint ($\forall i$) between every two neighboring frames (say $\hat{\mathbf{q}}_{i,k}$ and $\hat{\mathbf{q}}_{i,k+1}$, $\forall k$), and denote the position of the $i$-th keypoint in the $t$-th snapshot as $\mathbf{p}_{i}(t)$.
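The upsampling step can be illustrated with a small pure-Python cubic spline. The sketch below assumes uniformly spaced frames, an integer upsampling factor, and natural boundary conditions (zero second derivative at both ends); the boundary condition and the function names are choices made for this illustration and are not specified in the text.

```python
def natural_spline_second_derivatives(y):
    """Second derivatives of a natural cubic spline through unit-spaced knots."""
    n = len(y)
    m = [0.0] * n  # natural boundary: m[0] = m[n-1] = 0 (assumption)
    if n < 3:
        return m
    size = n - 2
    rhs = [6.0 * (y[k - 1] - 2.0 * y[k] + y[k + 1]) for k in range(1, n - 1)]
    # Thomas algorithm for the tridiagonal system m[k-1] + 4 m[k] + m[k+1] = rhs.
    c_prime = [0.0] * size
    d_prime = [0.0] * size
    c_prime[0] = 1.0 / 4.0
    d_prime[0] = rhs[0] / 4.0
    for j in range(1, size):
        denom = 4.0 - c_prime[j - 1]
        c_prime[j] = 1.0 / denom
        d_prime[j] = (rhs[j] - d_prime[j - 1]) / denom
    m[n - 2] = d_prime[size - 1]
    for j in range(size - 2, -1, -1):
        m[j + 1] = d_prime[j] - c_prime[j] * m[j + 2]
    return m


def upsample_natural_cubic(y, factor):
    """Insert factor - 1 interpolated values between neighbouring samples."""
    m = natural_spline_second_derivatives(y)
    dense = []
    for k in range(len(y) - 1):
        for s in range(factor):
            u = s / factor  # position inside the interval [k, k + 1]
            dense.append(
                m[k] * (1.0 - u) ** 3 / 6.0
                + m[k + 1] * u ** 3 / 6.0
                + (y[k] - m[k] / 6.0) * (1.0 - u)
                + (y[k + 1] - m[k + 1] / 6.0) * u
            )
    dense.append(y[-1])
    return dense
```

Here `factor` plays the role of $\Delta t_{\text{v}}/\Delta t_{\text{s}}$, and the function would be called once per coordinate of each keypoint.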

V Evaluation of CASTER Simulator

Figure 5: Illustration of the simulated and experimental datasets from CASTER, where some example spectrograms are plotted.
Figure 6: Facilities and scenario of experiment.

In this section, the high fidelity of the CASTER simulator in the applications of gesture recognition is demonstrated. Specifically, the generation of gesture datasets via CASTER simulator and real measurement is first elaborated. Then, the recognition performance via the above two datasets is discussed.

V-A Simulation and Experimental Datasets

In order to verify the quality of the dataset generated by the CASTER simulator, 500 video clips of 5 gestures, including “Pushing and Pulling”, “Beckoning”, “Rubbing Fingers”, “Plugging” (slicing forward with all fingers together), and “Scaling” (spreading thumb, index finger, and middle finger), were recorded using an ordinary monocular camera at a rate of 30 frames per second (fps). The motion data for the hand model are then extracted via the video gesture catcher.

On the other hand, in the channel generator, the locations of the transmitter, receiver, and target hand center are $[0\,\text{m}, -0.1\,\text{m}, -1.5\,\text{m}]$, $[0.2\,\text{m}, -0.1\,\text{m}, 0.1\,\text{m}]$, and $[0\,\text{m}, 0\,\text{m}, 0.4\sim 0.8\,\text{m}]$, respectively. Moreover, in order to model the target-unrelated channel, $K$ static RCSs are randomly generated from a normal distribution with a mean value of $0.005\,\text{m}^{2}$ and a standard deviation of $0.001\,\text{m}^{2}$. These RCSs are associated with scatterers that are randomly located within a $2\,\text{m}\times 2\,\text{m}\times 2\,\text{m}$ cuboid centered at the receiver. The positions of these scatterers are used to calculate the associated parameters $G_{\text{t}}^{k}$, $G_{\text{r}}^{k}$, $R_{\text{t}}^{k}$, and $R_{\text{r}}^{k}$.
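The generation of the static scatterers can be sketched as below. The transmitter and receiver coordinates and the RCS statistics are taken from the text. The uniform placement inside the cuboid, the interpretation of $R_{\text{t}}^{k}$ and $R_{\text{r}}^{k}$ as scatterer-to-transmitter and scatterer-to-receiver distances, and all identifiers are assumptions of this sketch. The gains $G_{\text{t}}^{k}$ and $G_{\text{r}}^{k}$ depend on the antenna pattern and are not computed here.

```python
import math
import random

TRANSMITTER = (0.0, -0.1, -1.5)  # metres, from the text
RECEIVER = (0.2, -0.1, 0.1)      # metres, from the text


def generate_static_scatterers(num, seed=0):
    """Draw `num` static scatterers around the receiver."""
    rng = random.Random(seed)
    scatterers = []
    for _ in range(num):
        # RCS ~ N(0.005 m^2, (0.001 m^2)^2), as stated in the text.
        rcs = rng.gauss(0.005, 0.001)
        # Uniform position in a 2 m cube centred at the receiver (assumed law).
        pos = tuple(c + rng.uniform(-1.0, 1.0) for c in RECEIVER)
        scatterers.append({
            "rcs_m2": rcs,
            "position_m": pos,
            "range_tx_m": math.dist(pos, TRANSMITTER),  # assumed meaning of R_t^k
            "range_rx_m": math.dist(pos, RECEIVER),     # assumed meaning of R_r^k
        })
    return scatterers
```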

Thus, 100 sequences of channel impulse responses for each gesture are obtained via the proposed CASTER simulator with a sampling rate of 2000 snapshots per second. Then, one spectrogram, illustrating the Doppler frequency versus time, is calculated for each video clip (each sequence of channel impulse responses) by applying the short-time Fourier transform (STFT) with a window of 0.125 seconds (250 snapshots). As a result, a simulated dataset of 500 spectrograms for the recognition of 5 gestures is obtained, as illustrated in Fig. 5.
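As a rough illustration of the STFT step, the sketch below turns a sequence of complex channel coefficients (one per snapshot) into a time-Doppler power map. The sampling rate and window length follow the text, which imply a Doppler resolution of 2000/250 = 8 Hz per bin. The Hann window, the 50% overlap, the reduction of each impulse response to a single complex coefficient, and the function names are assumptions of this sketch.

```python
import cmath
import math

FS = 2000   # snapshots per second (from the text)
WIN = 250   # 0.125 s window (from the text)


def hann(n):
    """Symmetric Hann window (the window type is an assumption)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def dft(x):
    """Plain O(n^2) discrete Fourier transform, adequate for a short sketch."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * f * t / n) for t in range(n))
        for f in range(n)
    ]


def stft_spectrogram(signal, win=WIN, hop=WIN // 2):
    """Return a list of frames, each holding the power in every Doppler bin."""
    w = hann(win)
    frames = []
    for start in range(0, len(signal) - win + 1, hop):
        segment = [signal[start + i] * w[i] for i in range(win)]
        frames.append([abs(v) ** 2 for v in dft(segment)])
    return frames


def bin_to_doppler_hz(b, win=WIN, fs=FS):
    """Map a DFT bin to Doppler frequency; the upper half is negative Doppler."""
    return (b if b < win / 2 else b - win) * fs / win
```

For a synthetic tone such as `[cmath.exp(2j * math.pi * 200 * t / FS) for t in range(500)]`, the strongest bin of every frame maps to a Doppler frequency of 200 Hz.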

In order to measure the real Doppler spectrum of gestures, an integrated passive sensing and communication system working in the millimeter wave (mmWave) band is developed as in our previous work [7]. As illustrated in Fig. 6, at the transmitter, an NI USRP-2954R [29] is utilized to generate an intermediate frequency (IF) signal at 500 MHz, which is subsequently up-converted to 60 GHz and transmitted using a Sivers 60 GHz phased array [30]. At the receiver, two phased arrays are connected to a single USRP device to receive signals from the reference and surveillance channels, respectively. The transmitted mmWave signal is modulated via orthogonal frequency-division multiplexing (OFDM). The carrier frequency is 60.48 GHz and the signal bandwidth is 5 MHz.

Figure 7: Spectrogram comparison of 5 gestures generated by CASTER (first row) and the experiment (second row). Panels: (a) Pushing & Pulling (real); (b) Beckoning (real); (c) Rubbing Fingers (real); (d) Plugging (real); (e) Scaling (real); (f) Pushing & Pulling (sim); (g) Beckoning (sim); (h) Rubbing Fingers (sim); (i) Plugging (sim); (j) Scaling (sim).

In the experiment, the locations of the transmitter and receiver are consistent with those in the simulator. For each gesture, 100 trials are measured via the passive sensing system. Following the signal processing in [7], the spectrogram of hand gestures is computed through the cross-ambiguity function (CAF). As a result, an experimental dataset with 100 spectrograms per gesture is obtained, as illustrated in Fig. 5.
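For readers unfamiliar with the CAF, the sketch below evaluates its textbook discrete form, correlating the surveillance signal with delayed and Doppler-shifted copies of the reference signal. It does not reproduce the processing chain of [7]: the circular delay, the integer Doppler bins, and the function name are simplifying assumptions.

```python
import cmath
import math


def cross_ambiguity(ref, sur, max_delay, doppler_bins):
    """Magnitude of the CAF on a grid of integer delays and Doppler bins.

    ref, sur: equal-length complex sequences (reference and surveillance).
    Returns a dict mapping (delay, doppler_bin) to |CAF|.
    """
    n = len(ref)
    caf = {}
    for tau in range(max_delay + 1):
        for b in doppler_bins:
            acc = 0j
            for t in range(n):
                # Circular delay keeps the sketch short (assumption).
                acc += (
                    sur[t]
                    * ref[(t - tau) % n].conjugate()
                    * cmath.exp(-2j * math.pi * b * t / n)
                )
            caf[(tau, b)] = abs(acc)
    return caf
```

A surveillance signal that is a delayed and Doppler-shifted copy of the reference produces a CAF peak at the matching delay and Doppler bin. One common way to form a spectrogram is to stack the Doppler profiles at the target delay over successive time windows.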

Figure 8: Gesture recognition accuracy of the 6 training and testing schemes.
Figure 9: Confusion charts of the 6 training and testing schemes. Panels (a) to (f) correspond to Schemes 1 to 6, respectively.
Figure 10: t-SNE visualization for the feature spaces of simulation and experimental datasets. Five gesture categories are distinguished by different colors, with simulation sample features denoted by star shapes and experimental sample features represented by solid circle shapes.
Figure 11: Gesture recognition result after ADDA. (a) confusion chart of simulation-to-reality inference; (b) t-SNE visualization of the feature spaces in simulation and experimental datasets.

V-B Performance of Gesture Recognition

First of all, it can be observed from Fig. 7 that the spectrograms from the real experiment and the CASTER simulator exhibit similar time-Doppler patterns. To further demonstrate the high fidelity of the proposed simulator in the applications of gesture recognition, the following six training and testing schemes are adopted with the same image recognition model, ResNet18 [31]:

  • Scheme 1: The training set consists of 60 simulated spectrograms for each gesture, and the test set consists of 40 measured ones for each gesture;

  • Scheme 2: The training set consists of 50 simulated spectrograms and 10 measured ones for each gesture, and the test set consists of 40 measured ones for each gesture;

  • Scheme 3: The training set consists of 40 simulated spectrograms and 20 measured ones for each gesture, and the test set consists of 40 measured ones for each gesture;

  • Scheme 4: The training set consists of 30 simulated spectrograms and 30 measured ones for each gesture, and the test set consists of 40 measured ones for each gesture;

  • Scheme 5: The training set consists of 60 measured spectrograms for each gesture, and the test set consists of 40 measured ones for each gesture;

  • Scheme 6: The training set consists of 60 simulated spectrograms for each gesture, and the test set consists of 40 simulated ones for each gesture.
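The six schemes can be written down compactly as a configuration table plus a helper that draws per-gesture sample indices. The sample counts are those listed above. The random shuffling, the way the test samples are kept disjoint from the training samples of the same domain, and all identifiers are assumptions of this sketch, since the text does not describe how the samples were partitioned.

```python
import random

# (simulated training, measured training, test domain, test size) per gesture
SCHEMES = {
    1: {"train_sim": 60, "train_meas": 0, "test": ("meas", 40)},
    2: {"train_sim": 50, "train_meas": 10, "test": ("meas", 40)},
    3: {"train_sim": 40, "train_meas": 20, "test": ("meas", 40)},
    4: {"train_sim": 30, "train_meas": 30, "test": ("meas", 40)},
    5: {"train_sim": 0, "train_meas": 60, "test": ("meas", 40)},
    6: {"train_sim": 60, "train_meas": 0, "test": ("sim", 40)},
}


def split_indices(scheme, per_gesture=100, seed=0):
    """Return (train_sim, train_meas, (test_domain, test)) index lists for one gesture."""
    cfg = SCHEMES[scheme]
    rng = random.Random(seed)
    pools = {"sim": list(range(per_gesture)), "meas": list(range(per_gesture))}
    rng.shuffle(pools["sim"])
    rng.shuffle(pools["meas"])
    test_domain, n_test = cfg["test"]
    # Training samples come from the front of each pool and test samples from
    # the back, so the two never overlap within a domain (assumed protocol).
    train_sim = pools["sim"][: cfg["train_sim"]]
    train_meas = pools["meas"][: cfg["train_meas"]]
    test = pools[test_domain][-n_test:]
    return train_sim, train_meas, (test_domain, test)
```

In every scheme the training set holds 60 spectrograms per gesture and the test set holds 40, so only the mixture of simulated and measured samples changes.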

The overall results of the gesture recognition are shown in Fig. 8, and the confusion charts of the 6 schemes are shown in Fig. 9. It can be observed that an accuracy of 83.0% (Scheme 1) can be achieved if the simulated dataset is used for training and the experimental dataset is used for testing. On the other hand, there is still a performance loss of roughly 16.0% and 17.0% compared with Schemes 5 and 6, respectively, indicating that the difference between the simulated and experimental datasets is not negligible. One method to mitigate this difference is to mix some experimental samples into the simulated dataset. It can be observed from the results of Schemes 2, 3, and 4 that mixing in some experimental samples significantly improves the testing accuracy. Moreover, the enhanced recognition accuracy converges to 98.5% for Schemes 3 and 4. However, this is still 0.5% lower than the accuracy achieved with Scheme 5. This difference indicates the inherent feature distinctions between the simulated and experimental datasets.

Furthermore, it is apparent from Fig. 9(a) that the recognition of “Beckoning” and “Plugging” is not sufficiently accurate: the recognition accuracy is 70% and 75%, respectively. Moreover, a confusion probability of 22.5% exists between the gestures “Beckoning” and “Pushing and Pulling”, indicating that some of the simulated samples of “Beckoning” are similar to the experimental samples of “Pushing and Pulling”.

To qualitatively support the aforementioned observations, we applied dimensionality reduction techniques to the extracted features (the network output before the fully-connected classifier layer) of the entire simulation and experimental datasets, using the ResNet18 model trained by Scheme 1. Specifically, t-distributed Stochastic Neighbor Embedding (t-SNE) [32] and Principal Component Analysis (PCA) [33] were employed to visualize and analyze the high-dimensional features (512 dimensions for ResNet18) of the dataset, as illustrated in Fig. 10. It can be observed that although a gesture recognition accuracy of 83.0% is achieved, the feature distributions of different gestures are not sufficiently separated. Moreover, the simulated and experimental features for the gesture “Scaling” are not well aligned, indicating the inherent feature distinctions between the simulated and experimental datasets. These could be regarded as limitations of the proposed simulator, since the real-world channel is more complex than the simulated one due to the impacts of multipath and non-ideal hardware.
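To make the dimensionality-reduction step concrete, the sketch below extracts the first principal component of a feature matrix by power iteration. It covers only PCA (t-SNE is considerably more involved), operates on a toy 2-D input rather than 512-dimensional ResNet18 features, and uses function and parameter names chosen for this illustration.

```python
import math
import random


def pca_first_component(rows, iters=200, seed=0):
    """Return (unit direction, projected scores) of the first principal component."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    # Random start vector so it is not orthogonal to the dominant direction.
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iters):
        # Multiply by the (unnormalised) covariance matrix: w = X^T (X v).
        proj = [sum(c[j] * v[j] for j in range(d)) for c in centered]
        w = [sum(proj[i] * centered[i][j] for i in range(n)) for j in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # data has no variance along v
        v = [x / norm for x in w]
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores
```

Repeating the procedure after removing the found component from the data would give the second axis of a 2-D scatter plot.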

V-C Improvement via Transfer Learning

The transfer learning technique [16] is applied in this part to relieve the above issue of feature distinction. In this context, the simulated dataset is referred to as the source domain, and the experimental dataset as the target domain. The adversarial discriminative domain adaptation (ADDA) [17] is adopted to align the feature distributions of the source and target domains. The ResNet18 model trained by Scheme 1 in the previous part serves as the source domain gesture recognition model, and the target domain gesture recognition model is initialized with the same architecture and parameters. Then, an additional 50 unlabeled experimental samples are added to the simulated dataset of Scheme 1. The combined data are used to alternately train a domain discriminator, which distinguishes the source and target domain features, and fine-tune the feature extractor of the target model, such that the source feature representation is mimicked. The details of the ADDA method can be found in [17]. The confusion chart of the testing result and the t-SNE visualization of the feature spaces of the simulation and experimental datasets after ADDA are shown in Fig. 11. The recognition accuracy is boosted to 96.5%, and the feature spaces of the simulation and experimental datasets are well aligned. This result indicates that the feature distinctions between the simulated and experimental datasets can be mitigated significantly by transfer learning.
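For reference, the alternating optimization in ADDA can be summarized by the two adversarial objectives of [17], restated below. The notation is borrowed from [17] and is not used elsewhere in this paper: $M_s$ and $M_t$ denote the source and target feature extractors (here, the ResNet18 backbones), $D$ the domain discriminator, and $\mathbf{x}_s$, $\mathbf{x}_t$ the simulated and experimental spectrograms.

```latex
\begin{aligned}
\min_{D}\ \mathcal{L}_{\text{adv}_D}
  &= -\mathbb{E}_{\mathbf{x}_s}\big[\log D\big(M_s(\mathbf{x}_s)\big)\big]
     -\mathbb{E}_{\mathbf{x}_t}\big[\log\big(1 - D\big(M_t(\mathbf{x}_t)\big)\big)\big], \\
\min_{M_t}\ \mathcal{L}_{\text{adv}_M}
  &= -\mathbb{E}_{\mathbf{x}_t}\big[\log D\big(M_t(\mathbf{x}_t)\big)\big].
\end{aligned}
```

The first objective trains $D$ to tell source features from target features, while the second updates only $M_t$ (with $M_s$ kept fixed) so that target features are classified as source by $D$. Neither objective requires labels for the target samples.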

VI Conclusion

In this paper, a computer-vision-assisted wireless channel simulator, namely the CASTER simulator, is proposed to generate high-fidelity datasets for hand gesture recognition. In the simulator, the target hand is modeled by 21 ellipsoid primitives, and the ray-tracing method is adopted to calculate the channel impulse responses. Moreover, a video gesture catcher is proposed to capture real motion data of gestures. In the experiments with 5 different gestures, both a real dataset obtained via experiment and a simulated dataset obtained via the CASTER simulator are collected. An accuracy of 83.0% can be achieved in simulation-to-reality inference, i.e., using the simulated and experimental datasets in model training and inference, respectively. Moreover, this accuracy can be boosted to 96.5% by transfer learning, i.e., fine-tuning the gesture recognition model with a few unlabeled real samples.

References

  • [1] Y. Ma, G. Zhou, and S. Wang, “Wifi sensing with channel state information: A survey,” ACM Computing Surveys (CSUR), vol. 52, no. 3, pp. 1–36, jun 2019.
  • [2] J. Liu, H. Liu, Y. Chen, Y. Wang, and C. Wang, “Wireless sensing for human activity: A survey,” IEEE Communications Surveys & Tutorials, vol. 22, no. 3, pp. 1629–1645, thirdquarter 2020.
  • [3] Y. Zhang, Y. Zheng, K. Qian, G. Zhang, Y. Liu, C. Wu, and Z. Yang, “Widar3.0: Zero-effort cross-domain gesture recognition with wi-fi,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 11, pp. 8671–8688, Nov 2021.
  • [4] J. W. Smith, S. Thiagarajan, R. Willis, Y. Makris, and M. Torlak, “Improved static hand gesture classification on deep convolutional neural networks using novel sterile training technique,” IEEE Access, vol. 9, pp. 10 893–10 902, Jan 2021.
  • [5] W. Li, R. J. Piechocki, K. Woodbridge, C. Tang, and K. Chetty, “Passive wifi radar for human sensing using a stand-alone access point,” IEEE Transactions on Geoscience and Remote Sensing, vol. 59, no. 3, pp. 1986–1998, March 2020.
  • [6] H. Sun, L. G. Chia, and S. G. Razul, “Through-wall human sensing with wifi passive radar,” IEEE Transactions on Aerospace and Electronic Systems, vol. 57, no. 4, pp. 2135–2148, Aug 2021.
  • [7] J. Li, C. Yu, Y. Luo, Y. Sun, and R. Wang, “Passive motion detection via mmwave communication system,” in 2022 IEEE 95th Vehicular Technology Conference (VTC2022-Spring). IEEE, 2022, pp. 1–6.
  • [8] R. Du, H. Hua, H. Xie, X. Song, Z. Lyu, M. Hu, Y. Xin, S. McCann, M. Montemurro, T. X. Han et al., “An overview on IEEE 802.11bf: WLAN sensing,” arXiv preprint arXiv:2310.17661, 2023.
  • [9] M. Zhang et al., “Channel models for WLAN sensing systems,” IEEE 802.11 Documents, Sep 2021. [Online]. Available: https://mentor.ieee.org/802.11/documents?isdcn=Meihong
  • [10] G. Li, S. Wang, J. Li, R. Wang, X. Peng, and T. X. Han, “Wireless sensing with deep spectrogram network and primitive based autoregressive hybrid channel model,” in 2021 IEEE 22nd International Workshop on Signal Processing Advances in Wireless Communications (SPAWC).   IEEE, 2021, pp. 481–485.
  • [11] G. Li, S. Wang, J. Li, R. Wang, F. Liu, X. Peng, T. X. Han, and C. Xu, “Integrated sensing and communication from learning perspective: An sdp3 approach,” IEEE Internet of Things Journal, Feb 2023.
  • [12] “Wigig tools.” [Online]. Available: https://github.com/wigig-tools
  • [13] S. Vishwakarma, W. Li, C. Tang, K. Woodbridge, R. Adve, and K. Chetty, “Simhumalator: An open-source end-to-end radar simulator for human activity recognition,” IEEE Aerospace and Electronic Systems Magazine, vol. 37, no. 3, pp. 6–22, March 2021.
  • [14] B. Erol, C. Karabacak, S. Z. Gürbüz, and A. C. Gürbüz, “Simulation of human micro-doppler signatures with kinect sensor,” in 2014 IEEE Radar Conference, 2014, pp. 0863–0868.
  • [15] R. Boulic, N. M. Thalmann, and D. Thalmann, “A global human walking model with real-time kinematic personification,” The visual computer, vol. 6, pp. 344–358, Nov 1990.
  • [16] F. Zhuang, Z. Qi, K. Duan, D. Xi, Y. Zhu, H. Zhu, H. Xiong, and Q. He, “A comprehensive survey on transfer learning,” Proceedings of the IEEE, vol. 109, no. 1, pp. 43–76, Jan 2021.
  • [17] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell, “Adversarial discriminative domain adaptation,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 2962–2971.
  • [18] N. Wheatland, Y. Wang, H. Song, M. Neff, V. Zordan, and S. Jörg, “State of the art in hand and finger modeling and animation,” in Computer Graphics Forum, vol. 34, no. 2.   Wiley Online Library, 2015, pp. 735–760.
  • [19] Z. Cao, G. Hidalgo, T. Simon, S.-E. Wei, and Y. Sheikh, “Openpose: Realtime multi-person 2d pose estimation using part affinity fields,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 1, pp. 172–186, Jan 2021.
  • [20] F. Zhang, V. Bazarevsky, A. Vakunov, A. Tkachenka, G. Sung, C.-L. Chang, and M. Grundmann, “Mediapipe hands: On-device real-time hand tracking,” arXiv preprint arXiv:2006.10214, 2020.
  • [21] J. Romero, D. Tzionas, and M. J. Black, “Embodied hands: Modeling and capturing hands and bodies together,” ACM Transactions on Graphics, (Proc. SIGGRAPH Asia), vol. 36, no. 6, Nov. 2017.
  • [22] E. F. Knott, J. F. Schaeffer, and M. T. Tulley, Radar cross section.   SciTech Publishing, 2004.
  • [23] K. D. Trott, “Stationary phase derivation for rcs of an ellipsoid,” IEEE Antennas Wireless Propag. Lett., vol. 6, pp. 240–243, Jun 2007.
  • [24] E. Marchand, H. Uchiyama, and F. Spindler, “Pose estimation for augmented reality: A hands-on survey,” IEEE Transactions on Visualization and Computer Graphics, vol. 22, no. 12, pp. 2633–2651, Dec 2016.
  • [25] “Opencv: Perspective-n-point (pnp) pose computation.” [Online]. Available: https://docs.opencv.org/4.x/d5/d1f/calib3d_solvePnP.html
  • [26] K. Levenberg, “A method for the solution of certain non-linear problems in least squares,” Quarterly of applied mathematics, vol. 2, no. 2, pp. 164–168, 1944.
  • [27] G. Casiez, N. Roussel, and D. Vogel, “1€ filter: a simple speed-based low-pass filter for noisy input in interactive systems,” in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 2012, pp. 2527–2530.
  • [28] C. de Boor, A Practical Guide to Splines. Springer-Verlag New York, 1978, vol. 27.
  • [29] National Instruments. Usrp-2954. [Online]. Available: https://www.ni.com/en-us/shop/model/usrp-2954.html
  • [30] Sivers IMA. Evk 06002/00. [Online]. Available: https://www.siversima.com/product/evk-06002-00/
  • [31] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
  • [32] L. Van der Maaten and G. Hinton, “Visualizing data using t-sne.” Journal of machine learning research, vol. 9, no. 11, Nov 2008.
  • [33] C. M. Bishop, Pattern Recognition and Machine Learning.   Springer, 2006.