SUPER: Seated Upper Body Pose Estimation using mmWave Radars

Bo Zhang, Zimeng Zhou, Boyu Jiang, Rong Zheng
Department of Computing and Software
McMaster University
Hamilton, ON, Canada
{zhanb59, zhouz287, jiangb11, rzheng}@mcmaster.ca
Abstract

In industrial countries, adults spend a considerable amount of time sedentary each day at work, while driving, and during activities of daily living. Characterizing seated upper body human poses using mmWave radars is an important yet under-studied topic with many applications in human-machine interaction, transportation, and road safety. In this work, we devise SUPER, a framework for seated upper body human pose estimation that utilizes dual mmWave radars in close proximity. A novel masking algorithm is proposed to coherently fuse data from the radars to generate intensity and Doppler point clouds with complementary information for high-motion but small radar cross section (RCS) areas (e.g., upper extremities) and low-motion but large RCS areas (e.g., torso). A lightweight neural network extracts both global and local features of the upper body and outputs pose parameters for the Skinned Multi-Person Linear (SMPL) model. Extensive leave-one-subject-out experiments on various motion sequences from multiple subjects show that SUPER outperforms a state-of-the-art baseline method by 30% to 184%. We also demonstrate its utility in a simple downstream task for hand-object interaction.

Index Terms:
Seated upper body pose estimation, mmWave radars, data fusion, point clouds, deep neural networks

I Introduction

Human pose estimation (HPE) estimates the configuration of human body parts from input data captured by sensors. It has attracted much attention in industry and the research community due to its wide range of applications, including human-machine interaction [1], fitness [2], virtual reality [3], smart homes [4], and smart vehicles [5]. While full-body HPE is important in characterizing joint movements during locomotion, a 2019 study showed that adults ages 20 to 75 in the US reported spending an average of 9.5 hours sedentary each day [6]. Therefore, seated upper body human pose estimation (SUB-HPE) is arguably more relevant in interactive applications and in understanding users’ mental states (e.g., alertness and attention). For example, by monitoring upper limb movements while sitting, novel applications can be developed to empower users to control digital interfaces, manipulate augmented reality environments, and manage smart home systems. SUB-HPE can also find applications in transportation and road safety, where drowsy or inattentive drivers pose a significant risk on roadways. Analyzing head poses, hand placements, and the orientation of the upper body allows the detection of early signs of drowsiness or distraction.

In recent years, rapid advancements in deep learning have led to significant progress in human body modeling [7, 8] and HPE using various sensing modalities. Notable work in HPE includes OpenPose [9] and ViTPose [10] in computer vision, Deep Inertial Poser [11] and IMUPoser [12] using IMU sensors, mm-Pose [13] and mmMesh [14] with mmWave radars, and DensePose [15] using WiFi devices, to name a few. Among different sensing modalities, mmWave radars offer distinct advantages due to their ability to penetrate obstructions like garments or walls, adapt to diverse lighting and weather conditions, and preserve user privacy. Furthermore, the substantial bandwidth (in the GHz range) equips mmWave radars with resilience against noise and interference, and with centimeter-level range resolution. However, existing mmWave-based solutions predominantly target full-body locomotion and are not designed for handling nuanced upper-body movements. mmWave-based SUB-HPE shares common challenges with full-body HPE: low spatial resolution resulting from the few on-board transmitting and receiving antennas on low-end commercial off-the-shelf (COTS) mmWave radars, specular reflections, and variations from inherent micro-body movements. Crucially, it must also handle limited mobility in the upper body’s core area when sitting, as well as the small radar cross-sections (RCS) of upper extremities, ranging from -45 dBsm to -20 dBsm for hands [16].

In this work, we devise SUPER, a framework for Seated Upper Body Pose Estimation using mmWave Radars. The framework encompasses a dual-radar pre-processing and fusion pipeline and a lightweight neural network to predict upper body pose parameters. To increase the spatial resolution of the acquired radar data, two closely positioned radar sensors, oriented perpendicular to each other, are utilized. A novel dual-radar masking algorithm coherently fuses data from the radars to generate two complementary types of point clouds: the intensity point cloud (IPC) and the Doppler point cloud (DPC). The latter captures motion information of extremities, while the former better characterizes low-motion portions of the upper body (e.g., torso areas). Benefiting from the sparse point cloud representation, the lightweight neural network extracts both global and local features of the upper body. Finally, the Skinned Multi-Person Linear (SMPL) model is applied to yield realistic human body poses and motions. An example of the data captured by an RGB camera and a motion capture system, together with the predicted and ground truth poses, can be found in Figure 1.

Figure 1: The estimated skeleton model from SUPER vs. ground truth when a subject raises her/his hand while seated. (a) Camera view. (b) OptiTrack motion capture view. (c) SUPER output. The blue circle markers stand for the estimated skeleton model, and the red plus markers are the corresponding ground truth.

We have implemented a prototype of SUPER utilizing two Texas Instruments IWR6843ISK mmWave radars (demonstration videos can be found at https://super-2023-web.github.io/SUPER/). A diverse group of 10 subjects, encompassing different genders, ages, and body mass indices (BMIs), was recruited for data collection in a laboratory setting. The data collection process involved subjects performing predefined arm, head, and torso motion sequences. Experiment results show that SUPER consistently outperforms a state-of-the-art (SOTA) baseline method and achieves 112 mm in average Mean Per Joint Position Error (MPJPE) and 15.89 mm in Procrustes-aligned MPJPE (PA-MPJPE) in leave-one-subject-out trials. To demonstrate the utility of SUPER, we also implement and evaluate a simple downstream task of hand-object interaction.

In summary, we make the following new contributions toward mmWave-based fine-grained SUB-HPE in this work.

  • We investigate a new task, i.e., SUB-HPE, and collect a dataset consisting of various head, torso, and arm motions using mmWave radars.

  • The proposed framework, SUPER, utilizes the intensity information from multi-antenna radar systems to characterize the spatial occupation of the human body under low mobility, and Doppler information to capture motions of extremities.

  • We demonstrate the feasibility of deploying two asynchronous but closely located mmWave radars to improve spatial resolution. A novel masking algorithm is proposed to coherently fuse data from both radars.

  • SUPER has been evaluated using different motion sequences and data from a diverse set of users and shows superior performance compared to a SOTA baseline method.

The rest of the paper is organized as follows. A review of recent developments in mmWave-based HPE methods and public datasets is presented in Section II. In Section III, we introduce the proposed pipeline and key techniques. Section IV describes the experiment setups and the dataset we build. Detailed results and system performance are provided in Section V. Section VI demonstrates the potential of the proposed system through a downstream task. Finally, we discuss the limitations of the work and conclude this paper in Section VII.

II Related work

FMCW radars, as an emerging technology, have attracted significant attention and have been investigated in a variety of sensing tasks, e.g., tracking and localization [17, 18, 19], gesture recognition [20, 21, 22, 23], and vital sign monitoring [24, 25, 26, 27]. In this section, we focus on mmWave-based HPE methods and public datasets.

II-A MmWave-based human pose estimation

In [13], Sengupta et al. present mm-Pose, which is among the first works in mmWave-based full-body HPE. mm-Pose projects radar point clouds from two separate and perpendicularly oriented radars onto the depth-azimuth (XY) and depth-elevation (XZ) planes, respectively, to create two 2D intensity images. The images are then fed into a forked CNN structure to predict the human skeletal joints. In [28], An et al. propose the MARS system, which takes 5D radar point clouds (x, y, z, intensity, and Doppler) as input and outputs human poses in several rehabilitation scenarios. Xue et al. [14] introduce mmMesh, which adopts PointNet [29] as the feature extractor of the point cloud and incorporates SMPL [7] into this task, facilitating both body shape and pose predictions. In a follow-up work to [14], multi-subject 3D human mesh construction is investigated [30]. This is achieved by obtaining the location information from an energy map and selectively generating 4D point clouds close to the subjects. A fine-grained human mesh is then predicted using a coarse-to-fine mesh estimation framework. Most recently, instead of using radar point clouds, Lee et al. [31] introduce the velocity-specific range-Doppler-azimuth-elevation map (VRDAEMap) as the input and develop a cross-modality training framework that fuses multi-scale radar features using a Cross- and Self-Attention Module (CSAM), and further refines the predicted key points through a Pose Refinement Graph Convolutional Network (PRGCN).

The aforementioned works on mmWave-based HPE differ in the number of devices used for data collection, data representation (point clouds vs. images), and the backbone neural network architecture. However, none considers SUB-HPE, where there is typically limited trunk and lower limb mobility. A summary of the key aspects of these methods can be found in Table I.

TABLE I: Comparison of existing works on mmWave-based HPE
Method | Radar Sensor | Ground Truth Sensor | Data Representation | Body Motions
mm-Pose [13] | 2 TI AWR1642 | Microsoft Kinect | Two 2D intensity images (XY plane and XZ plane) | Walking, left-arm swing, right-arm swing, both-arms swing
MARS [28] | 1 TI IWR1443 | Microsoft Kinect | 5D point cloud (x, y, z, velocity, intensity) | 10 rehabilitation movements¹
mmMesh [14] | 1 TI AWR1843 | VICON system | 6D point cloud (x, y, z, range, velocity, energy) | 8 daily activities²
$m^{4}esh$ [30] | 1 TI AWR1843 | VICON system | 6D point cloud (x, y, z, range, velocity, energy) | 7 daily activities³, freely performed by multiple persons†
HuPR [31] | 2 TI IWR1843 | RGB camera | VRDAEMap (velocity-specific range-Doppler-azimuth-elevation map) | Static actions, standing and waving hand(s), walking with waving hand(s)
mmBody [32] | Arbe Robotics Phoenix | MoCap system | 6D dense point cloud (x, y, z, velocity, amplitude, energy) | 100 motions
Ours | 2 TI IWR6843 | OptiTrack system | Intensity point cloud, Doppler point cloud | Upper limb movements, head rotation, driving simulation
  • 1 Right/left/both limb extension, right/left side lunge, right/left front lunge, right/left upper body extension, squat.

  • 2 Torso rotations, clockwise walking, counter-clockwise walking, arm swing, walking back and forth, walking back and forth with arm swing, walking in place, lunges.

  • 3 Walking in circles, walking back and forth in a straight line, picking up the phone from the desk, putting down the phone on the desk, answering phone calls while walking, playing with the cell phone while sitting on the chair, sitting on the chair and standing up from the chair.

  • † Freely performed by multiple persons in one recording.

II-B Public mmWave-based HPE datasets

Very few public datasets are currently available for mmWave-based HPE. In [28], the authors release the MARS dataset, which contains radar point clouds and annotations obtained using a Microsoft Kinect V2 sensor. Chen et al. propose mmBody [32], a multi-scenario RGBD-paired mmWave radar (Arbe Robotics Phoenix) point cloud dataset for human pose reconstruction, with 3D ground truth provided by a motion capture system. The work in [31] also provides the HuPR dataset, which contains raw radar data together with 3D annotations generated from a synchronized RGB camera.

With the exception of HuPR, the aforementioned public datasets only contain intermediate representations of the radar data, e.g., point clouds. The lack of raw data greatly limits innovations on radar signal processing algorithms and consequently affects the informativeness of training data to the HPE models. Another limitation of some datasets (e.g., HuPR and MARS) lies in the absence of accurate ground truth due to the use of RGB or RGB-D inputs for annotations.

III Methodology

SUPER considers the problem of estimating upper body human poses when a subject faces mmWave radar sensors at a known distance. The assumption for known distance is valid in confined environments such as in an office cubicle or inside a vehicle. Alternatively, existing approaches for mmWave-based target localization can be adopted to determine a bounding box around the subject [33]. In this section, we first provide an overview and the design rationale of the SUPER pipeline and then present details of its individual components.

III-A Overview and Design Rationale

Figure 2: A 2-dimensional MIMO antenna array for the IWR6843ISK radar. The separation $d$ equals half a wavelength.

Low-end COTS mmWave radars typically have a small number of Tx and Rx antennas, which restricts their spatial resolution. Take the TI IWR6843ISK radar as an example. It features 3 Tx and 4 Rx antennas forming a 12-element virtual antenna array, as illustrated in Figure 2. Placed horizontally, this configuration results in angle resolutions of 15 degrees and 55 degrees in the horizontal and vertical directions, respectively. For fine-grained SUB-HPE, a high azimuth angle resolution is necessary for extremities when the arms are extended, while a high elevation angle resolution is helpful in distinguishing subtle head and trunk poses. To mitigate the limitations of low-end mmWave radars, we employ two closely located radar sensors: one oriented horizontally and the other vertically. Despite the lack of coordination, the reflected wave from one radar’s transmission is unlikely to be mistaken for that from the other radar, since the resulting range bins fall outside the region of interest (ROI). Note that although dual-radar systems have also been employed in mm-Pose [13] and HuPR [31], there the data is used to produce 2D heatmaps (images) in perpendicular planes rather than being fused together into 3D point clouds.

Several existing mmWave-based HPE methods model the human body as a point cloud, which is obtained from range-Doppler maps over multiple chirps of radar signals. Doppler information has sufficient coverage of the entire body only if there are significant motions in different body parts. In seated positions, however, movements in the trunk and lower limbs are confined, leading to sparse points in space. In contrast, the intensity of reflected signals from the bulk of the body is high regardless of motion, as long as the subject is sufficiently close to the radars. Thus, a range-angle map, augmented with intensity information from a multi-antenna system, better captures the occupation of the human body in space. Motivated by this observation and with the unique characteristics of SUB-HPE in mind, we extract two point clouds, one with reflected intensity and one with Doppler information. The ablation study in Section V provides empirical evidence for the complementary nature of the two input sources.

Figure 3: The system diagram of SUPER. New processing blocks introduced in this paper are highlighted in orange, and intermediate data flows are highlighted in blue.

The overall system diagram of SUPER, shown in Figure 3, consists of two main processing blocks, i.e., point cloud generation and a backbone network. The reflected RF signals from the two radar sensors are preprocessed using matched filtering and range-FFT. Dense point clouds are then generated by sampling the ROIs in 3D space centred around each radar. A dual-radar fusion algorithm coherently combines data from the two radars and samples the results to produce fine-grained point cloud representations for intensity and for Doppler. Both point clouds are fed into the backbone network. The network comprises building blocks from PointNet [34], PointNet++ [35], and LSTM to extract global and local features and predict the SMPL pose parameters in each frame. The pipeline can be easily extended to predict body shape parameters, which will be investigated as part of future work.

III-B Point cloud generation with dual-radar fusion

Figure 4: Generation of dense point clouds from raw radar data. One intensity point cloud and one Doppler point cloud are produced for each radar separately.

In this section, we introduce a novel pipeline to generate quality point clouds from data collected by two closely located radars. Data from each radar goes through separate branches to handle intensity and Doppler information. The overall processing consists of two stages: the first stage transforms raw radar data to a dense point cloud, which acts as an intermediate representation. In the second stage, data from the two radar sensors are fused together and then sampled to produce a fine-grained point cloud.

III-B1 Dense point cloud generation

Raw I-Q samples from each radar at intermediate frequency (IF) follow the standard pre-processing steps. These include mapping the raw radar data into a range map through range-FFT, and DC compensation to eliminate static background clutter. As previously mentioned, SUPER operates under the assumption that the approximate distance between the subject and the radars is known. This knowledge enables the designation of an ROI that encompasses the subject. For example, when the subject is seated around 1 meter away from the radars, the range bins that span the subject’s body are approximately from 0.4 meters to 1.8 meters. These parameters can be easily adjusted to suit the setup of different scenarios.
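The pre-processing steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `range_profile_roi`, the Hann window, the `range_res_m` parameter, and the choice to realize DC compensation as a per-bin mean subtraction over chirps are all assumptions made for the example.

```python
import numpy as np

def range_profile_roi(iq, range_res_m, r_min_m=0.4, r_max_m=1.8):
    """Range-FFT, static clutter removal, and ROI cropping for one frame.

    iq: complex array of shape (chirps, antennas, samples) holding IF samples.
    range_res_m: range covered by one FFT bin (hypothetical input; it depends
                 on the effective swept bandwidth of the radar configuration).
    Returns (profile, ranges) restricted to the ROI [r_min_m, r_max_m].
    """
    n_samples = iq.shape[-1]
    window = np.hanning(n_samples)  # assumption: Hann window to limit leakage
    # Range-FFT along the fast-time (sample) axis.
    spectrum = np.fft.fft(iq * window, axis=-1)
    # Assumed form of DC compensation: subtract the mean over chirps in each
    # range bin, which removes returns that do not change within the frame.
    spectrum = spectrum - spectrum.mean(axis=0, keepdims=True)
    # Keep only the range bins that fall inside the ROI.
    ranges = np.arange(n_samples) * range_res_m
    keep = (ranges >= r_min_m) & (ranges <= r_max_m)
    return spectrum[..., keep], ranges[keep]
```

Changing `r_min_m` and `r_max_m` corresponds to adjusting the ROI for a different seating distance.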

Intensity point clouds

To generate intensity point clouds, we further consider a 180-degree field of view (FOV) in both horizontal and vertical directions and choose a non-uniform sampling scheme as shown in Table II. Specifically, for the radar placed horizontally (radar H), $\theta$ and $\phi$ correspond to the azimuth and elevation angles; for radar V, the reverse is true. As indicated in Table II, angles are densely sampled along the axis where more virtual antennas are available and near the center, while in the perpendicular direction, fewer angle bins are sampled. Consequently, across the 30 range bins between 0.4 meters and 1.8 meters, there are in total 6930 ($= 21 \times 11 \times 30$) sample points in the ROI.

TABLE II: Non-uniform Angle Sampling (unit: degrees)
$\theta$ | -70, -60, -50, -40, -30, -25, -20, -15, -10, -5, 0, 5, 10, 15, 20, 25, 30, 40, 50, 60, 70
$\phi$ | -70, -50, -30, -20, -10, 0, 10, 20, 30, 50, 70
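As a quick check of the sample count, the grid in Table II can be enumerated directly. The sketch below is illustrative only: the names `THETA_DEG`, `PHI_DEG`, and `roi_grid` are hypothetical, and the range bins are assumed to be evenly spaced between 0.4 m and 1.8 m.

```python
import numpy as np

# Angle bins copied from Table II (degrees).
THETA_DEG = np.array([-70, -60, -50, -40, -30, -25, -20, -15, -10, -5, 0,
                      5, 10, 15, 20, 25, 30, 40, 50, 60, 70])
PHI_DEG = np.array([-70, -50, -30, -20, -10, 0, 10, 20, 30, 50, 70])
N_RANGE_BINS = 30

def roi_grid(r_min_m=0.4, r_max_m=1.8):
    """Return all (theta, phi, range) sample points as an array of shape (6930, 3)."""
    # Assumption: range bins are evenly spaced across the ROI.
    ranges = np.linspace(r_min_m, r_max_m, N_RANGE_BINS)
    t, p, r = np.meshgrid(THETA_DEG, PHI_DEG, ranges, indexing="ij")
    return np.stack([t.ravel(), p.ravel(), r.ravel()], axis=1)
```

The grid has 21 x 11 x 30 = 6930 rows, matching the count stated in the text.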

Next, we apply Minimum Variance Distortionless Response (MVDR) to generate an intensity spectrum for each point location in the ROI. We first estimate the correlation matrix for each range index $i$, using all $N$ chirps within one frame,

$$R_i = \frac{\sum_{n=1}^{N} \mathbf{y}\mathbf{y}^{H}}{N},$$
$$R_i = R_i + \alpha \frac{\mathrm{trace}(R_i)}{K} I_K,$$

where $\mathbf{y}$ is a column vector of the received signal at each antenna, $N$ is the number of chirps in one frame, $K$ is the number of receive antennas, and $\alpha$ is a control parameter to prevent singularity.

Next, we calculate the steering vector $\mathbf{a_s}$ from the virtual antenna array as

$$\mathbf{a_s}(n) = \begin{cases} \exp(j\pi(\mu_a (n-1))), & 1 \leq n \leq 8, \\ \exp(j\pi(\mu_a (n-6-1) + \mu_b)), & 9 \leq n \leq 12, \end{cases}$$

where

$$\mu_a = \sin(\theta\pi/180)\cos(\phi\pi/180),$$
$$\mu_b = \sin(\phi\pi/180).$$

Finally, we calculate the intensity spectrum for each sample point as

$$IS(\theta, \phi, i) = \frac{1}{\mathbf{a_s}^{H} R_i^{-1} \mathbf{a_s}},$$

where $\mathbf{a_s}$ is the steering vector, and $i$ is the range index. This process creates a 4D point cloud with intensity values in polar coordinates, which can then be transformed into a dense point cloud in a Cartesian coordinate system centered on a radar.
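The MVDR computation for a single range bin can be sketched in NumPy as below. This is an illustration under stated assumptions rather than the authors' code: the function names, the snapshot layout `(chirps, antennas)`, and the default loading factor `alpha=0.03` are hypothetical, and the steering vector simply transcribes the piecewise definition given in the text.

```python
import numpy as np

def steering_vector(theta_deg, phi_deg):
    """Steering vector of the 12-element virtual array for angles in degrees."""
    mu_a = np.sin(np.deg2rad(theta_deg)) * np.cos(np.deg2rad(phi_deg))
    mu_b = np.sin(np.deg2rad(phi_deg))
    n = np.arange(1, 13)
    # Elements 1..8 use the first branch; elements 9..12 use the second branch.
    phase = np.where(n <= 8, mu_a * (n - 1), mu_a * (n - 6 - 1) + mu_b)
    return np.exp(1j * np.pi * phase)

def mvdr_spectrum(Y, thetas_deg, phis_deg, alpha=0.03):
    """Intensity spectrum over an angle grid for one range bin.

    Y: complex array (N chirps, K antennas); each row is one snapshot y^T.
    alpha: diagonal loading factor (hypothetical default).
    """
    n_chirps, k = Y.shape
    # Sample correlation matrix: (1/N) * sum over chirps of y y^H.
    R = Y.T @ Y.conj() / n_chirps
    # Diagonal loading scaled by the average eigenvalue keeps R invertible.
    R = R + alpha * np.trace(R).real / k * np.eye(k)
    R_inv = np.linalg.inv(R)
    spec = np.empty((len(thetas_deg), len(phis_deg)))
    for i, th in enumerate(thetas_deg):
        for j, ph in enumerate(phis_deg):
            a = steering_vector(th, ph)
            # np.vdot conjugates its first argument, giving a^H R^{-1} a.
            spec[i, j] = 1.0 / np.real(np.vdot(a, R_inv @ a))
    return spec
```

Calling `mvdr_spectrum` once per range index $i$ with the angle bins of Table II yields the full set of intensity values over the ROI.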

Doppler point clouds

To generate Doppler point clouds, we follow a similar procedure to that in [14]. Specifically, Doppler-FFT on the chirps in a frame is applied to derive 2D range-Doppler maps ($30 \times 128$) for each receive antenna. For every point in the 2D range-Doppler map, its velocity and power are calculated through an additional angle-FFT across multiple receive antennas. The procedure is applied to data from the two radars independently, resulting in a 5D point cloud of 3840 ($= 128 \times 30$) points for each radar.
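The Doppler-FFT step can be illustrated with the short sketch below, which covers only the range-Doppler map and leaves out the subsequent angle-FFT. The function name, the input layout, and the Hann window are assumptions for the example and are not taken from [14] or from the authors' code.

```python
import numpy as np

def range_doppler_map(range_profile):
    """Doppler-FFT across the chirps of one frame.

    range_profile: complex array (chirps, antennas, range_bins), e.g. the
                   ROI-cropped output of a range-FFT with 128 chirps and
                   30 range bins.
    Returns an array (antennas, range_bins, doppler_bins) whose Doppler axis
    is shifted so that zero velocity sits at the center bin.
    """
    n_chirps = range_profile.shape[0]
    window = np.hanning(n_chirps)[:, None, None]  # assumption: Hann window
    # FFT along the slow-time (chirp) axis, then center zero Doppler.
    rd = np.fft.fftshift(np.fft.fft(range_profile * window, axis=0), axes=0)
    return np.transpose(rd, (1, 2, 0))
```

With 128 chirps and 30 range bins, each antenna yields a 30 x 128 map, i.e., the 3840 candidate points mentioned above.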

It is worth noting that the term “dense” is adopted to differentiate this representation from the eventual fused point clouds. While the point clouds in this initial stage remain relatively sparse when compared to those generated by Lidar sensors, they are denser than the point clouds typically found in existing literature on mmWave-based HPE. This increased density is achieved through spatial oversampling in the intensity point clouds. Further information regarding the process is illustrated in Figure 4.

III-B2 Dual-radar fusion for fine-grained point clouds

Figure 5: Generation of fine-grained point clouds by fusing and sampling dense point clouds from the two radars.

So far, we have generated four point clouds, i.e., one 4D intensity point cloud and one 5D Doppler point cloud from each radar. The two radar sensors are positioned in close proximity, approximately 15 cm apart. Thus, the dense point clouds generated by each radar sensor roughly share the same ROI but are spatially complementary. Radar H captures detailed information in the horizontal direction, which can be used to enhance the quality of the point cloud derived from radar V, and vice versa. Therefore, the purpose of dual-radar fusion is twofold. First, it refines the point clouds from one radar using the point clouds from the other radar. Second, it trims the over-sampled point clouds and retains only salient points. At the end of the procedure, a single intensity point cloud and a single Doppler point cloud are obtained for further processing. An overview of this process is given in Figure 5.

Masked refinement

To refine the point clouds from both radars, we first transform their representations from polar coordinate frames to a unified Cartesian coordinate frame. Consider the 4D intensity point cloud from radar H as an example. A similar procedure is applied to the intensity point cloud from radar V and the 5D Doppler point clouds from both radars. Let the point cloud from radar H be the target and that from radar V serve as the reference. For each point in the target point cloud, the $K$ nearest points in the reference point cloud are identified, and the mean power value of these points is computed. The value of the point in the target point cloud is then replaced by the product of itself and this mean value. This multiplication has the effect of masking or suppressing points with high values in only one point cloud and amplifying those with high values in both. Furthermore, the operation preserves local power variations, as the masks within the same local area are nearly identical.
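The masking step described above can be sketched as follows. The function name, the brute-force nearest-neighbour search, and the default `k=4` are assumptions for illustration; the text does not state the value of $K$ or how neighbours are searched.

```python
import numpy as np

def masked_refinement(target_xyz, target_val, ref_xyz, ref_val, k=4):
    """Scale each target value by the mean value of its k nearest reference points.

    target_xyz: (n_t, 3) Cartesian coordinates of the target point cloud.
    target_val: (n_t,) intensity or power of the target points.
    ref_xyz, ref_val: the same quantities for the reference point cloud.
    k: number of neighbours (hypothetical default); must not exceed n_ref.
    """
    # Pairwise squared distances between target and reference points.
    diff = target_xyz[:, None, :] - ref_xyz[None, :, :]
    d2 = (diff ** 2).sum(axis=-1)
    # Indices of the k closest reference points for every target point.
    nearest = np.argpartition(d2, k - 1, axis=1)[:, :k]
    # The mask is the mean reference value over those neighbours.
    mask = ref_val[nearest].mean(axis=1)
    return target_val * mask
```

The dense distance matrix is only practical for small examples; for point clouds with thousands of points, a spatial index such as a k-d tree would be a more memory-friendly choice.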

Point cloud trimming

Due to spatial over-sampling, the dense point clouds produced thus far contain redundant information. To retain only informative points, we extend the principles of the 2D Constant False Alarm Rate (CFAR) algorithm [36] and implement a 3D CFAR algorithm that adaptively calculates thresholds to detect local peaks as key points. Finally, for each radar, we output the top 256 key points from the intensity point cloud and the top 64 key points from the Doppler point cloud.
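The text does not spell out which CFAR variant is used, so the sketch below should be read as one plausible interpretation: a cell-averaging CFAR on a regular 3D grid followed by top-k selection. The guard and training sizes, the threshold scale, and the function names are all hypothetical. It relies on `sliding_window_view`, which is available in NumPy 1.20 and later.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def box_sum(vol, half):
    """Sum of each (2*half+1)^3 neighbourhood, using edge padding at the borders."""
    w = 2 * half + 1
    padded = np.pad(vol, half, mode="edge")
    return sliding_window_view(padded, (w, w, w)).sum(axis=(-3, -2, -1))

def cfar_3d_top_k(vol, guard=1, train=2, scale=3.0, k=256):
    """Cell-averaging CFAR on a 3D volume; returns indices of the k strongest detections.

    vol: non-negative 3D array (e.g. intensity on a theta x phi x range grid).
    guard, train: half-widths of guard and training regions (assumed values).
    scale: threshold multiplier applied to the local noise estimate (assumed).
    """
    outer = box_sum(vol, guard + train)
    inner = box_sum(vol, guard)
    n_train = (2 * (guard + train) + 1) ** 3 - (2 * guard + 1) ** 3
    # Local noise level: mean over training cells (outer box minus guard box).
    noise = (outer - inner) / n_train
    detected = vol > scale * noise
    idx = np.argwhere(detected)
    # Rank detections by value, strongest first, and keep at most k of them.
    order = np.argsort(vol[detected])[::-1][:k]
    return idx[order]
```

Setting `k=256` or `k=64` mirrors the key point budgets quoted for the intensity and Doppler point clouds.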

Following the extraction of key points, we merge the point clouds from both radars and apply a Gaussian normalization filter to the values. The final fine-grained point clouds consist of 512 key points with attributes $[x, y, z, intensity]$ for intensity, and 128 key points with attributes $[x, y, z, power, velocity]$ for Doppler. An example of the fine-grained point clouds generated by this process is shown in Figure 6. In this example, the subject raises their right hand to the top. As is evident from Figure 6(a), the intensity points are present not only around the raised arm but also in other areas of the upper body. In contrast, as shown in Figure 6(b), the Doppler points mainly appear around the rising arm, which has non-negligible velocity.

Figure 6: An example of fine-grained point clouds. (a) An intensity point cloud. (b) A Doppler point cloud. The ground truth skeleton is shown in red. The magnitude and direction of the Doppler velocity are shown as arrows.

III-C The deep neural network backbone

Figure 7: The architecture of the deep neural network backbone.

A deep neural network (Figure 7) is designed to take multiple frames of fine-grained point clouds as input and predict joint positions in a human skeleton model. The network incorporates both global and local contexts to estimate the intricate translation and rotation dynamics. To capture the global context, we include a dedicated branch that stacks three basic PointNet blocks [34]. To extract local information, three hierarchical set abstraction layers from PointNet++ are stacked to process both the intensity and Doppler point clouds [35].

Furthermore, to exploit the temporal dependencies between frames, two layers of unidirectional Long Short-Term Memory (LSTM) cells are used [37], spanning $T = 20$ steps or frames (equivalent to one second). To enhance information flow, a skip/residual link is introduced that connects the features before and after the LSTM layers. Finally, after several fully connected (FC) layers, the model outputs the rotation of each joint within the human skeleton model. To improve the accuracy of rotation estimation, following [38], we represent joint rotations using 6D parameters of the rotation matrices rather than 3D axis angles.
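For readers unfamiliar with the 6D rotation representation, the sketch below shows one common way to map a 6D vector back to a rotation matrix via Gram-Schmidt orthogonalization. It assumes the 6 numbers are the first two columns of the rotation matrix; whether [38] or the authors' network uses exactly this layout is not stated in the text, and the function name is hypothetical.

```python
import numpy as np

def rotation_6d_to_matrix(x6):
    """Map 6D rotation parameters of shape (..., 6) to rotation matrices (..., 3, 3).

    Assumption: x6[..., :3] and x6[..., 3:] are (possibly unnormalized)
    estimates of the first and second columns of the rotation matrix.
    """
    a1, a2 = x6[..., :3], x6[..., 3:6]
    # First basis vector: normalize a1.
    b1 = a1 / np.linalg.norm(a1, axis=-1, keepdims=True)
    # Second basis vector: remove the component of a2 along b1, then normalize.
    a2_orth = a2 - (b1 * a2).sum(axis=-1, keepdims=True) * b1
    b2 = a2_orth / np.linalg.norm(a2_orth, axis=-1, keepdims=True)
    # Third basis vector completes a right-handed frame.
    b3 = np.cross(b1, b2)
    return np.stack([b1, b2, b3], axis=-1)
```

Because any non-degenerate 6D vector maps to a valid rotation, the network output needs no extra constraint, which is one reason this representation tends to be easier to regress than axis angles.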

The model subsequently leverages SMPL to generate the final joint positions. A gender-neutral model is used by fixing the default shape parameters. For seated upper body poses, we freeze the rotation parameters of joints in the lower body and only estimate the positions of the upper body joints (14 joints) [7].

The loss function is defined as the mean square error (MSE) of the joint coordinates:

$$Loss = \frac{1}{F}\sum_{f=1}^{F} \left\| P_{f,J}^{(f)} - P_{gt,J}^{(f)} \right\|_{2}, \qquad (1)$$

where $f$ is the frame index, $F$ is the total number of frames in the batch, $J$ denotes the joint set, $P_{f,J}^{(f)}$ is the estimated positions of key joints, and $P_{gt,J}^{(f)}$ is the corresponding ground truth positions. Note that the loss is a function of the pose parameters ($\beta$) and global translation ($t$). From the experiments, we find that, instead of directly regressing the joint positions, passing the joint rotation parameters through SMPL and penalizing the resulting joint position errors yields higher accuracy and faster convergence. This can be interpreted as a non-linear transformation of the MSE loss function using the SMPL model.
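A literal NumPy reading of Eq. (1) is sketched below. It assumes the norm in each frame is taken over all joint coordinates jointly, which is one interpretation of the notation; the function name is hypothetical, and in training the same expression would be written in an automatic differentiation framework so that gradients flow back through SMPL.

```python
import numpy as np

def joint_position_loss(pred, gt):
    """Mean over frames of the L2 norm of the joint position error.

    pred, gt: arrays of shape (F, J, 3) with estimated and ground truth
              joint positions for F frames and J joints.
    """
    n_frames = pred.shape[0]
    # Flatten the (J, 3) error of each frame and take its Euclidean norm.
    per_frame = np.linalg.norm((pred - gt).reshape(n_frames, -1), axis=1)
    return per_frame.mean()
```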

The network has 2.9 million learnable parameters and a computational cost of 2.65 GFLOPs. Incoming point clouds are processed in a sliding-window manner with a window size of 20 frames.
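The sliding-window processing can be illustrated with a short sketch. The window size of 20 frames comes from the text; the stride of one frame and the helper name `sliding_windows` are assumptions made for illustration only.

```python
def sliding_windows(frames, window_size=20, stride=1):
    """Yield consecutive windows of `window_size` frames (stride is an assumption)."""
    for start in range(0, len(frames) - window_size + 1, stride):
        yield frames[start:start + window_size]

frames = list(range(25))  # stand-in for 25 point cloud frames
windows = list(sliding_windows(frames))
print(len(windows))                      # 6
print(windows[0][0], windows[-1][-1])    # 0 24
```

At the 20 Hz frame rate reported in Section IV-A, a 20-frame window corresponds to one second of data.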

IV Implementation and Datasets

In this section, we present the implementation of a prototype SUPER system using COTS mmWave radars and the experimental results from multi-subject testbed evaluations under various conditions, which are purposely chosen to closely mimic real-life situations.

IV-A Implementation

Figure 8: Experiment setup: markers and radars. (a) Marker placement: front and back. (b) Co-located radar sensors.

Two IWR6843ISK boards [39] together with DCA1000EVM boards [40] are used in the experiments. The radar boards operate at 60–64 GHz (with 4 GHz bandwidth) and transmit FMCW signals. Each radar front-end includes 3 transmit antennas (Tx) and 4 receive antennas (Rx), with 120° azimuth field of view (FoV) and elevation FoV. The 3 transmit antennas emit FMCW chirps in a time-division manner, which results in a virtual array of 12 antennas. Each FMCW chirp is composed of 225 sampling points, and the RF frequency increases from 60 GHz to 64 GHz during the chirp. 128 chirps constitute one frame, at a frame rate of 20 Hz. The acquired raw IF signal is sent to a host PC via Ethernet, where mmWave Studio [41] is used to initiate, configure, and control the radar boards. The detailed radar sensor settings are summarized in Table III.

TABLE III: Radar Hardware Settings.

| parameter | description | value |
| $N_{tx}$ | number of transmit antennas | 3 |
| $N_{rx}$ | number of receive antennas | 4 |
| $N_{virtual}$ | number of virtual antennas | 12 |
| $P_{f}$ | frame duration | 50 ms |
| $f_{s}$ | start frequency | 60 GHz |
| $f_{e}$ | end frequency | 64 GHz |
| $t_{rs}$ | start ramp time | 0 μs |
| $t_{re}$ | end ramp time | 58 μs |
| $t_{idle}$ | chirp idle time | 7 μs |
| $N_{adc}$ | number of samples per chirp | 225 |
| $N_{chirp}$ | number of chirps per frame | 128 |
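Several quantities follow directly from the settings in Table III. The sketch below derives them; the variable names are illustrative. The range resolution uses the standard FMCW relation c / (2B) and assumes the full 4 GHz sweep is sampled, so it should be read as a best-case figure rather than the value achieved by the system.

```python
C = 299_792_458.0  # speed of light (m/s)

# Values taken from Table III.
n_tx, n_rx = 3, 4
f_start_hz, f_end_hz = 60e9, 64e9
ramp_time_s = 58e-6
idle_time_s = 7e-6
frame_rate_hz = 20

n_virtual = n_tx * n_rx                           # TDM-MIMO virtual array size
bandwidth_hz = f_end_hz - f_start_hz              # swept bandwidth
range_resolution_m = C / (2 * bandwidth_hz)       # best case, full sweep sampled
chirp_slope_mhz_per_us = (bandwidth_hz / 1e6) / (ramp_time_s * 1e6)
chirp_period_us = (ramp_time_s + idle_time_s) * 1e6
frame_duration_ms = 1e3 / frame_rate_hz

print(n_virtual)                          # 12
print(round(range_resolution_m, 4))       # 0.0375
print(round(chirp_slope_mhz_per_us, 2))   # 68.97
print(round(chirp_period_us, 1))          # 65.0
print(frame_duration_ms)                  # 50.0
```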

The preprocessing steps and point cloud generation are implemented in MATLAB R2021a, which takes raw IF signals as input, and outputs the fine-grained 3D point cloud data. The neural network backbone is implemented in PyTorch.

IV-B Data collection procedure

To evaluate SUPER’s performance, we recruited 10 participants (3 females and 7 males), aged between 21 and 46, with BMI in the range of 18.1–31.6. Participants wore their daily attire, such as T-shirts, blouses, and sweaters of different fabric materials. The research protocol was approved by the research ethics board (REB) of our institution.

Both radar and mocap data are collected in a 6.5 m × 6 m lab. The lab (Figure 9) has standard office furniture, many pieces of electronic equipment, and wireless transceivers (WiFi, LTE, Bluetooth, etc.). Both radar sensors are mounted on a tripod, as in Figure 8(b), at a height of 1.5 meters and 1 meter away from the subjects, and are oriented at a 20-degree horizontal angle. We define a local coordinate system with respect to radar H. During the experiments, only one subject is present, at the predefined position.

Ground truth subject poses are collected with OptiTrack, a motion capture system [42] with 12 cameras. The radar sensors and the OptiTrack system are synchronized at frame level after data collection, using “synchronization” motions at the beginning of each trial. The output of the OptiTrack system consists of the coordinates of markers and rigid bodies placed on the participants, as shown in Figure 8(a). We utilize MotionBuilder [43] to build a customized human actor for each participant and generate accurate joint coordinates through the motion tracking functionalities built into the software. Videos were recorded during data collection for reviewing purposes but are not further processed.

Figure 9: The lab environment for data collection.

During the data collection process in the controlled laboratory environment, subjects engaged in three distinct motion sequences that are designed to mimic movements while seated in confined environments. These include: hand-reaching, driving, and head rotation. A Microsoft Xbox Gaming steering wheel is used to mimic a driving platform and is placed in front of the subjects.

  • Hand-reaching trials: Participants were instructed to use their right hand to interact with hypothetical objects in their surroundings while keeping their left hand stationary. These trials included interacting with objects positioned directly above one’s head (top), in the top front (up-front), in front but to the side (right-front), to the side (right), and below (bottom).

  • Driving trials: These trials aimed to replicate common driving activities. Subjects were instructed to perform routine driving (with both hands on the wheel), conduct traffic checks (by leaning forward and inspecting both left and right directions), engage in a conversation with a passenger (rotating the head towards the passenger), execute reverse maneuvers (turning the head to look back over the shoulder), and operate the control panel (reaching the right-front area and virtually pressing buttons with one’s right hand).

  • Head rotation trials: These trials capture deliberate head movements while keeping one’s torso mostly stationary. Subjects were instructed to look left and right, up and down, and upper/lower left/right, etc.

IV-C The dataset

In total, we conducted 30 trials from 10 participants, each lasting around 10 minutes. The total number of radar frames collected is around 360,000 from each radar sensor. The total size of all raw radar data in the dataset is around 900 GB. The ground truth data for each frame contains the joint angles and positions of 14 upper body key joints (footnote 2: the local body joints are set to fixed sitting poses) and the global translations. The total size of the ground truth data is around 900 MB. The dataset is organized by subject ID (de-identified), trial name, and data type (radar data vs. ground truth data).
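The frame count quoted above is consistent with the trial count, trial length, and frame rate, as the following arithmetic sketch shows. The average size per frame is only a rough estimate under two assumptions that the text does not state: that the 900 GB figure covers both radars, and that GB means 10^9 bytes.

```python
n_trials = 30
trial_minutes = 10        # approximate duration of each trial
frame_rate_hz = 20        # from Section IV-A

frames_per_radar = n_trials * trial_minutes * 60 * frame_rate_hz
print(frames_per_radar)   # 360000

# Rough average raw size per frame per radar (assumes 900 GB = 900e9 bytes
# in total for the two radars; both are assumptions for illustration).
n_radars = 2
avg_bytes_per_frame = 900e9 / (n_radars * frames_per_radar)
print(avg_bytes_per_frame)  # 1250000.0
```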

V Performance Evaluation

In this section, we present the performance of SUPER and ablation studies.

V-A Evaluation metrics and baseline method

We choose three metrics from the literature to quantify the accuracy of the estimated upper body joints. The first is the Mean Per Joint Position Error (MPJPE), which measures the average Euclidean distance (in mm) between the predicted joints of a human skeleton and the ground truth joints in a given dataset. The MPJPE is defined as:

$$E_{MPJPE}(f,J)=\frac{1}{K}\sum_{k=1}^{K}||P_{f,J}^{(f)}(k)-P_{gt,J}^{(f)}(k)||_{2}, \qquad (2)$$

where $f$ denotes a frame, $J$ denotes the joints model/set, $K$ is the number of joints in the model/set, $P_{f,J}^{(f)}(k)$ is the estimated position of joint $k$, and $P_{gt,J}^{(f)}(k)$ is the corresponding ground truth position. Finally, the MPJPEs are averaged over all frames.
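As a concrete reading of Eq. (2), the following NumPy sketch computes the per-frame MPJPE and then averages it over frames. The function name `mpjpe` and the (frames, joints, xyz) array layout are assumptions made for illustration.

```python
import numpy as np

def mpjpe(pred, gt):
    """pred, gt: (F, K, 3) joint positions in mm. Returns the MPJPE averaged over frames."""
    per_joint = np.linalg.norm(pred - gt, axis=-1)  # (F, K) Euclidean error per joint
    per_frame = per_joint.mean(axis=1)              # Eq. (2) for every frame
    return per_frame.mean()                         # average over all frames

gt = np.zeros((2, 14, 3))
pred = gt + np.array([0.0, 0.0, 10.0])  # every joint shifted by 10 mm along z
print(mpjpe(pred, gt))  # 10.0
```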

The second metric is the Procrustes alignment MPJPE (PA-MPJPE) that calculates the average 3D joint distance (mm) after performing Procrustes alignment [44] on the estimated and ground-truth joint sets. PA-MPJPE measures how well the pose estimation model captures the structural information of the pose, rather than just its location or scale. It eliminates system biases and allows for fair comparisons across different scales of the same pose.
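A minimal sketch of PA-MPJPE is given below. It uses the pairwise similarity alignment (scale, rotation, translation, solved via SVD) that is commonly used for this metric; whether this matches the authors' exact alignment procedure is an assumption, and the function names are illustrative.

```python
import numpy as np

def procrustes_align(pred, gt):
    """Align pred (K, 3) to gt (K, 3) with the best-fit scale, rotation and translation."""
    mu_pred, mu_gt = pred.mean(axis=0), gt.mean(axis=0)
    p0, g0 = pred - mu_pred, gt - mu_gt
    # SVD of the 3x3 cross-covariance gives the optimal rotation.
    u, s, vt = np.linalg.svd(p0.T @ g0)
    # Flip the last axis if needed so the result is a proper rotation (det = +1).
    d = 1.0 if np.linalg.det(vt.T @ u.T) >= 0 else -1.0
    fix = np.array([1.0, 1.0, d])
    rot = vt.T @ np.diag(fix) @ u.T
    scale = (s * fix).sum() / (p0 ** 2).sum()
    return scale * p0 @ rot.T + mu_gt

def pa_mpjpe(pred, gt):
    """pred, gt: (F, K, 3). Mean per-joint error after per-frame alignment."""
    errors = [np.linalg.norm(procrustes_align(p, g) - g, axis=-1).mean()
              for p, g in zip(pred, gt)]
    return float(np.mean(errors))

# Synthetic check: a scaled, rotated and shifted copy of the ground truth.
rng = np.random.default_rng(0)
gt = rng.normal(size=(2, 14, 3))
c, s_ = np.cos(0.3), np.sin(0.3)
rot_z = np.array([[c, -s_, 0.0], [s_, c, 0.0], [0.0, 0.0, 1.0]])
pred = 1.5 * gt @ rot_z.T + np.array([10.0, -5.0, 2.0])
print(pa_mpjpe(pred, gt))  # numerically zero for this synthetic case
```

Because the alignment removes global scale, rotation and translation, the synthetic prediction above has a large MPJPE but a PA-MPJPE of essentially zero.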

The third metric is the percentage of correct keypoints under a distance threshold, e.g., 15 mm (PCK@15mm). This metric is defined as:

$$PCK@15mm=\frac{1}{K}\sum_{k=1}^{K}\delta_{k}, \qquad (3)$$

where $K$ is the total number of keypoints (joints), and $\delta_{k}$ is a binary value indicating whether the distance between the ground truth keypoint and the predicted keypoint is within a certain threshold.
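Eq. (3) can be sketched as follows, reported as a percentage over all joints and frames. The function name `pck` and the array layout are assumptions; treating a distance exactly equal to the threshold as correct is also an assumption.

```python
import numpy as np

def pck(pred, gt, threshold_mm=15.0):
    """pred, gt: (F, K, 3) joint positions in mm. Returns the percentage of correct keypoints."""
    dist = np.linalg.norm(pred - gt, axis=-1)   # (F, K) per-joint distances
    delta = dist <= threshold_mm                # binary indicator per keypoint
    return 100.0 * delta.mean()

gt = np.zeros((1, 4, 3))
pred = gt.copy()
pred[0, :, 0] = [0.0, 10.0, 20.0, 30.0]  # errors of 0, 10, 20 and 30 mm along x
print(pck(pred, gt))  # 50.0
```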

Baseline method

We adopt mmMesh [14] as the baseline method for comparison. The choice is primarily driven by the fact that the model architecture was made publicly available by the authors. Although $m^{4}esh$ is more recent, it targets multi-subject scenarios, which are outside the scope of this paper. For consistency, the mmMesh model parameters were retrained using the hyperparameters suggested in [14] and data from radar H only.

V-B Main results

TABLE IV: Accuracy of joint estimations (in mm)

| action | method | MPJPE (mm) ↓ | PA_MPJPE (mm) ↓ | PCK@15mm (%) ↑ |
| driving | mmMesh | 156.85 ± 25.18 | 29.60 ± 6.2 | 13.76 ± 7.38 |
| driving | Ours | 112.46 ± 12.70 | 16.32 ± 2.45 | 27.38 ± 11.15 |
| handreaching | mmMesh | 148.33 ± 25.18 | 26.97 ± 4.39 | 15.74 ± 8.65 |
| handreaching | Ours | 114.87 ± 25.07 | 15.19 ± 2.56 | 37.17 ± 8.28 |
| head rot. | mmMesh | 174.42 ± 40.61 | 30.43 ± 8.82 | 10.00 ± 6.85 |
| head rot. | Ours | 108.85 ± 15.46 | 16.16 ± 2.59 | 28.46 ± 9.34 |

We calculate the average MPJPE, PA_MPJPE, and PCK@15mm of the 14 upper body joints in leave-one-subject-out experiments. The results in Table IV show that our approach substantially outperforms the baseline model, by average margins (averaged over the three actions) of 30%, 45%, and 184% on MPJPE, PA_MPJPE, and PCK@15mm, respectively.

Furthermore, we evaluate the model’s effectiveness on the upper limb joints that are pivotal to hand-object interactions. The MPJPE and PA_MPJPE of the left and right wrist and elbow joints are summarized in Table V.

TABLE V: Accuracy of upper limb key joint positions (in mm)

| action | method | MPJPE ↓ (wrist) | MPJPE ↓ (elbow) | PA_MPJPE ↓ (wrist) | PA_MPJPE ↓ (elbow) |
| driving | mmMesh | 341.96 ± 64.83 | 199.94 ± 37.81 | 103.35 ± 25.63 | 66.50 ± 18.48 |
| driving | Ours | 119.46 ± 25.71 | 114.30 ± 14.16 | 38.95 ± 6.98 | 28.45 ± 6.20 |
| handreaching | mmMesh | 312.22 ± 66.61 | 187.68 ± 30.50 | 96.44 ± 21.47 | 64.44 ± 16.39 |
| handreaching | Ours | 140.46 ± 30.77 | 124.86 ± 27.79 | 42.57 ± 11.65 | 32.30 ± 9.78 |
| head | mmMesh | 377.14 ± 102.65 | 226.84 ± 56.17 | 104.40 ± 29.49 | 68.90 ± 20.96 |
| head | Ours | 131.08 ± 26.88 | 127.10 ± 31.05 | 38.70 ± 9.41 | 27.96 ± 6.84 |

From the results, we can see that the average accuracy of the wrist joints is lower than that of the elbow and other upper body joints. This can be explained by the low RCS of hands, which makes them difficult for radars to capture. However, SUPER considerably outperforms mmMesh in the estimation of both upper limb joints (wrist and elbow). Thus, we conclude that it is important to design a specific pipeline for SUB-HPE, and that the inclusion of intensity features and the use of radar V are instrumental in improving the accuracy.

Figure 10: Constructed 3D poses in skeleton representation from SUPER, mmMesh and Ground Truth

Examples of constructed 3D poses in skeleton representation from SUPER, mmMesh and the ground truth are shown in Figure 10.

V-C Ablation study

TABLE VI: Results from Ablation Study

| action | information | MPJPE (mm) ↓ | PA_MPJPE (mm) ↓ | PCK@15mm (%) ↑ |
| driving | Doppler only | 237.33 ± 26.72 | 44.78 ± 2.29 | 0.37 ± 1.46 |
| driving | intensity only | 196.97 ± 32.89 | 20.09 ± 6.94 | 25.36 ± 15.98 |
| driving | Doppler+intensity | 101.51 ± 19.06 | 12.27 ± 3.41 | 47.40 ± 17.18 |
| handreaching | Doppler only | 266.28 ± 36.64 | 48.49 ± 3.74 | 0.08 ± 0.91 |
| handreaching | intensity only | 193.05 ± 43.37 | 19.26 ± 7.94 | 26.75 ± 19.83 |
| handreaching | Doppler+intensity | 110.73 ± 30.91 | 12.51 ± 5.86 | 47.10 ± 20.53 |
| head | Doppler only | 221.46 ± 31.73 | 46.04 ± 1.21 | 0.02 ± 0.63 |
| head | intensity only | 190.71 ± 44.41 | 19.05 ± 7.47 | 27.68 ± 18.90 |
| head | Doppler+intensity | 99.67 ± 28.14 | 12.28 ± 5.87 | 45.25 ± 18.79 |

* The standard deviation in this table is calculated across the estimated position errors per joint and per frame.

We further conduct ablation experiments to evaluate the effectiveness of the Doppler and intensity point clouds in the training data. To do so, we input only the Doppler point cloud or only the intensity point cloud and remove the respective branch in the backbone (Figure 7). Table VI reports the results from one test subject performing different actions. Clearly, neither intensity nor Doppler point clouds alone are sufficient. Combining both sets of features leads to the highest accuracy. Somewhat interestingly, between the two, intensity point clouds appear to be more informative.

VI Demonstrative Application

In this section, we demonstrate the utility of SUPER through a downstream task that identifies hand-object interaction through SUB-HPE. Note that what is being presented acts as a proof-of-concept. Likely, more sophisticated methods can be implemented for the task on top of SUB-HPE.

Figure 11: Visualization of the ground truth and model estimated wrist trajectories of a 4s sequence from a handreaching action. The units are in meters.

In this task, the aim is to determine which objects in 3D space one is interacting with by hand. Consider a motion sequence where a hand starts from some resting position, moves toward an object at a known location, and then interacts with the object for a period of time. We transform the problem of object identification into a localization problem, namely, determining whether one’s hand (specifically, a wrist joint) falls into the predefined bounding boxes around target locations for a sufficient amount of time.

To test this idea, we first calculate the displacement of a wrist joint during 1 s windows in the ground truth trajectory. Intervals in which the total displacement is less than a predefined threshold (100 mm in the implementation) indicate either the initial rest position or the rendezvous point between the hand and a target object. We compute the centroid of the wrist joint positions in such intervals and test it against the ground truth target locations. As an example, consider the ground truth and estimated trajectories shown in Figure 11. In this example, one’s hand travels from $B$ to $A$ and then reaches a target location $C$. Although the estimated trajectory does not exactly coincide with the ground truth one, it can be observed that, as the hand approaches and stays around the target location, the estimated locations are close to $C$.
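The interval detection described above can be sketched as follows. The 1 s window and 100 mm threshold come from the text, and the 20 Hz frame rate from Section IV-A. The function name, the use of non-overlapping windows, and the interpretation of total displacement as the summed frame-to-frame path length are assumptions made for illustration.

```python
import numpy as np

def stationary_centroids(wrist_mm, frame_rate_hz=20, window_s=1.0, threshold_mm=100.0):
    """Return centroids (in mm) of 1 s windows in which the wrist barely moves.

    wrist_mm: (N, 3) wrist positions in mm, one row per frame.
    """
    wrist_mm = np.asarray(wrist_mm, dtype=float)
    win = int(round(frame_rate_hz * window_s))
    centroids = []
    # Non-overlapping windows (an assumption; overlapping windows would also work).
    for start in range(0, len(wrist_mm) - win + 1, win):
        seg = wrist_mm[start:start + win]
        # Total displacement taken as the summed frame-to-frame path length.
        path_len = np.linalg.norm(np.diff(seg, axis=0), axis=1).sum()
        if path_len < threshold_mm:
            centroids.append(seg.mean(axis=0))
    return np.array(centroids)

# Synthetic trajectory: 1 s at rest, 1 s reaching 400 mm along x, 1 s holding.
rest = np.zeros((20, 3))
reach = np.linspace([0.0, 0.0, 0.0], [400.0, 0.0, 0.0], 20)
hold = np.tile([400.0, 0.0, 0.0], (20, 1))
centroids = stationary_centroids(np.vstack([rest, reach, hold]))
print(centroids)  # two centroids: the rest position and the hold position
```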

We conduct experiments on all subjects using the hand-reaching trials. The results show that, in 88.80% of the rest position or rendezvous point intervals, the centroid of the estimated wrist trajectory falls into a bounding box centered on the target location with a side length of 0.2 m.
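The bounding-box test behind this hit rate can be sketched as below. The 0.2 m side length comes from the text; the axis-aligned cube, the function names, and the example coordinates are assumptions made for illustration and are not data from the experiments.

```python
def in_bounding_box(centroid_m, target_m, side_m=0.2):
    """True if the centroid lies inside an axis-aligned cube centered on the target."""
    half = side_m / 2
    return all(abs(c - t) <= half for c, t in zip(centroid_m, target_m))

def hit_rate(centroids_m, targets_m, side_m=0.2):
    """Percentage of intervals whose centroid falls inside the target's bounding box."""
    hits = sum(in_bounding_box(c, t, side_m) for c, t in zip(centroids_m, targets_m))
    return 100.0 * hits / len(centroids_m)

# Hypothetical centroids (meters) paired with their target locations.
centroids_m = [(0.05, 0.00, -0.08), (0.15, 0.00, 0.00)]
targets_m = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
print(hit_rate(centroids_m, targets_m))  # 50.0
```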

VII Discussion and Conclusion

In this work, we proposed SUPER, a pipeline for SUB-HPE. To address the challenges of nuanced upper body movements when seated, we obtained both intensity and Doppler point clouds by fusing data coherently from two radars with orthogonal orientations. Compared to a baseline method that only utilizes Doppler point clouds from a single radar, SUPER has superior performance in terms of all metrics for HPE.

The current SUPER framework assumes the presence of a single subject and the knowledge of the ROI. It can be easily extended to multiple subjects and unknown ROIs when combined with a target detection component. The current model can also be trained with additional mesh errors in SMPL and a term reflecting temporal consistency and smoothness of human movements [45]. Doing so is expected to further improve the accuracy and realism of the inferred poses.

Future research directions for mmWave-based SUB-HPE also include developing models that are robust to different deployment environments and the investigation of more downstream tasks.

References

  • [1] Y. Liu, J. Yang, X. Gu, Y. Guo, and G.-Z. Yang, “Ego+x: An egocentric vision system for global 3d human pose estimation and social interaction characterization,” in 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2022, pp. 5271–5277.
  • [2] J. Wang, K. Qiu, H. Peng, J. Fu, and J. Zhu, “Ai coach: Deep human pose estimation and analysis for personalized athletic training assistance,” ser. MM ’19.   New York, NY, USA: Association for Computing Machinery, 2019. [Online]. Available: https://doi.org/10.1145/3343031.3350910
  • [3] T. Anvari and K. Park, “3d human body pose estimation in virtual reality: A survey,” in 2022 13th International Conference on Information and Communication Technology Convergence (ICTC), 2022, pp. 624–628.
  • [4] Y. Zhou, H. Huang, S. Yuan, H. Zou, L. Xie, and J. Yang, “Metafi++: Wifi-enabled transformer-based human pose estimation for metaverse avatar simulation,” IEEE Internet of Things Journal, vol. 10, no. 16, pp. 14 128–14 136, 2023.
  • [5] S. Y. Cheng and M. Trivedi, “Turn-intent analysis using body pose for intelligent driver assistance,” IEEE Pervasive Computing, vol. 5, no. 4, pp. 28–37, 2006.
  • [6] C. E. Matthews, S. A. Carlson, P. F. Saint-Maurice, S. Patel, E. Salerno, E. Loftfield, R. P. Troiano, J. E. Fulton, J. N. Sampson, C. Tribby et al., “Sedentary behavior in united states adults: Fall 2019,” Medicine and science in sports and exercise, vol. 53, no. 12, p. 2512, 2021.
  • [7] M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black, “Smpl: A skinned multi-person linear model,” ACM Trans. Graph., vol. 34, no. 6, oct 2015. [Online]. Available: https://doi.org/10.1145/2816795.2818013
  • [8] A. A. Osman, T. Bolkart, and M. J. Black, “Star: Sparse trained articulated human body regressor,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part VI 16.   Springer, 2020, pp. 598–613.
  • [9] Z. Cao, G. H. Martinez, T. Simon, S. Wei, and Y. A. Sheikh, “Openpose: Realtime multi-person 2d pose estimation using part affinity fields,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
  • [10] Y. Xu, J. Zhang, Q. Zhang, and D. Tao, “Vitpose: Simple vision transformer baselines for human pose estimation,” in Advances in Neural Information Processing Systems, 2022.
  • [11] Y. Huang, M. Kaufmann, E. Aksan, M. J. Black, O. Hilliges, and G. Pons-Moll, “Deep inertial poser: Learning to reconstruct human pose from sparse inertial measurements in real time,” ACM Transactions on Graphics (TOG), vol. 37, no. 6, pp. 1–15, 2018.
  • [12] V. Mollyn, R. Arakawa, M. Goel, C. Harrison, and K. Ahuja, “Imuposer: Full-body pose estimation using imus in phones, watches, and earbuds,” in Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, 2023, pp. 1–12.
  • [13] A. Sengupta, F. Jin, R. Zhang, and S. Cao, “mm-pose: Real-time human skeletal posture estimation using mmwave radars and cnns,” IEEE Sensors Journal, vol. 20, no. 17, pp. 10 032–10 044, 2020.
  • [14] H. Xue, Y. Ju, C. Miao, Y. Wang, S. Wang, A. Zhang, and L. Su, “mmMesh: Towards 3D real-time dynamic human mesh construction using millimeter-wave,” in Proceedings of the 19th Annual International Conference on Mobile Systems, Applications, and Services, 2021, pp. 269–282.
  • [15] J. Geng, D. Huang, and F. De la Torre, “Densepose from wifi,” arXiv preprint arXiv:2301.00250, 2022.
  • [16] P. Hügler, M. Geiger, and C. Waldschmidt, “Rcs measurements of a human hand for radar-based gesture recognition at e-band,” in 2016 German Microwave Conference (GeMiC).   IEEE, 2016, pp. 259–262.
  • [17] T. Gu, Z. Fang, Z. Yang, P. Hu, and P. Mohapatra, “Mmsense: Multi-person detection and identification via mmwave sensing,” in Proceedings of the 3rd ACM Workshop on Millimeter-wave Networks and Sensing Systems, 2019, pp. 45–50.
  • [18] C. Wu, F. Zhang, B. Wang, and K. R. Liu, “mmtrack: Passive multi-person localization using commodity millimeter wave radio,” in IEEE INFOCOM 2020-IEEE Conference on Computer Communications.   IEEE, 2020, pp. 2400–2409.
  • [19] P. Zhao, C. X. Lu, J. Wang, C. Chen, W. Wang, N. Trigoni, and A. Markham, “mid: Tracking and identifying people with millimeter wave radar,” in 2019 15th International Conference on Distributed Computing in Sensor Systems (DCOSS).   IEEE, 2019, pp. 33–40.
  • [20] J. Lien, N. Gillian, M. E. Karagozler, P. Amihood, C. Schwesig, E. Olson, H. Raja, and I. Poupyrev, “Soli: Ubiquitous gesture sensing with millimeter wave radar,” ACM Transactions on Graphics (TOG), vol. 35, no. 4, pp. 1–19, 2016.
  • [21] H. Liu, A. Zhou, Z. Dong, Y. Sun, J. Zhang, L. Liu, H. Ma, J. Liu, and N. Yang, “M-gesture: Person-independent real-time in-air gesture recognition using commodity millimeter wave radar,” IEEE Internet of Things Journal, vol. 9, no. 5, pp. 3397–3415, 2021.
  • [22] S. Palipana, D. Salami, L. A. Leiva, and S. Sigg, “Pantomime: Mid-air gesture recognition with sparse millimeter-wave radar point clouds,” Proceedings of the ACM on interactive, mobile, wearable and ubiquitous technologies, vol. 5, no. 1, pp. 1–27, 2021.
  • [23] A. Khamis, B. Kusy, C. T. Chou, M.-L. McLaws, and W. Hu, “Rfwash: a weakly supervised tracking of hand hygiene technique,” in Proceedings of the 18th conference on embedded networked sensor systems, 2020, pp. 572–584.
  • [24] Z. Yang, P. H. Pathak, Y. Zeng, X. Liran, and P. Mohapatra, “Monitoring vital signs using millimeter wave,” in Proceedings of the 17th ACM international symposium on mobile ad hoc networking and computing, 2016, pp. 211–220.
  • [25] P. Zhao, C. X. Lu, B. Wang, C. Chen, L. Xie, M. Wang, N. Trigoni, and A. Markham, “Heart rate sensing with a robot mounted mmwave radar,” in 2020 IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2020, pp. 2812–2818.
  • [26] F. Wang, X. Zeng, C. Wu, B. Wang, and K. R. Liu, “mmhrv: Contactless heart rate variability monitoring using millimeter-wave radio,” IEEE Internet of Things Journal, vol. 8, no. 22, pp. 16 623–16 636, 2021.
  • [27] B. Zhang, B. Jiang, R. Zheng, X. Zhang, J. Li, and Q. Xu, “Pi-vimo: Physiology-inspired robust vital sign monitoring using mmwave radars,” ACM Transactions on Internet of Things, vol. 4, no. 2, pp. 1–27, 2023.
  • [28] S. An and U. Y. Ogras, “Mars: mmwave-based assistive rehabilitation system for smart healthcare,” ACM Transactions on Embedded Computing Systems (TECS), vol. 20, no. 5s, pp. 1–22, 2021.
  • [29] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, “Pointnet: Deep learning on point sets for 3d classification and segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 652–660.
  • [30] H. Xue, Q. Cao, Y. Ju, H. Hu, H. Wang, A. Zhang, and L. Su, “M4esh: mmwave-based 3d human mesh construction for multiple subjects,” in Proceedings of the 20th ACM Conference on Embedded Networked Sensor Systems, 2022, pp. 391–406.
  • [31] S.-P. Lee, N. P. Kini, W.-H. Peng, C.-W. Ma, and J.-N. Hwang, “Hupr: A benchmark for human pose estimation using millimeter wave radar,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2023, pp. 5715–5724.
  • [32] A. Chen, X. Wang, S. Zhu, Y. Li, J. Chen, and Q. Ye, “mmbody benchmark: 3d body reconstruction dataset and analysis for millimeter wave radar,” in Proceedings of the 30th ACM International Conference on Multimedia, 2022, pp. 3501–3510.
  • [33] W. Chen, H. Yang, X. Bi, R. Zheng, F. Zhang, P. Bao, Z. Chang, X. Ma, and D. Zhang, “Environment-aware multi-person tracking in indoor environments with mmwave radars,” Proc. ACM Interact. Mob. Wearable Ubiquitous Technol., vol. 7, no. 3, sep 2023.
  • [34] R. Q. Charles, H. Su, M. Kaichun, and L. J. Guibas, “Pointnet: Deep learning on point sets for 3d classification and segmentation,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 77–85.
  • [35] C. R. Qi, L. Yi, H. Su, and L. J. Guibas, “Pointnet++: Deep hierarchical feature learning on point sets in a metric space,” in Proceedings of the 31st International Conference on Neural Information Processing Systems, ser. NIPS’17.   Red Hook, NY, USA: Curran Associates Inc., 2017, p. 5105–5114.
  • [36] M. A. Richards, Fundamentals of Radar Signal Processing, 2nd Edition.   McGraw Hill, 2005.
  • [37] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Comput., vol. 9, no. 8, p. 1735–1780, nov 1997. [Online]. Available: https://doi.org/10.1162/neco.1997.9.8.1735
  • [38] Y. Zhou, C. Barnes, J. Lu, J. Yang, and H. Li, “On the continuity of rotation representations in neural networks,” in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 5738–5746.
  • [39] T. Instruments, “Iwr6843isk,” 2020. [Online]. Available: https://www.ti.com/product/IWR6843
  • [40] ——, “Dca1000evm,” 2020. [Online]. Available: https://www.ti.com/tool/DCA1000EVM
  • [41] ——, “mmwave studio,” 2020. [Online]. Available: http://www.ti.com/tool/MMWAVE-STUDIO
  • [42] OptiTrack, “Optitrack: Motion capture systems,” 2020. [Online]. Available: https://www.optitrack.com/
  • [43] AUTODESK, “Motionbuilder,” 2022. [Online]. Available: https://www.autodesk.com/
  • [44] J. Gower, “Generalized procrustes analysis.” Psychometrika, vol. 40, p. 33–51, 1975.
  • [45] C. Zheng, W. Wu, C. Chen, T. Yang, S. Zhu, J. Shen, N. Kehtarnavaz, and M. Shah, “Deep learning-based human pose estimation: A survey,” ACM Computing Surveys, vol. 56, no. 1, pp. 1–37, 2023.