ProbRadarM3F: mmWave Radar-based Human Skeletal Pose Estimation with Probability Map Guided Multi-Format Feature Fusion

Bing Zhu¹,², Zixin He¹, Weiyi Xiong, Guanhua Ding, Jianan Liu, Tao Huang, Wei Chen, and Wei Xiang

B. Zhu, Z. He and W. Xiong are with the School of Automation Science and Electrical Engineering, Beihang University, Beijing 100191, P.R. China. G. Ding is with the School of Electronics and Information Engineering, Beihang University, Beijing 100191, P.R. China. J. Liu is with Vitalent Consulting, Gothenburg, Sweden. T. Huang is with the College of Science and Engineering, James Cook University, Cairns, Australia. W. Chen is with the School of Biomedical Engineering, The University of Sydney, Camperdown, NSW 2050, Australia. W. Xiang is with the School of Computing, Engineering and Mathematical Sciences, La Trobe University, Melbourne, Australia.

¹ These authors contribute equally to the work and are co-first authors. ² Corresponding author.
Abstract

Millimeter-wave (mmWave) radar is a non-intrusive, privacy-preserving, and relatively convenient and inexpensive sensor, which has been demonstrated to be applicable in place of RGB cameras in indoor human pose estimation tasks. However, mmWave radar relies on reflected signals collected from the target, and the information contained in these signals is difficult to exploit fully. This has been a long-standing hindrance to improving pose estimation accuracy. To address this challenge, this paper introduces a probability map guided multi-format feature fusion model, ProbRadarM3F. It is a novel radar feature extraction framework that uses a traditional FFT method in parallel with a probability map based positional encoding method. ProbRadarM3F fuses the traditional heatmap features with the positional features and then estimates 14 keypoints of the human body. Experimental evaluation on the HuPR dataset demonstrates the effectiveness of the proposed model, which outperforms the other methods evaluated on this dataset with an AP of 69.9%. Our study emphasizes the position information in the radar signal that has not been exploited before. This provides a direction for investigating other potentially non-redundant information from mmWave radar.

Index Terms:
mmWave radar, probability map, positional encoding, radar heatmap, multi-format feature fusion, human skeletal pose estimation

I Introduction

Smart medical care is paramount in hospitals and nursing homes in today’s society. To improve healthcare outcomes and ensure patient safety, there is an increasing need for human pose monitoring. The Internet of Things (IoT) has become instrumental in this regard, providing continuous and efficient monitoring capabilities [1, 2]. As shown in Fig. 1, IoT with human activity monitoring is essential to improve the convenience, safety, and interactivity of medical care. Traditionally, monitoring systems have relied on RGB cameras to estimate human pose and track human activity [3]. However, these methods pose significant privacy concerns and can be less effective in low-light or obscured environments.

Recent advancements focus on reducing privacy concerns by using non-visual signal-based sensing methods [4, 5]. These signals do not contain human facial information or visual photographs, thus addressing the need for human privacy in IoT application scenarios [6, 7]. Millimeter-wave (mmWave) radar has emerged as a prominent technology due to its high resolution and cost-effectiveness. MmWave radar can accurately estimate human skeletal poses without capturing identifiable visual information, making it ideal for preserving privacy. This paper focuses on mmWave radar-based human skeletal pose estimation, a critical component in smart sensor systems for medical care or smart home IoT applications.

Figure 1: Consider an illustration of a medical IoT application: In this scenario, if a person receiving care at home experiences a fall, it is immediately detected, triggering an alarm at a remote care center and prompting the automatic dispatch of an ambulance. Our paper focuses on utilizing mmWave radar to estimate the human skeletal pose, a critical sensor component in this example of a medical IoT application. This approach preserves the user’s privacy while still providing the necessary monitoring information to the care center.

MmWave radar, a robust yet low-cost sensor, has been applied for environment perception in advanced driver-assistance systems (ADAS) [8, 9, 10, 11, 12, 13] and cooperative intelligent transportation systems (C-ITS) [14] in recent years. Considerable frontier research on mmWave radar has focused on achieving human activity sensing tasks. However, significant challenges remain.

In the classical radar processing pipeline, various levels of data representation exist. For human skeletal pose estimation, point clouds and heatmaps are the most widely used forms of data representation. However, point cloud generation relies on hand-crafted parameters, and the process itself loses Doppler information and some positional data, leading to difficulties in capturing finer joint features. Heatmaps are often generated using the Fast Fourier Transform (FFT) method [15], which can extract range, Doppler, and angle of arrival (AoA) information from the raw radar signal to create different types of heatmaps. However, much of the information in the heatmap is considered redundant and is left unexploited. Recently, many pose estimation works have been realized through a point cloud + neural network pipeline or a heatmap + neural network pipeline. Some studies have focused on improving the learning capability of networks and algorithms to address the precision problem of mmWave radar-based estimation. However, the importance of capturing the information present in the radar signal itself is often neglected, leading to the loss of significant effective information.

To address the inefficiencies in traditional heatmap-based methods, a new method is proposed, constructing an efficient feature extraction framework. Specifically, the method incorporates two feature extraction branches. One branch adheres to conventional procedures, utilizing the FFT method to generate heatmaps for extracting range, Doppler, azimuth, and elevation information from mmWave radar signals. The other branch introduces a probability map generation and positional encoding method, designed to enhance the utilization of positional information in radar signals. A model is also proposed for fusing probability-guided positional features with traditional heatmap features. The contributions of this work are summarized below:

  • A mmWave radar-based multi-format feature fusion model, ProbRadarM3F, is presented for human skeletal pose estimation. The model is designed for raw mmWave radar signals, with features extracted from two branches: the FFT branch and the ProbPE branch. In the ProbPE branch, probability maps are generated, introducing a positional encoding method. To the best of our knowledge, this study represents the first attempt to extract additional probabilistic and positional features from mmWave radar signals to aid in pose estimation using a probability map-guided positional encoding approach. By fusing features in different formats, the keypoint prediction accuracy is significantly increased. ProbRadarM3F can serve as a new baseline for research and application in radar position information exploitation.

  • To validate the effectiveness of ProbRadarM3F by fair comparison with [16], our model was built upon the model used in [16] and experiments were conducted on the same dataset, HuPR [16], which provides purely raw radar signals. The results demonstrate that the design of a probability map-guided positional encoding strategy greatly improves the recognition performance of the model, leading to a significant enhancement in the precision of human skeletal pose prediction.

The subsequent sections of this paper are organized as follows: Section II provides a comprehensive review of the research and methodology of mmWave radar for indoor sensing applications, particularly for human skeletal pose prediction. Section III details the implementation of the proposed ProbRadarM3F model. In Section IV, experiments conducted with ProbRadarM3F on the HuPR dataset are presented, and the results are evaluated. The advantages of its pose prediction capabilities are demonstrated. Finally, Section V summarizes the approach and findings and suggests potential directions for future research.

Figure 2: Illustration of our proposed ProbRadarM3F model. This model consists of two main branches: the FFT branch and the ProbPE branch. Different colored arrows represent data streams from different radars. The initial input is the raw ADC data from two vertically placed mmWave radars. After processing the raw radar data into a radar data cube, the FFT branch processes the data to extract features from range-Doppler-azimuth heatmaps and range-Doppler-elevation heatmaps based on 4D-FFT. The ProbPE branch applies positional encoding to extract features from generated radar probability maps. In the following module, the multi-frame and multi-format features are fused, and a cross- and self-attention module is introduced to generate skeletal pose estimation.

II Related Works

II-A Millimeter-Wave Radar Indoor Sensing

The field of millimeter-wave (mmWave) radar for indoor sensing has advanced substantially in recent years, driven by the growing demand for non-invasive, privacy-preserving technologies for human presence and activity detection. MmWave radar offers unique advantages for indoor sensing applications. MmWave signals can penetrate non-metallic objects while providing high-resolution imaging capabilities [17]. Their utility is increasingly evident in through-wall sensing [18], human detection and recognition [19], and vital sign monitoring [20, 21, 22].

Initially, researchers used radar signal processing to extract waveform features, combining them with target distance, orientation, altitude, and motion information to perform simple action classification tasks [23]. Later, more advanced techniques such as deep learning were introduced for more accurate indoor sensing based on basic mmWave radar data. For instance, Wang et al. proposed a sequence-to-sequence (seq2seq) 3D temporal convolutional network with a self-attention method to estimate human hand pose [24]. Yi et al. utilized a multi-person detection model based on long short-term memory (LSTM) to determine human presence and localize their positions using a single 60 GHz mmWave radio [25]. Wu et al. pioneered the segmentation of human silhouettes from mmWave RF signals by locating human reflections in radar frames and extracting features from surrounding signals with a human detection module and from aggregated frames with an attention-based mask generation module [26].

Although these studies have been highly successful, they have focused on the structure of deep learning networks and neglected how to extract more valid features from the radar signal. Given the inherent limitations of mmWave radar data, this oversight may result in the loss of valuable information during processing. Therefore, in this paper, we pay special attention to the position information contained in the radar signals, which has long gone unnoticed but is rich in information. The positional features are extracted and combined with traditional features to depict human body poses.

II-B MmWave based Human Skeletal Pose Estimation

Since the application demands both privacy and precision, the advent of mmWave radar technology has markedly expanded the horizons of human pose estimation and recognition. A notable research direction in human pose estimation is the recognition of skeletal pose. The advantage of human skeletal pose estimation lies in its ability to provide a simplified but informative representation of human pose and movements, facilitating efficient and accurate recognition of human activity. Skeletal pose prediction derived from mmWave radar signals features critical spatial and temporal information about human activity without the privacy concerns associated with optical imaging.

RF-Pose [27] is one of the pioneering works that first considered human skeleton reconstruction. Zhao et al. proposed a deep network that uses FMCW signals to estimate coarse parts of the human skeleton under the supervision of visual information. They subsequently improved the model to successfully predict 14 human joints, including the head, neck, shoulders, elbows, wrists, hips, knees, and feet. RF-Pose and its follow-up work utilized expensive customized hardware systems. Nowadays, more studies are based on low-cost mass-produced industrial millimeter-wave radar and have demonstrated good performance. Ding et al. presented a kinematic-constrained learning architecture that incorporates kinematic constraints with neural network learning for skeleton estimation based on range-Doppler heatmaps [28]. Kong et al. performed a convolution operation on the range-Doppler profile to detect the corresponding ranges and designed a two-stream deep learning architecture to extract body shape and motion features to predict skeleton joint coordinates and reconstruct body posture [29]. Cao et al. developed a model that incorporates part-level range-Doppler maps for individual body parts with local kinematic constraints and global constraints for reconstructing the human skeleton [30]. In this work, we also take advantage of the human skeletal pose representation. Our extracted positional features based on probability maps are beneficial for predicting the positions of key human joints and the overall posture.

III Proposed Methods

In this section, detailed descriptions of ProbRadarM3F for human skeletal pose estimation, built upon [16], with additional fusion of multi-format features from mmWave radar data, are provided. The structure of the model is illustrated in Fig. 2. After processing the raw signal from mmWave radar, features are extracted from the radar data through the FFT branch and the ProbPE branch. Multi-frame features in different formats from the two branches are fused. Finally, the coordinates of human skeletal joints are predicted from the fused features via a cross-self-attention mechanism.

III-A FFT Branch: Radar Data Processing and Range-Doppler-Angle Map Generating

The dataset used provides only unprocessed raw radar data acquired by the DCA1000 system. Therefore, the radar signal is pre-processed to convert the raw analogue-to-digital converter (ADC) data into a structured format, specifically a radar data cube, for efficient signal processing and analysis.

The raw ADC data, captured in binary format, is initially one-dimensional, representing interleaved ADC samples across multiple channels. It is structured by organizing the data into a two-dimensional matrix to separate the individual low voltage differential signaling (LVDS) channels. By segregating the samples from even and odd channels, the real and imaginary components of the complex signal are constructed, ensuring the retention of phase information necessary for subsequent processing tasks. Ultimately, the complex data is reformatted into a radar data cube to accurately reflect the sequence of radar data acquisition. This reorganization involves structuring the data into blocks corresponding to each chirp across multiple transmit (TX) and receive (RX) channels. The radar data cube accurately retains information about the spatial and temporal dimensions of the observed scene.

As shown in the FFT branch in Fig. 2, a comprehensive analysis is initiated by performing a 4-dimensional FFT (4D FFT) along the four axes: ADC samples, chirps, horizontal antennas, and vertical antennas as follows:

$$F(h,i,j,k)=\sum_{n=0}^{N-1}\sum_{m=0}^{M-1}\sum_{p=0}^{P-1}\sum_{q=0}^{Q-1} f(n,m,p,q)\, e^{-j2\pi\left(\frac{hn}{N}+\frac{im}{M}+\frac{jp}{P}+\frac{kq}{Q}\right)}, \tag{1}$$

where $f(n,m,p,q)$ represents the original input signal across ADC samples $n$, chirps $m$, horizontal antennas $p$, and vertical antennas $q$, and $F(h,i,j,k)$ denotes the results of the Fourier transform. $N, M, P, Q$ are the sizes of the dimensions, while $h, i, j, k$ are the indices in the frequency domain for the corresponding dimensions. This process yields detailed range-Doppler-azimuth-elevation maps. Additionally, a certain amount of chirps within a specific velocity range is uniformly sampled in the Doppler dimension to filter out irrelevant information in the radar signal. This strategy effectively eliminates extraneous signal components, thereby streamlining the dataset for enhanced processing efficiency.
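As an illustrative sketch of Eq. (1), and not the implementation used in this work, the 4D FFT can be written with NumPy's `np.fft.fftn`. The axis order (ADC samples, chirps, horizontal antennas, vertical antennas), the array sizes, the `fftshift` centring, and the synthetic single-target signal are all assumptions made for the example.

```python
import numpy as np

def range_doppler_angle_fft(cube: np.ndarray) -> np.ndarray:
    """Magnitude of the 4D FFT of a complex radar data cube.

    Assumed axis order: (adc_samples, chirps, h_antennas, v_antennas).
    """
    spectrum = np.fft.fftn(cube, axes=(0, 1, 2, 3))
    # Centre zero Doppler and zero angle; the range axis is left unshifted.
    spectrum = np.fft.fftshift(spectrum, axes=(1, 2, 3))
    return np.abs(spectrum)

# Hypothetical cube sizes, chosen only for the demonstration.
N, M, P, Q = 64, 16, 8, 2
n = np.arange(N)[:, None, None, None]
m = np.arange(M)[None, :, None, None]

# Synthetic target: range bin 10, Doppler bin 3, broadside in both angles.
cube = np.exp(2j * np.pi * (10 * n / N + 3 * m / M)) * np.ones((1, 1, P, Q))

heat = range_doppler_angle_fft(cube)
peak = np.unravel_index(np.argmax(heat), heat.shape)
# The peak sits at range bin 10; the Doppler and angle indices are offset
# by half the axis length because of the fftshift.
```

Leaving the range axis unshifted keeps bin 0 at zero range, while shifting the Doppler and angle axes places zero velocity and broadside at the centre of the map, which is the usual layout for heatmaps.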

The dataset utilized in this study provides data collected by two vertically placed radars. Due to the limited resolution and detection range in elevation angle directions of the adopted radars, the elevation information from each radar is averaged in the processing as follows:

$$F_{\mathrm{avg}}=\frac{1}{Q}\sum_{k_{q}=0}^{Q-1}F(k_{m},k_{n},k_{p},k_{q}). \tag{2}$$

While the azimuth information of the horizontally placed radar is retained, the azimuth information of the vertically placed radar is considered as elevation angle information for the overall data. Eventually, the range-Doppler-azimuth map from the horizontal radar and the range-Doppler-elevation map from the vertical radar are obtained. To combine dynamic information across varying chirps within the range-Doppler-angle maps, the features from multiple chirps are merged by employing the M-Net [31]. The M-Net module utilizes a neural network to fuse different chirp information within a frame, instead of the traditional method, to solve for Doppler velocities, thereby outputting the merged features of the frame.

III-B ProbPE Branch: Probability Map Generating and Positional Encoding

As shown in Fig. 2, in the ProbPE branch, the radar data cube is initially subjected to a two-dimensional FFT (2D FFT). The FFT is executed in the fast-time dimension (the dimension of digitized chirp samples) to obtain range information. Due to the significant impact of velocity frequency effects in the slow-time dimension, where multiple frames correspond to the same range unit, an FFT is also conducted in this dimension to obtain the Doppler frequency.

To minimize irrelevant information that could increase computational load or interfere with feature extraction, a constant false-alarm rate (CFAR) method [32] is employed. This method is crucial in distinguishing between targets and interference noise based on intensity differences. A two-dimensional CFAR (2D-CFAR) is applied to range-Doppler maps to select the elements containing more valid information. In the 2D-CFAR detection process, targets are detected as accurately as possible to reduce false alarms. Guard cells and reference cells are introduced near each detection cell, as shown in the lower left corner of Fig. 3. Cell Averaging CFAR (CA-CFAR) is a common variant of CFAR, which uses the following formula to calculate the threshold for each cell under examination:

$$T=\alpha\cdot\frac{1}{N}\sum_{i=1}^{N}X_{i}, \tag{3}$$

where $T$ represents the threshold for the cell, $X_{i}$ represents the amplitude values of the reference cells surrounding the cell under test, $N$ represents the number of reference cells considered, and $\alpha$ is a scaling factor determined by the desired false alarm rate. If the intensity of the target surpasses that of interference noise, as $I_{\mathrm{target}} > I_{\mathrm{noise}} \times T$, it indicates the presence of the target.
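The CA-CFAR rule in Eq. (3) can be sketched as follows. This is a minimal illustration under stated assumptions rather than the detector used in the experiments: the guard and reference half-widths, the value of `alpha`, the exponentially distributed synthetic noise, and the direct comparison of each cell against $T$ are all choices made for the example.

```python
import numpy as np

def ca_cfar_2d(power, guard=1, ref=2, alpha=5.0):
    """Cell-averaging CFAR on a 2D range-Doppler power map.

    guard and ref are half-widths (in cells) of the guard and reference
    bands around the cell under test; alpha scales the noise estimate.
    All default values are illustrative.
    """
    half = guard + ref
    rows, cols = power.shape
    detections = np.zeros(power.shape, dtype=bool)
    # Border cells without a full window are left undetected.
    for r in range(half, rows - half):
        for c in range(half, cols - half):
            window = power[r - half:r + half + 1, c - half:c + half + 1]
            inner = power[r - guard:r + guard + 1, c - guard:c + guard + 1]
            n_ref = window.size - inner.size
            # Mean of the reference cells only (guard cells and the cell
            # under test are excluded), i.e. (1/N) * sum(X_i).
            noise = (window.sum() - inner.sum()) / n_ref
            detections[r, c] = power[r, c] > alpha * noise  # T = alpha * mean
    return detections

rng = np.random.default_rng(0)
rd_map = rng.exponential(scale=1.0, size=(32, 32))  # synthetic noise floor
rd_map[16, 20] = 60.0                               # synthetic strong target

hits = ca_cfar_2d(rd_map)
range_bins = np.unique(np.nonzero(hits)[0])  # range bins kept for later steps
```

The guard band keeps energy that leaks from the target into neighbouring cells out of the noise estimate. A larger `alpha` lowers the false alarm rate at the cost of missing weaker reflections.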

Subsequently, probability maps are derived from the filtered range bins, which serve as a foundational step in understanding the spatial distribution of potential targets. The concatenation set for the filtered range bins from the two radars is taken separately. The phase differences among various antennas yield distinct spikes, indicating frequency components from different directions. Angle information is obtained by performing FFT on target data from distinct virtual antennas. Angle FFT is executed to derive azimuth from the horizontal radar and elevation from the vertical radar. An averaging calculation in the Doppler dimension is performed because the focus at this point is on probability and position information. For the non-overlapping range bins filtered out from two radars, the Doppler dimensions are averaged directly to fill the corresponding range-angle vectors, facilitating the subsequent multiplication operations. Thus, range-azimuth and range-elevation vectors from the two radars are extracted, respectively. These vectors contain information on the radar’s intensity at specific ranges and angles, effectively characterizing the probability of a target’s presence at a given location. Consequently, radar probability maps are constructed, as illustrated in Fig. 3.

Figure 3: An illustration of probability map generation. It starts by filtering out the range bin through 2D CFAR on the range-Doppler heatmaps. Angle information is then extracted from the selected range, producing range-azimuth and range-elevation vectors. These vectors are normalized, transposed, and multiplied to create probability maps, indicating the likelihood of the target appearing at a specific position. The probability map serves as a guide for the positional encoding method.

The normalization of these vectors is performed to ensure uniformity in the data. Following normalization, the range-azimuth and range-elevation vectors at identical ranges are transposed and multiplied to generate range-azimuth-elevation probability maps:

$$P_{\mathrm{rae}}(r,\theta,\phi)=\hat{V}_{\mathrm{ra}}(r,\theta)^{T}\cdot\hat{V}_{\mathrm{re}}(r,\phi), \tag{4}$$

where $P_{\mathrm{rae}}(r,\theta,\phi)$ represents the probability map indicating the likelihood of detecting a target at range $r$, azimuth $\theta$, and elevation $\phi$. $\hat{V}_{\mathrm{ra}}(r,\theta)$ represents the normalized range-azimuth vector, and $\hat{V}_{\mathrm{re}}(r,\phi)$ denotes the normalized range-elevation vector.
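Eq. (4) amounts to an outer product per range bin, which the following sketch illustrates. The sum-to-one normalization, the small `eps` guard against division by zero, and the array sizes are assumptions for the example; the form of normalization is not specified in the text.

```python
import numpy as np

def probability_maps(v_ra, v_re, eps=1e-12):
    """Range-azimuth-elevation maps from per-range angle vectors.

    v_ra: (R, A) non-negative range-azimuth intensities.
    v_re: (R, E) non-negative range-elevation intensities.
    Returns an (R, A, E) array whose r-th slice is the outer product of
    the normalized azimuth and elevation vectors at range r.
    """
    # Assumed normalization: each range bin sums to one along the angle axis.
    ra = v_ra / (v_ra.sum(axis=1, keepdims=True) + eps)
    re = v_re / (v_re.sum(axis=1, keepdims=True) + eps)
    return np.einsum("ra,re->rae", ra, re)

rng = np.random.default_rng(0)
v_ra = rng.random((4, 8))  # 4 range bins, 8 azimuth bins (hypothetical sizes)
v_re = rng.random((4, 6))  # 4 range bins, 6 elevation bins

p = probability_maps(v_ra, v_re)
```

With this normalization each azimuth-elevation slice sums to approximately one, so a slice can be read as a discrete distribution over angular positions at that range.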

These probability maps not only indicate the probability of detected elements but also contain significant geometric information, suggesting further exploitable position data. In advanced deep learning methodologies, the Transformer [33] enhances spatial perception through a sine coding formula for positional encoding. DETR [34] incorporates the Transformer model into target recognition, applying sine positional encoding to 2D images.

In our project, for each range $r$, the azimuth-elevation probability map can be treated as a two-dimensional representation. Therefore, sine positional encoding is introduced to extract positional information from the maps. The sine encoding formula is as follows:

$$PE^{r}_{(\mathrm{pos}_{\theta},\,2i)}=\sin\left(\mathrm{pos}_{\theta}/10000^{\frac{2i}{d}}\right), \tag{5}$$
$$PE^{r}_{(\mathrm{pos}_{\theta},\,2i+1)}=\cos\left(\mathrm{pos}_{\theta}/10000^{\frac{2i}{d}}\right), \tag{6}$$
$$PE^{r}_{(\mathrm{pos}_{\phi},\,2i)}=\sin\left(\mathrm{pos}_{\phi}/10000^{\frac{2i}{d}}\right), \tag{7}$$
$$PE^{r}_{(\mathrm{pos}_{\phi},\,2i+1)}=\cos\left(\mathrm{pos}_{\phi}/10000^{\frac{2i}{d}}\right), \tag{8}$$

where $(\mathrm{pos}_{\theta},\mathrm{pos}_{\phi})$ represents the position being computed and $d$ represents the dimension of the positional encoding vector. This formula intricately encodes positional information by encoding each coordinate, $\mathrm{pos}_{\theta}$ and $\mathrm{pos}_{\phi}$, into a unique 32-dimensional vector through the application of sine and cosine functions. This approach captures the essence of spatial variations across the probability maps.
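Eqs. (5) to (8) can be illustrated with the short sketch below. The value $d=32$ follows the text; using integer bin indices as positions and concatenating the azimuth and elevation encodings along the channel axis are assumptions borrowed from common 2D sine-encoding practice, not details confirmed here.

```python
import numpy as np

def sine_encoding(positions, d=32, temperature=10000.0):
    """Encode 1D positions into d-dimensional sine/cosine vectors (d even)."""
    positions = np.asarray(positions, dtype=float)[:, None]  # (L, 1)
    i = np.arange(d // 2)[None, :]                           # (1, d/2)
    angles = positions / temperature ** (2 * i / d)          # (L, d/2)
    pe = np.empty((positions.shape[0], d))
    pe[:, 0::2] = np.sin(angles)  # channels 2i
    pe[:, 1::2] = np.cos(angles)  # channels 2i + 1
    return pe

def encode_angle_grid(n_azimuth, n_elevation, d=32):
    """Encoding of shape (n_azimuth, n_elevation, 2 * d) for every grid cell."""
    pe_theta = sine_encoding(np.arange(n_azimuth), d)    # (A, d)
    pe_phi = sine_encoding(np.arange(n_elevation), d)    # (E, d)
    grid_theta = np.broadcast_to(pe_theta[:, None, :], (n_azimuth, n_elevation, d))
    grid_phi = np.broadcast_to(pe_phi[None, :, :], (n_azimuth, n_elevation, d))
    # Assumed combination: concatenate the two axis encodings per cell.
    return np.concatenate([grid_theta, grid_phi], axis=-1)

pe = sine_encoding([0, 1], d=32)
grid = encode_angle_grid(8, 6)  # hypothetical 8 x 6 azimuth-elevation grid
```

Because the wavelength grows geometrically with the channel index $i$, low channels distinguish neighbouring bins while high channels vary slowly across the whole map.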

By integrating positional encoding with the original probability features, the model is endowed with an enhanced capacity to recognize and interpret spatial relationships, significantly improving its performance on subsequent tasks. The features are then fed into the subsequent feature extraction module, which aligns with the structure of the other branch.

III-C Multi-format Feature Fusion and Estimation Head

The features obtained from both branches are based on single-frame radar data. During human body movement, both previous and subsequent frames can be used as a reference for the current frame’s action information. To better utilize temporal information, the multi-frame information is jointly processed, considering the inherent temporal continuity of human motion. Data from multiple frames before and after the target frame are integrated, thus harnessing richer temporal features.

To integrate multi-format spatio-temporal features, multiple 3D convolutional layers are used to aggregate information. These 3D convolutional blocks help extract and consolidate features across both spatial and temporal dimensions. For each spatial scale, a 3D convolution block aggregates the residual temporal information, yielding three-layer encoded features. In addressing the challenge of integrating multi-scale features from distinct processing branches, the output dimensions of encoded features across layers are standardized. This uniformity facilitates the direct summation of positional and probability features with features from the other branch. Unlike the conventional concatenation method, this direct addition method preserves spatial coherency, particularly for positional features. The fused features are obtained as follows:

$$F_{f}^{i}=F_{1}^{i}+F_{2}^{i}\quad(i=1,2,3), \tag{9}$$

where $i$ indexes the layers after 3D convolution, $F_{f}$ represents the fused feature of each layer, and $F_{1}$ and $F_{2}$ represent the features from the two branches.
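As a minimal sketch of Eq. (9), the snippet below sums equally shaped features layer by layer. The nested-list representation, the helper names `add_features` and `fuse_layers`, and the toy values are illustrative assumptions; the actual model presumably operates on tensors, and which branch plays the role of $F_{1}$ is not fixed here.

```python
def add_features(a, b):
    """Element-wise sum of two equally shaped nested lists of numbers."""
    a_is_num = isinstance(a, (int, float))
    b_is_num = isinstance(b, (int, float))
    if a_is_num or b_is_num:
        if not (a_is_num and b_is_num):
            raise ValueError("branch features must share the same shape")
        return a + b
    if len(a) != len(b):
        raise ValueError("branch features must share the same shape")
    return [add_features(x, y) for x, y in zip(a, b)]


def fuse_layers(branch1, branch2):
    """Apply F_f^i = F_1^i + F_2^i to every encoded layer i."""
    if len(branch1) != len(branch2):
        raise ValueError("both branches must provide the same number of layers")
    return [add_features(f1, f2) for f1, f2 in zip(branch1, branch2)]


# Toy stand-ins for three encoded layers per branch (hypothetical values).
branch_a = [[[1.0, 2.0], [3.0, 4.0]], [[0.5, 0.5]], [[2.0]]]
branch_b = [[[0.5, 0.5], [0.5, 0.5]], [[1.0, -1.0]], [[-2.0]]]
fused = fuse_layers(branch_a, branch_b)
```

Because addition requires identical shapes, the check in `add_features` mirrors the need to standardize output dimensions across the two branches; concatenation would instead grow the channel dimension.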

Following the pioneering work [16], the summed layer features pass through a cross- and self-attention decoding module, designed to enhance human pose detection by leveraging contextual information across and within the frames captured by the horizontal and vertical radars. The core consists of multiple decoder layers, each handling features at a different scale and complexity. At each scale, the network applies basic 2D blocks for initial feature transformation and then flattens the result to obtain an attention map in each attention system. As shown in the lower right corner of Fig. 2, the cross-attention mechanism combines features of the same layer from both radars: the features from the horizontal radar serve as keys and values, and the features from the vertical radar serve as queries. The roles of key, query, and value are then switched to perform the same operation. Since the horizontal and vertical radars capture different aspects, i.e., azimuth and elevation data, a residual connection is crucial to avoid directly correlating these features and to ensure training stability. In the self-attention system, all keys, queries, and values come from the same radar, which enhances the internal structure of the individual features; no skip connection is used when generating the self-attended features.
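The query/key/value roles described above can be sketched with plain scaled dot-product attention. This is a simplified illustration under stated assumptions: it omits the learned projections, multi-head structure, and 2D blocks of the actual decoder, and the point where the residual is added (onto the query features) is an assumption. All function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of feature vectors."""
    dim = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return outputs


def cross_attend(query_feats, context_feats):
    """Cross-attention with a residual connection onto the query features."""
    attended = attention(query_feats, context_feats, context_feats)
    return [[q + a for q, a in zip(q_row, a_row)]
            for q_row, a_row in zip(query_feats, attended)]


def self_attend(feats):
    """Self-attention: queries, keys, and values share one radar; no skip connection."""
    return attention(feats, feats, feats)


# Flattened toy features (tokens x channels) for each radar, hypothetical values.
vertical = [[1.0, 0.0], [0.0, 1.0]]
horizontal = [[2.0, 2.0], [2.0, 2.0]]
vertical_out = cross_attend(vertical, horizontal)    # vertical queries, horizontal keys/values
horizontal_out = cross_attend(horizontal, vertical)  # roles switched
```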

The predicted keypoint heatmaps are obtained from the multi-layer features through the attention decoding module. Knowing the location of certain keypoints can help in predicting the location of other keypoints. Therefore, similar to [16], a pose refinement module based on a Graph Convolutional Network (GCN) is used to refine the keypoint heatmaps [35]. A 3-layer GCN performs feature propagation and inference, refining the keypoint predictions using the mutual position information of the physical connections between keypoints. The newly generated heatmap is used as a common reference to generate the final keypoint locations. There is a mismatch between the coordinate systems for radar signals and keypoint heatmaps, making it impossible to directly use keypoint coordinates to locate radar features.
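To illustrate how stacked graph convolutions let one keypoint inform its neighbours, the sketch below uses the widely known propagation rule with a symmetrically normalized adjacency matrix and self-loops. This is a generic GCN layer chosen for illustration, not necessarily the exact refinement module of [16] or [35]; the 3-node chain and the identity weights are hypothetical.

```python
import math


def normalized_adjacency(num_nodes, edges):
    """Return D^(-1/2) (A + I) D^(-1/2) for an undirected graph."""
    adj = [[1.0 if i == j else 0.0 for j in range(num_nodes)] for i in range(num_nodes)]
    for i, j in edges:
        adj[i][j] = 1.0
        adj[j][i] = 1.0
    degree = [sum(row) for row in adj]
    return [[adj[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(num_nodes)]
            for i in range(num_nodes)]


def matmul(x, y):
    """Plain matrix product of two nested lists."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y))) for j in range(len(y[0]))]
            for i in range(len(x))]


def gcn_layer(a_hat, feats, weight, relu=True):
    """One graph convolution: activation(A_hat @ feats @ weight)."""
    out = matmul(matmul(a_hat, feats), weight)
    if relu:
        out = [[max(0.0, v) for v in row] for row in out]
    return out


def refine(a_hat, feats, weights):
    """Stack several layers; the last layer is left linear."""
    for idx, weight in enumerate(weights):
        feats = gcn_layer(a_hat, feats, weight, relu=(idx < len(weights) - 1))
    return feats


# Hypothetical 3-keypoint chain, e.g. shoulder - elbow - wrist.
a_hat = normalized_adjacency(3, [(0, 1), (1, 2)])
identity = [[1.0]]
refined = refine(a_hat, [[1.0], [0.0], [0.0]], [identity, identity, identity])
```

With a single layer, evidence at node 0 reaches only node 1; a second layer carries it to node 2, which is one way to see why a few stacked layers suffice to cover short skeletal chains.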

To facilitate the end-to-end training of the model, a strategy of imposing a pixel-wise binary cross-entropy loss on both the initial keypoint heatmaps and the GCN-refined keypoint heatmaps is employed. The objective function is formulated as follows:

$$L = L_{\mathrm{bce}}(\hat{H}, G) + L_{\mathrm{bce}}(H, G), \tag{10}$$

where $H$ and $\hat{H}$ represent the initial keypoint heatmaps and the refined keypoint heatmaps, respectively, and $G$ represents the ground-truth keypoint heatmaps generated from a Gaussian distribution. To illustrate, the binary cross-entropy loss $L_{\mathrm{bce}}(H, G)$ is defined by the equation:

$$L_{\mathrm{bce}}(H, G) = -\sum_{c,i,j} \left[ G_{c,i,j} \log(H_{c,i,j}) + (1 - G_{c,i,j}) \log(1 - H_{c,i,j}) \right], \tag{11}$$

where $c$, $i$, and $j$ index the channel, width, and height of the predicted joint heatmaps $H$ and the ground-truth heatmaps $G$, the latter following a Gaussian distribution.
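Eqs. (10) and (11) can be sketched in a few lines. The code below is a minimal illustration, assuming heatmaps stored as nested lists indexed `[channel][row][column]`; the clamping constant `eps`, the Gaussian width `sigma`, and all function names are assumptions added for numerical safety and readability.

```python
import math


def gaussian_heatmap(width, height, cx, cy, sigma):
    """Ground-truth style heatmap: a Gaussian bump centred on a keypoint."""
    return [[math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
             for x in range(width)]
            for y in range(height)]


def bce_loss(pred, target, eps=1e-7):
    """Pixel-wise binary cross-entropy summed over channels and pixels, Eq. (11)."""
    total = 0.0
    for pred_channel, target_channel in zip(pred, target):
        for pred_row, target_row in zip(pred_channel, target_channel):
            for h, g in zip(pred_row, target_row):
                h = min(max(h, eps), 1.0 - eps)  # avoid log(0)
                total -= g * math.log(h) + (1.0 - g) * math.log(1.0 - h)
    return total


def total_loss(initial, refined, target):
    """Eq. (10): supervise both the initial and the GCN-refined heatmaps."""
    return bce_loss(refined, target) + bce_loss(initial, target)


# One-channel toy example with a keypoint at the centre of a 5 x 5 map.
target = [gaussian_heatmap(5, 5, cx=2, cy=2, sigma=1.0)]
uniform = [[[0.5] * 5 for _ in range(5)]]
loss = total_loss(initial=uniform, refined=target, target=target)
```

Because the targets are soft Gaussian values rather than hard 0/1 labels, the loss does not reach zero even for a perfect prediction; its minimum equals the entropy of the target map.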

IV Experiments and Results

Figure 4: Visual comparison of state-of-the-art approaches and our proposed model.
Figure 5: Visualisation of predicted keypoints in cases of inaccurate prediction.

IV-A Dataset and Evaluation

The HuPR dataset was chosen for our experiments [16]. This dataset acquires data from two identical radars. One radar sensor is rotated 90° in the antenna plane with respect to the other, so that one radar focuses on the horizontal plane and the other on the vertical plane. Unlike some radar datasets that provide processed data in formats such as point clouds [36, 37, 38] or heatmaps [39], which discard part of the raw radar information, HuPR provides raw radar ADC signals. This leaves more freedom for our data processing and feature extraction methods.

The HuPR dataset includes data from 235 acquisition sequences in an indoor environment. Each sequence is one minute long and contains RGB camera frames, horizontal radar frames, and vertical radar frames. The two radars and the camera are synchronized and configured to capture 10 frames per second (FPS), so each sequence has a set of 600 camera-radar-radar frames. In each sequence, one person performs a static action, a standing hand wave, and a walking hand wave. These poses imitate the basic movements of a person in a medical care scenario, such as arm and walking rehabilitation therapy. The dataset also provides ground truth generated from the RGB frames by the human pose estimation network HRNet [40].

In alignment with established norms in the field, our evaluation framework for 2D keypoints employs average precision (AP) metrics calculated across various levels of object keypoint similarity (OKS) [41]. The OKS is calculated using the following formula:

$$\mathrm{OKS}_{p} = \frac{\sum_{i} \exp\left(-d_{p_{i}}^{2} / (2 S_{p}^{2} \sigma_{i}^{2})\right) \delta(v_{p_{i}} = 1)}{\sum_{i} \delta(v_{p_{i}} = 1)}, \tag{12}$$

where $p_{i}$ denotes the ID of the targeted keypoint. $d_{p_{i}}$ represents the Euclidean distance between the predicted keypoint and the ground truth keypoint for the $i$th keypoint on person $p$. $S_{p}$ is the scale of the person based on its area, calculated from the ground truth box of the person. $\sigma_{i}$ is the normalization factor of the corresponding skeletal point, reflecting the influence of the current skeletal point on the whole. $\delta(v_{p_{i}} = 1)$ indicates that the predicted keypoint $p_{i}$ is visible within the observation range.
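A direct transcription of Eq. (12) is sketched below. The function name `oks`, the argument layout, and the example scale and per-keypoint constants are illustrative assumptions; standard evaluation toolkits fix these constants per keypoint type.

```python
import math


def oks(pred, gt, visible, scale, sigmas):
    """Object keypoint similarity for one person, following Eq. (12).

    pred, gt : lists of (x, y) keypoint coordinates
    visible  : list of flags, 1 if the keypoint is visible
    scale    : person scale S_p
    sigmas   : per-keypoint normalization factors sigma_i
    """
    numerator = 0.0
    count = 0
    for (px, py), (gx, gy), flag, sigma in zip(pred, gt, visible, sigmas):
        if flag != 1:
            continue  # delta(v = 1) drops invisible keypoints
        dist_sq = (px - gx) ** 2 + (py - gy) ** 2
        numerator += math.exp(-dist_sq / (2 * scale ** 2 * sigma ** 2))
        count += 1
    return numerator / count if count else 0.0


# Two keypoints: one predicted exactly, one displaced by (1, 1).
example = oks(
    pred=[(0.0, 0.0), (1.0, 1.0)],
    gt=[(0.0, 0.0), (0.0, 0.0)],
    visible=[1, 1],
    scale=1.0,
    sigmas=[1.0, 1.0],
)
```

In this toy case the displaced keypoint has squared distance 2, so its term is exp(-1) and the result is the mean of 1 and exp(-1).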

This allows the evaluation of the model’s accuracy in detecting 2D keypoints across 14 critical human body positions, including the head, neck, shoulders, elbows, wrists, hips, knees, and ankles. To offer a nuanced understanding of the model’s performance, three distinct AP metrics are used: $AP^{50}$, $AP^{75}$, and $AP$. These metrics represent varying degrees of OKS stringency, with $AP^{50}$ and $AP^{75}$ corresponding to a lenient OKS threshold of 0.5 and a stringent threshold of 0.75, respectively. The metric denoted as $AP$ is the mean average precision over 10 OKS thresholds, from 0.5 to 0.95 in steps of 0.05, providing a comprehensive overview of the model’s performance across a broad spectrum of precision requirements.
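The threshold sweep behind $AP$ can be sketched as follows. This is a deliberately simplified view that assumes one prediction per ground-truth person, so that the precision at a threshold reduces to the fraction of samples whose OKS reaches it; full COCO-style evaluation additionally ranks detections by confidence and integrates a precision-recall curve. The function names and the OKS values are hypothetical.

```python
def oks_thresholds():
    """The 10 OKS thresholds 0.50, 0.55, ..., 0.95."""
    return [round(0.5 + 0.05 * k, 2) for k in range(10)]


def precision_at(oks_values, threshold):
    """Fraction of samples whose OKS reaches the threshold (simplified AP)."""
    if not oks_values:
        raise ValueError("at least one OKS value is required")
    return sum(1 for v in oks_values if v >= threshold) / len(oks_values)


def mean_ap(oks_values):
    """Average the simplified precision over all 10 thresholds."""
    thresholds = oks_thresholds()
    return sum(precision_at(oks_values, t) for t in thresholds) / len(thresholds)


# Hypothetical per-sample OKS scores.
scores = [0.40, 0.62, 0.82, 0.97]
ap50 = precision_at(scores, 0.50)
ap75 = precision_at(scores, 0.75)
ap = mean_ap(scores)
```

The example shows why $AP^{75}$ is never larger than $AP^{50}$ under this definition: raising the threshold can only remove samples from the count.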

IV-B Implementation Details

We use 193 sequences from HuPR for training, 21 for validation, and 21 for testing. The split follows that of the HuPR network, which serves as our baseline, to allow a fair comparison of results.
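As a quick sanity check of the dataset arithmetic, the snippet below derives frame counts from the figures stated in the text (10 FPS, one-minute sequences, a 193/21/21 split). The per-split frame totals are derived values rather than numbers reported by the dataset authors, and the variable names are illustrative.

```python
FPS = 10                 # frames per second, from the dataset description
SEQUENCE_SECONDS = 60    # each sequence lasts one minute
SPLIT = {"train": 193, "val": 21, "test": 21}  # sequences per split

frames_per_sequence = FPS * SEQUENCE_SECONDS
total_sequences = sum(SPLIT.values())
frames_per_split = {name: count * frames_per_sequence for name, count in SPLIT.items()}

for name, frames in frames_per_split.items():
    print(f"{name}: {SPLIT[name]} sequences, {frames} synchronized frame sets")
```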

We implement ProbRadarM3F using Python and PyTorch. The network is trained on a single NVIDIA V100 GPU using a step decay learning rate strategy, with an initial learning rate of 0.00005 that is multiplied by 0.999 every 2000 iterations. Adam is used as the optimizer, and the batch size is set to 24. The other parameters in the experiment are set as follows. In CFAR processing, the numbers of guard cells and reference cells are set to 16 and 8, respectively. The depth of positional encoding for the probability map is set to 32. The number of input frames is 8 for joint processing of multi-frame data. The convolutional part of the network is built from ResNet basic blocks [42]. The kernel size for 3D convolution is $3 \times 3 \times 3$, and the ReLU activation function is used. These parameters were selected based on the best performance in our experiments.
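The step decay schedule described above can be written as a closed-form function of the iteration count. This is a minimal sketch assuming the decay is applied exactly at multiples of 2000 iterations; the function name and argument names are illustrative. In PyTorch, a comparable behaviour is available through `torch.optim.lr_scheduler.StepLR` with `step_size=2000` and `gamma=0.999` when the scheduler is stepped once per iteration, though whether the authors used that class is not stated.

```python
def step_decay_lr(iteration, base_lr=5e-5, gamma=0.999, step_size=2000):
    """Learning rate after `iteration` training steps under step decay."""
    if iteration < 0:
        raise ValueError("iteration must be non-negative")
    return base_lr * gamma ** (iteration // step_size)


# Learning rate at a few checkpoints of training.
for it in (0, 1999, 2000, 200000):
    print(it, step_decay_lr(it))
```

With a factor this close to 1, the decay is gentle: even after 100 decay steps (200,000 iterations) the rate is still about 90% of its initial value.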

IV-C Results and Analysis

Experiments on the HuPR dataset were performed to evaluate the effectiveness of the proposed strategies. Table II compares the average precision of our proposed method with RF-Pose [27] and the baseline method [16]. ProbRadarM3F achieves a higher AP than RF-Pose at every precision level on the same dataset. Compared to the state-of-the-art baseline, our method achieves higher scores on every metric, especially in $AP$ and $AP^{75}$, with gains of 6.5 and 12.9 percentage points, respectively. Because $AP^{75}$ imposes a stringent OKS constraint, this increase indicates our method’s ability to predict human keypoints accurately.

The specific precision of each keypoint is detailed in Table I. Our method has improved precision in every joint keypoint compared to the state-of-the-art. The best performance is observed in the hip joints, where $AP$ reaches 92.9%. The greatest improvement is seen in the shoulder joints, with an $AP$ increase of 8.2%. Reflected signals from the torso joints are strong and contain significant information with little noise, resulting in more accurate predictions. However, estimating arm and wrist joints remains challenging due to richer and finer arm movements and the radar’s limitations in capturing information from small reflective surfaces away from the torso. Despite this, our work shows accuracy gains of 6.7% in the elbow and 5.9% in the wrist.

Fig. 4 compares the performance of our model with the baseline under different types of actions. GT represents the ground truth keypoints generated by HRNet [40], which closely follow the actions in the RGB frames, while HuPR and Ours are the predictions of the baseline and our ProbRadarM3F for the same frame, respectively. As shown, the predicted keypoints mostly align with the ground truth. To present realistic results, some frames where the prediction is not accurate enough are displayed in Fig. 5. It can be seen that our model loses accuracy when the target moves quickly. Although it predicts the torso and head locations well, there is still room for improvement in the wrist and foot joints. This observation highlights a critical area for future work, particularly in enhancing the model’s sensitivity to fast-moving extremities.

Therefore, analyzing both the overall and individual results, our work effectively improves the precision of mmWave radar-based human skeletal pose estimation. This demonstrates that the probability map-guided positional encoding method effectively mines information from radar signals that might have been previously overlooked.

TABLE I: Comparison of per-keypoint average precision ($AP$, %)

| Model | Head | Neck | Shoulder | Elbow | Wrist | Hip | Knee | Ankle |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| RF-Pose [27] | 61.0 | 65.3 | 52.5 | 16.1 | 6.3 | 73.5 | 65.7 | 62.0 |
| HuPR [16] | 77.5 | 81.9 | 70.3 | 45.5 | 22.3 | 88.1 | 82.2 | 73.1 |
| Ours | 81.1 | 83.6 | 78.1 | 52.2 | 28.2 | 92.9 | 88.1 | 75.8 |

TABLE II: Comparison of State-of-the-art Approaches and Our Proposed Model

| Model | $AP$ | $AP^{50}$ | $AP^{75}$ |
| --- | --- | --- | --- |
| RF-Pose [27] | 41.4 | 82.9 | 37.0 |
| HuPR [16] | 63.4 | 97.0 | 74.0 |
| Ours | 69.9 | 98.5 | 86.9 |

IV-D Ablation Study

A series of ablation studies were conducted to evaluate the individual contributions of various components within our proposed framework, ProbRadarM3F. These studies were performed on the HuPR dataset under consistent training, validation, and testing set settings.

  • To determine the impact of the ProbPE branch, the network was operated with only the FFT branch, excluding the integration of features from the positional encoding guided by probability maps. The experimental results are shown in Table III. In the absence of the ProbPE branch, the model achieved lower precision in keypoint prediction compared to the complete ProbRadarM3F model, particularly under stringent OKS constraints. Specifically, excluding the ProbPE branch resulted in a decrease of 3.8 percentage points in $AP$ and a more pronounced reduction of 7.7 percentage points in $AP^{75}$. The importance of the ProbPE branch is evident, highlighting the significant role of position information in the model’s prediction performance.

  • In addition to isolating the effect of the ProbPE branch as a whole, the influence of probability maps in the ProbPE branch was specifically assessed. The complete model conducts positional encoding guided by probability maps, which are pivotal in generating refined positional features. To quantify the contribution of these probability maps, an ablation experiment was conducted by excluding the probability map generation step and directly applying positional encoding to the data after DOA. As shown in Table III, precision improved compared to using the FFT branch alone. However, the absence of probability maps resulted in a noticeable decrease in performance, with $AP$ dropping by 2.3 percentage points compared to the complete model. The results demonstrate the importance of probability maps: without their guidance, positional encoding becomes less effective. Consequently, the probability maps are not merely auxiliary but integral to ensuring that the positional encoding captures and utilizes the spatial context inherent in the radar signals. The probability maps enhance the model’s ability to determine the probability of target presence at specific azimuth and elevation coordinates, which is crucial for accurately encoding positional information.

TABLE III: Ablation Study

| Model | $AP$ | $AP^{50}$ | $AP^{75}$ |
| --- | --- | --- | --- |
| Ours (without ProbPE Branch) | 66.1 | 97.1 | 79.2 |
| Ours (without Probability Maps Generation) | 67.6 | 98.4 | 84.2 |
| Ours (Complete ProbRadarM3F Model) | 69.9 | 98.5 | 86.9 |
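
The ablation differences quoted in the text can be reproduced from the Table III values with a few lines of arithmetic. The dictionary keys and the helper name `drop_vs_complete` are illustrative; all numbers are copied from the table, and differences are rounded to one decimal place to match its precision.

```python
# (AP, AP50, AP75) values copied from Table III.
ABLATION = {
    "without_probpe_branch": (66.1, 97.1, 79.2),
    "without_probability_maps": (67.6, 98.4, 84.2),
    "complete": (69.9, 98.5, 86.9),
}
METRICS = ("AP", "AP50", "AP75")


def drop_vs_complete(variant):
    """Percentage-point drop of an ablated variant relative to the complete model."""
    full = ABLATION["complete"]
    ablated = ABLATION[variant]
    return {m: round(f - a, 1) for m, f, a in zip(METRICS, full, ablated)}


probpe_drop = drop_vs_complete("without_probpe_branch")
prob_map_drop = drop_vs_complete("without_probability_maps")

# Gain from plain positional encoding alone, relative to the FFT-only variant.
pe_only_gain = round(
    ABLATION["without_probability_maps"][0] - ABLATION["without_probpe_branch"][0], 1
)
```

Read this way, the 3.8-point $AP$ gap of the FFT-only variant splits into 1.5 points recovered by positional encoding alone and a further 2.3 points recovered by the probability map guidance.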

V Conclusion

In our paper, we have explored the use of mmWave radar for estimating human skeletal poses. This is an important component of smart sensor systems used in medical care IoT applications, particularly in addressing privacy concerns. We introduced ProbRadarM3F, a new model that uses millimeter-wave radar for human skeletal pose estimation. This model significantly improves the extraction and use of hidden information from radar signals. We also introduced the ProbPE branch, which generates probability maps based on estimating the likelihood of a specific position. These probability maps are then used to effectively extract position features using a positional encoding method. Additionally, we combined positional features with features obtained from the FFT branch to enhance the human body keypoint features produced by the model. Our experiments on the HuPR dataset demonstrated that ProbRadarM3F outperforms existing methods, indicating the effectiveness of our approach. Our results highlight that there is valuable information, such as positional data, present in the radar heatmap or even in the radar signal itself, which should not be considered redundant.

References

  • [1] T. Perumal, E. Ramanujam, S. Suman, A. Sharma, and H. Singhal, “Internet of things centric-based multiactivity recognition in smart home environment,” IEEE Internet of Things Journal, vol. 10, no. 2, pp. 1724–1732, 2022.
  • [2] M. A. Akkaş, R. Sokullu, and H. E. Çetin, “Healthcare and patient monitoring using iot,” Internet of Things, vol. 11, p. 100173, 2020.
  • [3] C. Zheng, W. Wu, C. Chen, T. Yang, S. Zhu, J. Shen, N. Kehtarnavaz, and M. Shah, “Deep learning-based human pose estimation: A survey,” ACM Computing Surveys, vol. 56, no. 1, pp. 1–37, 2023.
  • [4] X. Feng, K. A. Nguyen, and Z. Luo, “A survey of deep learning approaches for wifi-based indoor positioning,” Journal of Information and Telecommunication, vol. 6, no. 2, pp. 163–216, 2022.
  • [5] Z. Wang, X. Li, F. Liu, M. Ma, X. Feng, and Y. Guo, “A survey on human behavior recognition applications using frequency modulated continuous wave radar,” in Proceedings of the 2022 10th International Conference on Information Technology: IoT and Smart City, 2022, pp. 133–139.
  • [6] Z. Cao, G. Mei, X. Guo, and G. Wang, “Virteach: mmwave radar point cloud based pose estimation with virtual data as a teacher,” IEEE Internet of Things Journal, 2024.
  • [7] J. Yang, Y. Zhou, H. Huang, H. Zou, and L. Xie, “Metafi: Device-free pose estimation via commodity wifi for metaverse avatar simulation,” in 2022 IEEE 8th World Forum on Internet of Things (WF-IoT). IEEE, 2022, pp. 1–6.
  • [8] W. Xiong, J. Liu, Y. Xia, T. Huang, B. Zhu, and W. Xiang, “Contrastive learning for automotive mmWave radar detection points based instance segmentation,” in Proceedings of the IEEE 25th International Conference on Intelligent Transportation Systems (ITSC), 2022, pp. 1255–1261.
  • [9] J. Liu, W. Xiong, L. Bai, Y. Xia, T. Huang, W. Ouyang, and B. Zhu, “Deep instance segmentation with automotive radar detection points,” IEEE Transactions on Intelligent Vehicles, vol. 8, no. 1, pp. 84–94, January 2023.
  • [10] J. Liu, Q. Zhao, W. Xiong, T. Huang, Q.-L. Han, and B. Zhu, “SMURF: Spatial multi-representation fusion for 3D object detection with 4D imaging radar,” IEEE Transactions on Intelligent Vehicles, vol. 9, no. 1, pp. 799–812, October 2024.
  • [11] W. Xiong, J. Liu, T. Huang, Q.-L. Han, Y. Xia, and B. Zhu, “LXL: LiDAR excluded lean 3D object detection with 4D imaging radar and camera fusion,” IEEE Transactions on Intelligent Vehicles, vol. 9, no. 1, pp. 79–92, October 2024.
  • [12] Y. Yang, J. Liu, T. Huang, Q.-L. Han, G. Ma, and B. Zhu, “RaLiBEV: Radar and LiDAR BEV fusion learning for anchor box free object detection systems,” 2022, arXiv:2211.06108.
  • [13] J. Liu, G. Ding, J. Xia, Y. Sun, T. Huang, L. Xie, and B. Zhu, “Which framework is suitable for online 3D multi-object tracking for autonomous driving with automotive 4D imaging radar?” in Proceedings of the IEEE 35th Intelligent Vehicles Symposium (IV), 2024, arXiv:2309.06036.
  • [14] T. Huang, J. Liu, X. Zhou, D. C. Nguyen, M. R. Azghadi, Y. Xia, Q.-L. Han, and S. Sun, “V2x cooperative perception for autonomous driving: Recent advances and challenges,” 2023, arXiv:2310.03525.
  • [15] C. Iovescu and S. Rao, “The fundamentals of millimeter wave sensors,” Texas Instruments, pp. 1–8, 2017.
  • [16] S.-P. Lee, N. P. Kini, W.-H. Peng, C.-W. Ma, and J.-N. Hwang, “Hupr: A benchmark for human pose estimation using millimeter wave radar,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2023, pp. 5715–5724.
  • [17] B. van Berlo, A. Elkelany, T. Ozcelebi, and N. Meratnia, “Millimeter wave sensing: A review of application pipelines and building blocks,” IEEE Sensors Journal, vol. 21, no. 9, pp. 10 332–10 368, 2021.
  • [18] Z. Zheng, J. Pan, Z. Ni, C. Shi, S. Ye, and G. Fang, “Human posture reconstruction for through-the-wall radar imaging using convolutional neural networks,” IEEE Geoscience and Remote Sensing Letters, vol. 19, pp. 1–5, 2021.
  • [19] J. Zhang, R. Xi, Y. He, Y. Sun, X. Guo, W. Wang, X. Na, Y. Liu, Z. Shi, and T. Gu, “A survey of mmwave-based human sensing: Technology, platforms and applications,” IEEE Communications Surveys & Tutorials, 2023.
  • [20] Z. Li, T. **, D. Guan, and H. Xu, “Metaphys: Contactless physiological sensing of multiple subjects using ris-based 4-d radar,” IEEE Internet of Things Journal, vol. 10, no. 14, pp. 12 616–12 626, 2023.
  • [21] Q. Li, J. Liu, R. Gravina, W. Zang, Y. Li, and G. Fortino, “A uwb-radar-based adaptive method for in-home monitoring of elderly,” IEEE Internet of Things Journal, vol. 11, no. 4, pp. 6241–6252, 2024.
  • [22] H. Wang, J. Chen, D. Zhang, Z. Lu, C. Wu, Y. Hu, Q. Sun, and Y. Chen, “Contactless radar heart rate variability monitoring via deep spatio-temporal modeling,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp. 111–115.
  • [23] M. Pauli, B. Göttel, S. Scherr, A. Bhutani, S. Ayhan, W. Winkler, and T. Zwick, “Miniaturized millimeter-wave radar sensor for high-accuracy applications,” IEEE transactions on microwave theory and techniques, vol. 65, no. 5, pp. 1707–1715, 2017.
  • [24] A. Sengupta, F. **, R. Zhang, and S. Cao, “mm-pose: Real-time human skeletal posture estimation using mmwave radars and cnns,” IEEE Sensors Journal, vol. 20, no. 17, pp. 10 032–10 044, 2020.
  • [25] H. Yi, C. Li, Q. Cao, X. Shen, S. Li, G. Wang, and Y.-W. Tai, “Mmface: A multi-metric regression network for unconstrained face reconstruction,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 7663–7672.
  • [26] Z. Wu, D. Zhang, C. Xie, C. Yu, J. Chen, Y. Hu, and Y. Chen, “Rfmask: A simple baseline for human silhouette segmentation with radio signals,” IEEE Transactions on Multimedia, 2022.
  • [27] M. Zhao, T. Li, M. Abu Alsheikh, Y. Tian, H. Zhao, A. Torralba, and D. Katabi, “Through-wall human pose estimation using radio signals,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 7356–7365.
  • [28] W. Ding, Z. Cao, J. Zhang, R. Chen, X. Guo, and G. Wang, “Radar-based 3d human skeleton estimation by kinematic constrained learning,” IEEE Sensors Journal, vol. 21, no. 20, pp. 23 174–23 184, 2021.
  • [29] H. Kong, X. Xu, J. Yu, Q. Chen, C. Ma, Y. Chen, Y.-C. Chen, and L. Kong, “m3track: mmwave-based multi-user 3d posture tracking,” in Proceedings of the 20th Annual International Conference on Mobile Systems, Applications and Services, 2022, pp. 491–503.
  • [30] Z. Cao, W. Ding, R. Chen, J. Zhang, X. Guo, and G. Wang, “A joint global–local network for human pose estimation with millimeter wave radar,” IEEE Internet of Things Journal, vol. 10, no. 1, pp. 434–446, 2022.
  • [31] Y. Wang, Z. Jiang, Y. Li, J.-N. Hwang, G. Xing, and H. Liu, “Rodnet: A real-time radar object detection network cross-supervised by camera-radar fused object 3d localization,” IEEE Journal of Selected Topics in Signal Processing, vol. 15, no. 4, pp. 954–967, 2021.
  • [32] M. A. Richards, J. Scheer, W. A. Holm, and W. L. Melvin, “Principles of modern radar,” 2010.
  • [33] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
  • [34] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-end object detection with transformers,” in Proceedings of the 16th European Conference of Computer Vision (ECCV), 2020, pp. 213–229.
  • [35] J. Wang, X. Long, Y. Gao, E. Ding, and S. Wen, “Graph-pcnn: Two stage human pose estimation with graph pose refinement,” in Proceedings of the 16th European Conference of Computer Vision (ECCV), 2020, pp. 492–508.
  • [36] S. An and U. Y. Ogras, “Mars: mmwave-based assistive rehabilitation system for smart healthcare,” ACM Transactions on Embedded Computing Systems (TECS), vol. 20, no. 5s, pp. 1–22, 2021.
  • [37] H. Cui, S. Zhong, J. Wu, Z. Shen, N. Dahnoun, and Y. Zhao, “Milipoint: A point cloud dataset for mmwave radar,” Advances in Neural Information Processing Systems, vol. 36, 2024.
  • [38] A. D. Singh, S. S. Sandha, L. Garcia, and M. Srivastava, “Radhar: Human activity recognition from point clouds generated through a millimeter-wave radar,” in Proceedings of the 3rd ACM Workshop on Millimeter-wave Networks and Sensing Systems, 2019, pp. 51–56.
  • [39] C. Xie, D. Zhang, Z. Wu, C. Yu, Y. Hu, and Y. Chen, “Rpm: Rf-based pose machines,” IEEE Transactions on Multimedia, 2023.
  • [40] K. Sun, B. Xiao, D. Liu, and J. Wang, “Deep high-resolution representation learning for human pose estimation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 5693–5703.
  • [41] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in Proceedings of the 13th European Conference of Computer Vision (ECCV), 2014, pp. 740–755.
  • [42] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.