Sensor Data Augmentation from Skeleton Pose Sequences for Improving Human Activity Recognition

Parham Zolfaghari$^{1,2}$, Vitor Fortes Rey$^{2,3}$, Lala Ray$^{2,3}$, Hyun Kim$^{4}$, Sungho Suh$^{2,3}$, Paul Lukowicz$^{2,3}$

P. Zolfaghari and V. Fortes Rey contributed equally to this work. Corresponding author: [email protected]

$^{1}$Department of Computer Science, Saarland University, Saarbrücken, Germany

$^{2}$German Research Center for Artificial Intelligence (DFKI), Kaiserslautern, Germany

$^{3}$Department of Computer Science, RPTU Kaiserslautern-Landau, Germany

$^{4}$Department of Electrical and Information Engineering, Seoul National University of Science and Technology, Korea

Abstract

The proliferation of deep learning has significantly advanced various fields, yet Human Activity Recognition (HAR) has not fully capitalized on these developments, primarily due to the scarcity of labeled datasets. Despite the integration of advanced Inertial Measurement Units (IMUs) in ubiquitous wearable devices like smartwatches and fitness trackers, which offer self-labeled activity data from users, the volume of labeled data remains insufficient compared to domains where deep learning has achieved remarkable success. To address this gap, we propose a novel approach to improve wearable sensor-based HAR by introducing a pose-to-sensor network model that generates sensor data directly from 3D skeleton pose sequences. Our method simultaneously trains the pose-to-sensor network and a human activity classifier, optimizing both data reconstruction and activity recognition. Our contributions include the integration of simultaneous training, direct pose-to-sensor generation, and a comprehensive evaluation on the MM-Fit dataset. Experimental results demonstrate the superiority of our framework with significant performance improvements over baseline methods.

Index Terms:

Human activity recognition, data augmentation, pose estimation, multi-modal learning

I Introduction

Human activity recognition (HAR) has emerged as a cornerstone technology in a myriad of applications, ranging from personal fitness to healthcare and industrial automation. The ubiquity of smart wearable devices, such as watches and fitness trackers, has made continuous monitoring of physical activities not only possible but also prevalent. These devices, equipped with advanced sensors, provide a rich source of data that, when analyzed, can offer personalized health and fitness recommendations, guiding users toward their wellness goals.

Beyond fitness, the implications of precise HAR are profound, extending into elderly care for fall detection [1], patient monitoring in healthcare [2], and even worker safety on manufacturing lines [3]. Accurate activity recognition facilitates the development of intelligent systems that are responsive to human needs, enhancing safety, productivity, and overall quality of life.

Despite its potential, HAR faces significant challenges, chiefly due to the limited availability of labeled datasets. In contrast to computer vision tasks, a significant bottleneck is the scarcity of annotated sensor data required for training accurate and robust recognition models. Manually labeling extensive datasets for diverse human activities is time-consuming, labor-intensive, and expensive. Additionally, deploying sensors on multiple body parts, though essential for capturing comprehensive motion data, exacerbates the complexities and costs associated with data collection and annotation.

To address these challenges, researchers have explored alternative avenues. Data augmentation is a powerful technique to combat the dearth of labeled data: it enhances the diversity and volume of training datasets and thereby improves the performance of machine learning models. In scenarios where data is scarce or expensive to obtain, augmentation has proven to be effective, particularly in image and speech recognition tasks. Recently, several methods for generating sensor data from video sequences have emerged in the HAR community. Existing works [4, 5, 6] estimate 2D or 3D joint positions from videos and infer joint orientations to compute inertial measurement unit (IMU) data. While these methods improved the spatial accuracy of generated sensor data and enhanced the performance of wearable sensor-based HAR, they still struggle with capturing the intricacies of sensor characteristics and may not fully exploit the potential for cross-modality transfer.

In this paper, we propose a novel approach to improve wearable sensor-based HAR performance. We introduce a pose-to-sensor network model that generates IMU sensor data directly from 3D skeleton pose sequences. Unlike existing approaches, the proposed method trains the pose-to-sensor network and the HAR classifier at the same time. The pose-to-sensor network is trained to minimize both the reconstruction loss between the ground-truth and generated sensor data and the classification loss of the human activity classifier on real and generated data. This concurrent training allows the pose-to-sensor network to generate synthetic sensor data that not only closely resembles real sensor data but also enhances the performance of the activity classifier, so that the two components are optimized jointly rather than in isolation. We evaluate the proposed method on two well-known open-access benchmark datasets, MM-Fit [7] and UTD-MHAD [8], which provide multimodal data including skeleton, IMU, and label data, enabling the synthesis of realistic and diverse training samples. Experimental results demonstrate that the proposed framework achieves higher accuracy and macro F1 score than existing methods, including IMUTube [4], Chen et al. [8], and Memmesheimer et al. [9], and than a baseline that trains the pose-to-sensor network model without considering the classifier.

The main contributions of our proposed method are summarized as follows.

• We introduce a novel approach that integrates the training of the pose-to-sensor network with a human activity classifier, promoting a synergistic optimization that improves both data reconstruction and activity recognition.

• By directly generating sensor data from 3D skeleton pose sequences, the proposed method addresses the limitations of existing methods that map pose joints to specific sensor positions, potentially capturing more subtle sensor characteristics for HAR.

• We conduct comprehensive evaluations on the MM-Fit [7] and UTD-MHAD [8] datasets to demonstrate the superiority of the proposed framework. Comparative analyses showcase significant performance improvements over existing methods and baseline methods that solely focus on pose-to-sensor network training without considering the classifier.

The remainder of this paper is organized as follows: Section II reviews related work on improving HAR performance through generated IMU sensor data. Section III describes the methodology, including the data synthesis process and the end-to-end training pipeline. Section IV presents the experimental setup, results, and comparative analysis. Finally, Section V concludes the paper with a summary of our findings and a discussion of potential future work.

II Related Works

Recently, generative models have been utilized to improve the performance of wearable sensor-based HAR by augmenting sensor data [10, 11, 12]. In particular, several notable methods have emerged in the literature to facilitate the generation of virtual IMU data from video sequences. Generative methods, such as [13, 5], have employed machine learning techniques to derive IMU data from videos directly. In addition, trajectory-based approaches, as demonstrated by [4, 14], extract 2D joint positions from videos and estimate 3D joint positions from the 2D pose sequences. They then estimate joint orientations using forward kinematics. The resulting orientations enable the transformation of 3D joint positions into the IMU's frame of reference, facilitating the computation of acceleration and angular velocity.

In response to challenges associated with labeled data acquisition, the concept of virtual IMU data generation has gained traction. Cross-modality transfer approaches, such as those proposed by [13, 5, 4], extract virtual IMU data from 2D RGB videos of human activities. This not only addresses limitations in labeled wearable data collection but also contributes to the construction of personalized HAR systems for individual user needs [15]. Virtual IMU data generation improves the accuracy and robustness of HAR models, promoting broader adoption across diverse domains.

Among the innovative systems, IMUTube [4] stands out as a comprehensive solution for extracting virtual IMU data from 2D RGB videos. Operating as a processing pipeline, IMUTube integrates computer vision, graphics, and machine learning models to convert large-scale video datasets into virtual IMU data suitable for training sensor-based HAR systems. Its adaptive video selection, 3D human motion tracking, and virtual IMU data extraction and calibration components collectively contribute to generating high-quality virtual IMU data. Notably, the system’s versatility has been demonstrated in improving model performance through the integration of real and virtual IMU data. However, existing approaches primarily focus on generating real-like IMU data from pose sequences or video data, neglecting the potential to improve the performance of the activity classifier. These methods lack a simultaneous training approach that optimizes the pose-to-sensor network model to minimize both reconstruction loss and classification loss, limiting their ability to fully leverage the available data for enhanced activity recognition.

In this landscape, the proposed method in this paper introduces a novel approach to sensor data generation from 3D skeleton pose sequences, focusing on the simultaneous training of a pose-to-sensor network model and a human activity classifier. Unlike existing methods that primarily identify corresponding sensor data from specific sensor positions on human body parts, the proposed approach aims to optimize the pose-to-sensor network model by minimizing both reconstruction loss and classification loss. By addressing these limitations and innovatively leveraging 3D skeleton pose sequences, our proposed method offers a promising solution for enhancing the performance and robustness of wearable sensor-based HAR systems. Subsequent sections introduce the details of this methodology and present experimental results, demonstrating its superior performance compared to baseline approaches that neglect classifier considerations.

Figure 1: Schematic overview of the proposed method. This diagram illustrates the training architecture wherein the feature extraction and classification modules are concurrently trained with both real and synthetic IMU accelerometer data. Integral to the pipeline is a regression model that generates synthetic data and facilitates the enhanced training of the feature extraction and classification modules. The overall training process is governed by a compound weighted sum loss, optimizing the synergy between the modules for improved performance.

III Method

This section describes our method to enhance HAR which focuses on the efficient use of available real and simulated sensor data. While most other methods consider sensor data simulation and training a HAR classifier as different, independent steps, ours performs both in an end-to-end fashion.

Initially, we detail the preprocessing steps applied to prepare the data for training, ensuring its suitability for the subsequent learning process. We then introduce the architecture of our end-to-end pipeline, which concurrently leverages synthetic and real IMU data to enhance the training of the HAR model.

III-A Pre-processing

The pre-processing stage is critical for synchronizing and standardizing our data sources. First, we match the sampling rates of the pose data and the sensor data. Because poses are obtained from video, which typically has a frame rate of 30 fps or below, whereas IMUs can reach higher sampling rates such as 100 Hz, we bring both modalities to a fixed 100 samples per second using linear interpolation. Subsequently, we standardize the accelerometer data to a mean of zero and a standard deviation of one per channel, using statistics obtained from the training set. We also normalize the skeleton data to ensure uniformity across the dataset, using the method proposed by Rey et al. [5]. We calculate the Euclidean distance between the neck and mid-hip joints at a single time $t$, denoted as $dist_{t}$. To mitigate the effect of outliers, this distance is computed over a window $w$ of three seconds, and the scale at time $t$ is determined by the median of distances within this window:

$$scale_{t}=\mathrm{median}(dist_{t-w/2},\ldots,dist_{t+w/2}) \tag{1}$$

We then define a function to scale any value $v_{t}$:

$$scale(v_{t},scale_{t})=-1.0+\frac{v_{t}}{scale_{t}}\cdot 2.0 \tag{2}$$

Finally, we compute new values for each joint, taking the mid-hip joint as the reference:

$$NewJoint^{t}=scale(joint^{t}-midhip^{t},scale_{t}) \tag{3}$$
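As an illustration of Eqs. (1) to (3), the skeleton normalization can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions rather than the implementation used in the experiments: the function names, the joint indexing, and the clipping of the window at the sequence boundaries are hypothetical choices that the text does not specify.

```python
import math
from statistics import median


def window_scale(dists, t, w):
    """Eq. (1): median neck-to-mid-hip distance in a window of w frames around t.

    Assumption: the window is clipped at the start and end of the sequence.
    """
    lo = max(0, t - w // 2)
    hi = min(len(dists), t + w // 2 + 1)
    return median(dists[lo:hi])


def scale(v, s):
    """Eq. (2): rescale a value v by the scale s."""
    return -1.0 + (v / s) * 2.0


def normalize_sequence(frames, neck_idx, midhip_idx, w):
    """Eq. (3): centre every joint on the mid-hip and rescale it.

    frames: list of frames, each a list of (x, y, z) joint positions.
    neck_idx, midhip_idx: positions of those joints in a frame (hypothetical).
    w: window length in frames (3 s at 100 Hz would be 300).
    """
    dists = [math.dist(f[neck_idx], f[midhip_idx]) for f in frames]
    normalized = []
    for t, frame in enumerate(frames):
        s = window_scale(dists, t, w)
        midhip = frame[midhip_idx]
        normalized.append(
            [[scale(c - m, s) for c, m in zip(joint, midhip)] for joint in frame]
        )
    return normalized
```

With this definition the mid-hip joint itself is mapped to $-1$ in every coordinate, since Eq. (2) sends a zero offset to $-1.0$.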

The UTD-MHAD dataset is structured differently from MM-Fit: each class is stored in separate files, and the modalities are already synchronized. We therefore simply applied linear interpolation to the 30 Hz skeleton data to match the 50 Hz inertial data of UTD-MHAD. Afterward, the normalization procedure described above is applied to prepare the final data.
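The resampling and standardization steps of this subsection can be sketched as below. This is an illustrative sketch only: the helper names are hypothetical, integer sampling rates are assumed, and the population standard deviation is used because the text does not state which estimator was applied.

```python
import math


def resample_linear(samples, src_hz, dst_hz):
    """Resample one channel from src_hz to dst_hz by linear interpolation.

    Assumes integer rates and a uniformly sampled input sequence.
    """
    if len(samples) < 2:
        return list(samples)
    n_out = ((len(samples) - 1) * dst_hz) // src_hz + 1
    out = []
    for k in range(n_out):
        pos = k * src_hz / dst_hz             # fractional index into the source
        i = min(int(pos), len(samples) - 2)   # left neighbour
        frac = pos - i
        out.append(samples[i] * (1.0 - frac) + samples[i + 1] * frac)
    return out


def fit_standardizer(train_channel):
    """Mean and (population) standard deviation computed on the training set."""
    mean = sum(train_channel) / len(train_channel)
    var = sum((x - mean) ** 2 for x in train_channel) / len(train_channel)
    return mean, math.sqrt(var)


def standardize(channel, mean, std):
    """Apply training-set statistics to any split (train, validation, or test)."""
    return [(x - mean) / std for x in channel]
```

For example, `resample_linear(pose_channel, 30, 100)` would bring a 30 Hz pose channel to 100 Hz as done for MM-Fit, and `resample_linear(pose_channel, 30, 50)` corresponds to the UTD-MHAD case; `pose_channel` is a placeholder name.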

III-B End-to-end pipeline

The cornerstone of our proposed method is an end-to-end pipeline designed to enhance the training efficacy of the feature extractor module, thereby yielding more accurate activity classification. Central to this pipeline is a regression model that employs temporal convolutional network (TCN) blocks [16] as its core. This model takes the 3D positions of a set of joints as input and outputs synthesized IMU accelerometer data for one joint.

The components of our method are illustrated in Fig. 1. Our approach comprises various interconnected networks, all trained concurrently. The first component, a regression model denoted as $R$ , receives pre-processed pose sequences as input and produces simulated sensor data. Formally, applying $R$ to the pre-processed pose data $x_{pose}$ generates our simulated sensor data $\tilde{x}_{sensor}$ . To guide the model to generate realistic sensor data, we incorporate the mean squared error (MSE) term between real and simulated sensor data into the overall loss function:

$$L_{MSE}=\lVert x_{sensor}-\tilde{x}_{sensor}\rVert^{2} \tag{4}$$

Inspired by [5], we customized the architecture of our regression model to better process 3D joint position data. The core of our regression model comprises five TCN blocks, each followed by a linear layer. We make two notable adjustments to the original TCN block design presented in Rey et al. [5]. First, we employ 2D convolution layers instead of 1D, a change necessitated by our use of 3D joint positions as input data. Second, we substitute the ReLU activation function with Leaky ReLU for enhanced performance, applying it after two 2D convolutions alongside a dropout layer. Consistent with the configuration in [5], the initial TCN block in our model does not incorporate dropout. Table I shows the full architecture of the proposed model. For the UTD-MHAD dataset, we modified the architecture by removing TCN Block 5 and adding a view (reshape) layer before the FC layer to adapt to the data shape.

TABLE I: Regression model architecture

Layer	TCN Block (in_ch, out_ch, kernel_size, dilation, dropout)
TCN Block 1	(3, 32, 3, 1, 0)
TCN Block 2	(32, 32, 3, 2, 0.2)
TCN Block 3	(32, 32, 3, 4, 0.2)
TCN Block 4	(32, 32, 3, 1, 0.2)
TCN Block 5	(16, 16, 1, 1, 0.1)
Fully Connected	Linear(out_features=3*window)

TABLE II: Feature Extraction Model Architecture and Classification

Layer	Configuration
1D Conv	Input channels: 3, Output channels: 9, Kernel size: 9, Stride: 9//2
Leaky ReLU	-
1D Batch Norm	f = 9
Dropout	0.2
1D Conv	Input channels: 9, Output channels: 9, Kernel size: 9, Stride: 9//2
Leaky ReLU	-
1D Batch Norm	f = 9
Dropout	0.2
1D Conv	Input channels: 9, Output channels: 9, Kernel size: 9, Stride: 9//2
Leaky ReLU	-
1D Batch Norm	f = 9
Dropout	0.2
1D Maxpool	Kernel size: 2, Stride: 2
Fully Connected	f_out = 100
Fully Connected	f_out = n classes

Both real and simulated sensor data are then processed by the same feature extraction module $F$ and subsequently by our classifier $C$. We investigated promoting alignment between the feature vectors derived from real and synthetic accelerometer data by including in our loss a cosine similarity term between the two, as expressed by:

$$L_{similarity}=L_{CS}(F(x_{sensor}),F(\tilde{x}_{sensor})) \tag{5}$$

where $x_{sensor}$ and $\tilde{x}_{sensor}$ denote real sensor data and synthetic sensor data, respectively.
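A cosine-similarity-based alignment term of this kind can be sketched as follows. The exact form of $L_{CS}$ is not spelled out in the text, so this sketch assumes the common choice of one minus the cosine similarity, which is zero when the two feature vectors point in the same direction; the function names are hypothetical.

```python
import math


def cosine_similarity(u, v, eps=1e-8):
    """Cosine of the angle between two feature vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / max(norm_u * norm_v, eps)   # eps guards against zero vectors


def cosine_similarity_loss(f_real, f_sim):
    """Assumed form of L_CS: 1 - cos(F(x_sensor), F(x~_sensor))."""
    return 1.0 - cosine_similarity(f_real, f_sim)
```

Under this assumption the loss ranges from 0 (aligned features) to 2 (opposite features) and is insensitive to the magnitude of the feature vectors.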

Simultaneously, for activity classification accuracy, we employ cross-entropy loss with both real and synthetic feature vectors, defined as:

$$\begin{split}L_{activity}=&-\sum_{l=1}^{N_{activity}}w_{l}y_{l}\log C(F(x_{sensor_{l}}))\\ &-\sum_{l=1}^{N_{activity}}w_{l}y_{l}\log C(F(\tilde{x}_{sensor_{l}}))\end{split} \tag{6}$$
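For a single training window with a one-hot label, Eq. (6) reduces to two weighted log-probability terms, one for the real and one for the simulated input. The sketch below illustrates this under stated assumptions: the classifier outputs are taken to be logits passed through a softmax, the weights are read as per-class weights, and all names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_cross_entropy(logits, label, class_weights):
    """-w_l * log p_l for the true class l (the one-hot sum keeps one term)."""
    probs = softmax(logits)
    return -class_weights[label] * math.log(probs[label])


def activity_loss(logits_real, logits_sim, label, class_weights):
    """Sum of the real and simulated cross-entropy terms, as in Eq. (6)."""
    return (weighted_cross_entropy(logits_real, label, class_weights)
            + weighted_cross_entropy(logits_sim, label, class_weights))
```

Because the same label supervises both terms, a simulated window that the classifier cannot recognize contributes to the loss even when its waveform is close to the real signal.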

Our feature extraction model consists of three 1D convolution layers with maxpooling after the first and second convolution layers, culminating in a linear layer that prepares the feature vector for the classification module. The classification is performed by a single-layer fully connected network. Table II details the architecture of the feature extractor and classification models. In the case of the UTD-MHAD dataset we adopted the classification model from Chen et al. [8].
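To make the temporal down-sampling of the feature extractor concrete, the sequence lengths implied by Table II can be worked out with the standard output-length formula for strided convolutions. This is a sketch under assumptions the text does not confirm: zero padding, no dilation, a 300-sample input window, and the layer order exactly as listed in Table II (three convolutions followed by one max-pooling layer).

```python
def conv1d_out_len(n, kernel, stride, padding=0, dilation=1):
    """Output length of a 1D convolution or pooling layer."""
    return (n + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1


def feature_extractor_lengths(window=300):
    """Sequence length after each strided layer of Table II (assumed order)."""
    lengths = []
    n = window
    for _ in range(3):                        # three 1D convs: kernel 9, stride 9//2 = 4
        n = conv1d_out_len(n, kernel=9, stride=9 // 2)
        lengths.append(n)
    n = conv1d_out_len(n, kernel=2, stride=2)  # 1D max-pooling
    lengths.append(n)
    return lengths
```

Under these assumptions a 300-sample window shrinks to 73, 17, and 3 time steps after the three convolutions and to a single time step after pooling, so the first fully connected layer would receive 9 values (9 channels times 1 step). A different padding or pooling placement would change these numbers.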

Unlike conventional approaches, all networks of our proposed framework are trained concurrently. This concurrent training strategy, in which the regression model, feature extractor, and classifier are collectively optimized, plays a key role in smoothly integrating real and simulated data into a robust and effective system. The total loss function encapsulates the contributions of the reconstruction loss $L_{MSE}$ for sensor data generation, the activity classification loss $L_{activity}$ for activity classification, and the cosine similarity loss $L_{similarity}$ for feature alignment.

$$L_{final}=L_{MSE}+\alpha L_{activity}+\beta L_{similarity} \tag{7}$$

where $\alpha$ and $\beta$ are weighting factors that balance the contribution of each loss component to the total loss.
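The weighted combination in Eq. (7) can be illustrated with a short sketch. It is not the training code of the paper: the function names are hypothetical, the reconstruction term is written as a mean over samples (which differs from the squared norm of Eq. (4) only by a constant factor), and the value of $\alpha$ used in the usage note is a placeholder, since the text does not report it.

```python
def mse(real, simulated):
    """Mean squared error between real and simulated sensor samples."""
    if len(real) != len(simulated):
        raise ValueError("sequences must have the same length")
    return sum((r - s) ** 2 for r, s in zip(real, simulated)) / len(real)


def total_loss(l_mse, l_activity, l_similarity, alpha, beta):
    """Eq. (7): L_final = L_MSE + alpha * L_activity + beta * L_similarity."""
    return l_mse + alpha * l_activity + beta * l_similarity
```

For instance, `total_loss(l_mse, l_act, l_sim, alpha=1.0, beta=0.0)` switches the feature alignment term off entirely, which corresponds to the $\beta=0$ setting reported in the results tables.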

IV Experiments and Results

This section provides a comprehensive overview of the empirical evaluation conducted to assess the effectiveness of our proposed method. We commence by detailing the MM-Fit dataset, which serves as the foundational data source for our study. Following this, we outline the baseline method against which we benchmark our approach, establishing a context for comparative analysis.

This section culminates with a presentation of the results derived from our evaluation, illustrating the impact of our modifications and the overall performance of our proposed method in the context of Human Activity Recognition (HAR).

IV-A Dataset

The MM-Fit dataset forms the core of our study, tracking participants engaged in a variety of workout activities. It provides a comprehensive suite of data capturing 2D and 3D skeletal poses and Inertial Measurement Unit (IMU) sensor readings from the wrist and other body locations across 21 workout sessions. Each session is composed of three sets, with each set containing ten exercises and ten repetitions. During the rest intervals between sets, the sensors continue to record, capturing the natural rest behavior of participants. The dataset categorizes eleven different activities, including periods of ’no workout’.

The UTD-MHAD dataset offers a comprehensive collection of 27 diverse human actions that includes 20 upper body and 7 lower body motions, ranging from simple gestures like hand waves and clapping to complex movements like basketball shooting and jogging. A Kinect camera and a wearable inertial sensor are used for data collection. The Kinect camera captures high-resolution color and depth images at a frame rate of approximately 30 frames per second, providing detailed visual data. Complementing this, the wearable inertial sensor, positioned either on the subject’s wrist or thigh depending on the nature of the action, records precise motion data including acceleration, angular velocity, and magnetic strength at a sampling rate of 50 Hz.

For MM-Fit, we focus specifically on the 3D skeleton pose data, obtained from an RGB camera operating at 30 Hz, and the left wrist accelerometer data from a smartwatch with a 100 Hz sampling rate. As in [5], our regression model receives as input the three left-arm joints (wrist, elbow, and shoulder).

To ensure consistency and comparability with previous studies, we adhered to the same training, validation, and test splits as employed in the original MM-Fit paper, testing on the fixed test set of unseen users. We used 3-second sliding windows (300 samples) with a 0.2-second stride and labeled each window with the activity that occurred most often within it. This approach allows for direct comparison of our results with the established benchmarks in the field.
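The windowing scheme described above can be sketched as follows, assuming 100 Hz data so that a 3-second window is 300 samples and a 0.2-second stride is 20 samples. The function name is hypothetical, and the tie-breaking rule (the label encountered first in the window wins) is an assumption, as the text does not address ties.

```python
from collections import Counter


def sliding_windows(signal, labels, window=300, stride=20):
    """Cut a recording into overlapping windows with majority-vote labels.

    signal: per-sample sensor values; labels: per-sample activity labels.
    Returns a list of (window_samples, window_label) pairs.
    """
    if len(signal) != len(labels):
        raise ValueError("signal and labels must have the same length")
    windows = []
    for start in range(0, len(signal) - window + 1, stride):
        segment = signal[start:start + window]
        counts = Counter(labels[start:start + window])
        majority_label = counts.most_common(1)[0][0]
        windows.append((segment, majority_label))
    return windows
```

A window that straddles a transition between an exercise and a rest period is therefore assigned to whichever of the two covers more of its 300 samples.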

IV-B Baselines

This subsection outlines two baseline methods used for activity classification. The first baseline method utilizes only real accelerometer data from the left wrist. It involves a feature extraction model that identifies key features from the accelerometer data, which are then used by a classifier to predict the activity type. Both the feature extractor and the classifier follow the same architecture as our proposed method. This approach is akin to the one used in the MM-Fit [7] study and serves as a comparison to demonstrate the enhancements our proposed method provides.

The second baseline is an approach where we first train a regression model to generate synthetic sensor data from 3D joint poses and then use the synthetic data provided by said model, together with real sensor data to train the activity classifier. By integrating synthetic data, this model aims to augment the training dataset and improve classification performance. However, unlike our proposed method, this process involves separate steps for data synthesis and classifier training.

Both baselines are critical for evaluating the effectiveness of our proposed method. By comparing these two approaches, we aim to showcase how including classification in the sensor generation procedure can improve both the quality of the generated sensor data as well as the overall classification performance for real data in terms of F1 score and accuracy, emphasizing our contributions to the field of Human Activity Recognition.

In our approach, we explore the efficacy of integrating both real and synthetic IMU accelerometer data to enhance the training of the feature extraction and classification modules. Initially, we developed a regression model trained independently to accurately predict IMU accelerometer data based on the arm’s joint positions, specifically the wrist, elbow, and shoulder. Following the successful generation of synthetic accelerometer data, we merge it with the real sensor data. This combined dataset is then utilized to train the feature extraction and classification modules, aiming to leverage the diversity and comprehensiveness of the augmented data set for improved model performance.

IV-C Implementation Details

To ensure a consistent starting point for model training, we initialize the weights of the convolution and fully connected layers using Kaiming initialization. Recognizing the potential impact of initialization randomness, we conducted five runs of the experiment for MM-Fit and ten runs for UTD-MHAD, each with a predefined seed. The Adam optimizer, with a learning rate of $10^{-3}$, was selected for training. To prevent over-fitting and ensure good generalization, we implemented an early stopping mechanism based on the F1 score on the validation set, with a patience of 25 epochs for MM-Fit and 30 epochs for UTD-MHAD. The models were trained for a maximum of 100 epochs for MM-Fit and 200 epochs for UTD-MHAD, or until the early stopping criterion was met.

Early stopping based on validation set performance plays a critical role in our experimental design: it halts training when the model ceases to improve, thereby conserving computational resources and avoiding over-fitting.
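A patience-based early stopping rule of the kind described here can be sketched as below. This is a generic illustration, not the code used in the experiments: the class name is hypothetical, and the rule that only a strict improvement of the validation F1 score resets the counter is an assumption.

```python
class EarlyStopping:
    """Stop training once the validation F1 score stops improving."""

    def __init__(self, patience):
        self.patience = patience          # e.g. 25 for MM-Fit, 30 for UTD-MHAD
        self.best = float("-inf")
        self.epochs_without_improvement = 0

    def step(self, val_f1):
        """Record one epoch; return True when training should stop."""
        if val_f1 > self.best:
            self.best = val_f1
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience
```

In a training loop one would call `stopper.step(validation_f1)` once per epoch and leave the loop when it returns `True` or when the maximum number of epochs is reached; `stopper` and `validation_f1` are placeholder names.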

The results from these experiments highlight the efficacy of our proposed modifications and the robustness of our end-to-end pipeline. The improvements in F1 score and accuracy, detailed further in this section, substantiate our hypothesis that integrating synthetic accelerometer data and optimizing the feature extraction process can significantly enhance HAR performance.

IV-D Results

TABLE III: Comparison of Method Performance for MM-Fit

Method	F1 Score	Accuracy
IMUTube [4]	$0.7697\pm 0.0019$	$0.8981\pm 0.0019$
Baseline (real data)	$0.9025\pm 0.0323$	$0.9559\pm 0.0125$
Regression-first	$0.8848\pm 0.0183$	$0.9490\pm 0.0070$
Proposed Method ( $\beta=10$ )	$0.9131\pm 0.0266$	$0.9494\pm 0.0065$
Proposed Method ( $\beta=0$ )	$0.9196\pm 0.0038$	$0.9618\pm 0.0017$

TABLE IV: Comparison of Method Performance on the UTD-MHAD dataset [8]. Results marked with * are the values reported in the referenced papers.

Method	F1 Score	Accuracy
Chen et al. [8] *	-	0.6720
Memmesheimer et al. [9] *	-	0.7286
Baseline (real data)	$0.6285\pm 0.0171$	$0.6702\pm 0.0137$
Regression-first	$0.6650\pm 0.0204$	$0.6911\pm 0.0121$
Proposed Method ( $\beta=10$ )	$0.7342\pm 0.0131$	$0.7581\pm 0.0020$
Proposed Method ( $\beta=0$ )	$0.7388\pm 0.0101$	$0.7635\pm 0.0081$

This section delves into the comparative analysis of the proposed method against the baselines. The discussion centers around the implications of the findings in terms of F1 score and accuracy, providing insights into the efficacy of synthetic data in training models for Human Activity Recognition (HAR).
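For reference, the two metrics used throughout the comparison can be computed as in the following sketch. It assumes that the F1 score is macro-averaged, i.e. the unweighted mean of the per-class F1 scores, in line with the macro F1 score mentioned in the introduction; the function names are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of windows whose predicted label matches the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

Because every class contributes equally to the macro average, this metric is more sensitive than accuracy to errors on rare activities, which matters when a class such as 'no workout' dominates the recordings.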

Proposed and Baseline Methods Comparison: As Tables III and IV show, our proposed approach with $\beta=0$ improves both F1 score and accuracy over the real-data baseline on both datasets. With $\beta=10$, it improves the F1 score on both datasets and the accuracy on UTD-MHAD, while its accuracy on MM-Fit is slightly below the baseline. These results suggest that integrating synthetic accelerometer data within the training process enhances the performance of both the feature extraction and classification models. This observation is particularly pronounced in the context of our end-to-end pipeline, which seamlessly incorporates synthetic data alongside real sensor inputs.

TABLE V: Comparison of Sensor Generation Quality for MM-Fit

Regression Training	MSE in the Test set
Regression-first	$0.4890\pm 0.0123$
End-to-end pipeline	$0.4620\pm 0.0055$

TABLE VI: Comparison of Sensor Generation Quality for UTD-MHAD [8]

Regression Training	MSE in the Test set
Regression-first	$0.3910\pm 0.0134$
End-to-end pipeline	$0.3281\pm 0.0122$

Proposed and Regression-first Methods Comparison: In addition to comparing the proposed method with the baseline, we explored an alternative scenario wherein the regression model is first trained independently to generate synthetic accelerometer data, which is then used alongside real data to train the feature extraction and classification models. This sequential approach, while theoretically sound, did not yield the same level of improvement as the integrated end-to-end pipeline. In fact, on MM-Fit, performing the regression first did not improve results compared to the baseline (Table III), whereas on UTD-MHAD it gave a moderate improvement (Table IV). In our tests we have access to the full training set, and thus it is possible that the regression model alone could not generate simulated data of sufficient quality to add information beyond the real data. The performance gap between the two approaches (see again Tables III and IV) can be attributed to the dynamic feedback loop established in the end-to-end training process, where the simultaneous adaptation of the regression model and the classification framework to each other’s outputs fosters a more synergistic learning environment. This interdependence ensures that the synthetic data is not only accurate but also aligned with the objectives of the feature extraction and classification tasks.

This gap can be seen quantitatively when comparing the MSE of the generated data in the test set. As we can see in Tables V and VI, our regression model provided a smaller mean MSE in the test set along with a smaller standard deviation. We can also see qualitatively in Fig. 3 that our approach provides better coverage of high-frequency components in the signal, even if this increases the overall noise in the signal.

Proposed and Existing Methods Comparison: We conducted a comparative analysis between our method and IMUTube [4] on the MM-Fit dataset. Given that our method generates sensor data using the original MM-Fit dataset exclusively, we followed the same protocol as IMUTube. Specifically, we utilized the pipeline outlined in [4] to create simulated accelerometer data for the left wrist using the original MM-Fit videos. This serves as a baseline measure of the quality of simulated data for this dataset, as IMUTube is not explicitly designed to optimize the quality of generated sensor data, unlike our regression model. Following the acquisition and calibration steps outlined in [4], we trained our baseline model using a combination of half real sensor data and half IMUTube-generated data for each batch.
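The half-real, half-simulated batch composition used for the IMUTube comparison can be sketched as follows. This is an illustrative sketch rather than the training code: the function name, the fixed seed, and the decision to drop leftover samples that do not fill a complete batch are hypothetical choices.

```python
import random


def half_and_half_batches(real, synthetic, batch_size, seed=0):
    """Build batches made of equal parts real and simulated windows."""
    if batch_size % 2 != 0:
        raise ValueError("batch_size must be even")
    rng = random.Random(seed)
    real, synthetic = list(real), list(synthetic)
    rng.shuffle(real)
    rng.shuffle(synthetic)
    half = batch_size // 2
    n_batches = min(len(real), len(synthetic)) // half
    batches = []
    for b in range(n_batches):
        batch = real[b * half:(b + 1) * half] + synthetic[b * half:(b + 1) * half]
        rng.shuffle(batch)                 # mix the two sources within the batch
        batches.append(batch)
    return batches
```

Fixing the ratio per batch, rather than pooling both sources and sampling uniformly, keeps the contribution of simulated data constant across training steps even when the two sets differ in size.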

As depicted in Table III, our proposed method outperforms IMUTube, which performs worse than the baseline. This is consistent with our previous findings: the quality of simulated data affects classification performance and, therefore, it is reasonable that simply applying IMUTube to the videos from the dataset itself only degrades the classifier performance. This can be overcome if one is using external videos to obtain additional data or, as is the case for our method, there is end-to-end optimization of sensor generation and activity recognition.

In addition, we evaluated the proposed method against two existing methods, Chen et al. [8] and Memmesheimer et al. [9], on the UTD-MHAD dataset. The results for these methods in Table IV are the values reported in the respective papers. The proposed method achieves higher accuracy than both.

Implications and Future Directions: The findings from this study underscore the critical role of synthetic data in enhancing HAR systems, particularly in scenarios plagued by the scarcity of labeled datasets. The end-to-end pipeline proposed herein not only demonstrates the feasibility of such an approach but also sets a new benchmark for the integration of synthetic and real data in training sophisticated machine learning models. However, it is important to acknowledge that the accuracy of the generated sensor data is inherently tied to the quality of the input pose sequences. In this work, we utilized the pose sequences provided by the open-access benchmark dataset; however, in real-world settings, accurate 3D pose estimation techniques are crucial and can significantly impact the quality of the generated sensor data. Thus, in future work, we plan to explore more robust 3D pose estimation techniques and conduct sensitivity analyses to evaluate the robustness of the proposed method to variations in pose estimation accuracy.

V Conclusions

In this study, we proposed an innovative approach to HAR by introducing a regression model, Pose2IMU, for accelerometer data and employing a novel combination of weighted loss functions. Our proposed method significantly enhanced the training and performance of activity classification models by incorporating synthetic accelerometer data derived from 3D skeleton poses. The use of TCN blocks tailored for processing 3D joint positions, along with adjustments in activation functions and optimization strategies, demonstrated a clear improvement in model accuracy and F1 scores compared to baseline methodologies.

The results affirmed the potential of synthetic data augmentation and sophisticated loss functions in overcoming the challenges posed by the limited availability of labeled HAR datasets. By effectively leveraging synthetic data, our method not only improved the depth and breadth of training data but also introduced a novel perspective on feature extraction and classification in the context of wearable sensor data.

Looking ahead, there are several avenues for extending this work. Exploring alternative end-to-end pipelines that incorporate multi-task learning could offer additional insights into the simultaneous optimization of related tasks, potentially leading to further improvements in HAR systems. In addition, an adversarial learning scheme [17] between generated and real sensor data could be adopted, so that the generated data not only matches the corresponding real sensor data but also follows the distribution of the real sensor data, which may further improve HAR performance.

Acknowledgments

The research reported in this paper was supported by the BMBF in the project VidGenSense (01IW21003) and Carl-Zeiss Stiftung under the Sustainable Embedded AI project (P2021-02-009).

References

[1] T. Yoshida, K. Kano, K. Higashiura, K. Yamaguchi, K. Takigami, K. Urano, S. Aoki, T. Yonezawa, and N. Kawaguchi, “A data-driven approach for online pre-impact fall detection with wearable devices,” in Sensor- and Video-Based Activity and Behavior Computing: Proceedings of 3rd International Conference on Activity and Behavior Computing (ABC 2021). Springer, 2022, pp. 133–147.
[2] K. Sangeethalakshmi, U. Preethi, S. Pavithra et al., “Patient health monitoring system using IoT,” Materials Today: Proceedings, vol. 80, pp. 2228–2231, 2023.
[3] S. Suh, V. F. Rey, S. Bian, Y.-C. Huang, J. M. Rožanec, H. T. Ghinani, B. Zhou, and P. Lukowicz, “Worker activity recognition in manufacturing line using near-body electric field,” IEEE Internet of Things Journal, 2023.
[4] H. Kwon, C. Tong, H. Haresamudram, Y. Gao, G. D. Abowd, N. D. Lane, and T. Ploetz, “IMUTube: Automatic extraction of virtual on-body accelerometry from video for human activity recognition,” Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, vol. 4, no. 3, pp. 1–29, 2020.
[5] V. Fortes Rey, K. K. Garewal, and P. Lukowicz, “Translating videos into synthetic training data for wearable sensor-based activity recognition systems using residual deep convolutional networks,” Applied Sciences, vol. 11, no. 7, p. 3094, 2021.
[6] P. S. Santhalingam, P. Pathak, H. Rangwala, and J. Kosecka, “Synthetic smartwatch IMU data generation from in-the-wild ASL videos,” Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, vol. 7, no. 2, pp. 1–34, 2023.
[7] D. Strömbäck, S. Huang, and V. Radu, “MM-Fit: Multimodal deep learning for automatic exercise logging across sensing devices,” Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, vol. 4, no. 4, pp. 1–22, 2020.
[8] C. Chen, R. Jafari, and N. Kehtarnavaz, “UTD-MHAD: A multimodal dataset for human action recognition utilizing a depth camera and a wearable inertial sensor,” in 2015 IEEE International Conference on Image Processing (ICIP). IEEE, 2015, pp. 168–172.
[9] R. Memmesheimer, N. Theisen, and D. Paulus, “Gimme signals: Discriminative signal encoding for multimodal activity recognition,” in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2020, pp. 10394–10401.
[10] X. Li, J. Luo, and R. Younes, “ActivityGAN: Generative adversarial networks for data augmentation in sensor-based human activity recognition,” in Adjunct Proceedings of the 2020 ACM International Joint Conference on Pervasive and Ubiquitous Computing and Proceedings of the 2020 ACM International Symposium on Wearable Computers, 2020, pp. 249–254.
[11] Y. Hu, “BSDGAN: Balancing sensor data generative adversarial networks for human activity recognition,” in 2023 International Joint Conference on Neural Networks (IJCNN). IEEE, 2023, pp. 1–8.
[12] S. Zuo, V. F. Rey, S. Suh, S. Sigg, and P. Lukowicz, “Unsupervised statistical feature-guided diffusion model for sensor-based human activity recognition,” arXiv preprint arXiv:2306.05285, 2023.
[13] V. F. Rey, P. Hevesi, O. Kovalenko, and P. Lukowicz, “Let there be IMU data: Generating training data for wearable, motion sensor based activity recognition from monocular RGB videos,” in Adjunct Proceedings of the 2019 ACM International Joint Conference on Pervasive and Ubiquitous Computing and Proceedings of the 2019 ACM International Symposium on Wearable Computers, 2019, pp. 699–708.
[14] F. Xiao, L. Pei, L. Chu, D. Zou, W. Yu, Y. Zhu, and T. Li, “A deep learning method for complex human activity recognition using virtual wearable sensors,” in Spatial Data and Intelligence: First International Conference, SpatialDI 2020, Virtual Event, May 8–9, 2020, Proceedings 1. Springer, 2021, pp. 261–270.
[15] C. Xia and Y. Sugiura, “Virtual IMU data augmentation by spring-joint model for motion exercises recognition without using real data,” in Proceedings of the 2022 ACM International Symposium on Wearable Computers, 2022, pp. 79–83.
[16] S. Bai, J. Z. Kolter, and V. Koltun, “An empirical evaluation of generic convolutional and recurrent networks for sequence modeling,” arXiv preprint arXiv:1803.01271, 2018.
[17] S. Suh, V. F. Rey, and P. Lukowicz, “Adversarial deep feature extraction network for user independent human activity recognition,” in 2022 IEEE International Conference on Pervasive Computing and Communications (PerCom). IEEE, 2022, pp. 217–226.