Unsupervised Statistical Feature-Guided Diffusion Model for Sensor-based Human Activity Recognition

Si Zuo (Aalto University, Espoo, Finland), Vitor Fortes Rey (DFKI and RPTU Kaiserslautern-Landau, Kaiserslautern, Germany), Sungho Suh (DFKI and RPTU Kaiserslautern-Landau, Kaiserslautern, Germany), Stephan Sigg (Aalto University, Espoo, Finland), and Paul Lukowicz (DFKI and RPTU Kaiserslautern-Landau, Kaiserslautern, Germany)
(2024)
Abstract.

Human activity recognition (HAR) from on-body sensors is a core functionality in many AI applications: from personal health, through sports and wellness to Industry 4.0. A key problem holding up progress in wearable sensor-based HAR, compared to other ML areas, such as computer vision, is the unavailability of diverse and labeled training data. Particularly, while there are innumerable annotated images available in online repositories, freely available sensor data is sparse and mostly unlabeled. We propose an unsupervised statistical feature-guided diffusion model specifically optimized for wearable sensor-based human activity recognition with devices such as inertial measurement unit (IMU) sensors. The method generates synthetic labeled time-series sensor data without relying on annotated training data. Thereby, it addresses the scarcity and annotation difficulties associated with real-world sensor data. By conditioning the diffusion model on statistical information such as mean, standard deviation, Z-score, and skewness, we generate diverse and representative synthetic sensor data. We conducted experiments on public human activity recognition datasets and compared the method to conventional oversampling and state-of-the-art generative adversarial network methods. Experimental results demonstrate that this can improve the performance of human activity recognition and outperform existing techniques.

Human activity recognition, Sensor data generation, Unsupervised learning, Statistical feature-guided diffusion model
copyright: acmlicensed; journalyear: 2024; doi: XXXXXXX.XXXXXXX; journal: IMWUT; journalvolume: 0; journalnumber: 0; article: 0; publicationmonth: 5; submissionid: 8900; ccs: Computing methodologies → Unsupervised learning; ccs: Mathematics of computing → Time series analysis

1. Introduction

Wearable sensor-based human activity recognition (HAR) plays a crucial role in various domains, including healthcare (Bachlin et al., 2009; Fischer et al., 2021), sports (Sundholm et al., 2014; Zhou et al., 2022), and security (Derawi et al., 2010). The accurate recognition of human activities enables the development of intelligent systems and applications that can assist individuals, monitor their well-being, and improve safety. It is a core component of the human-centric AI vision.

While other areas of Machine Learning, such as Computer Vision or Natural Language Processing (NLP), have made dramatic progress over the last decade, facilitating a range of real-world applications, the performance of HAR systems is lagging behind. To a large extent, this is due to the shortage of labeled training data. Training data for HAR from wearable sensors is available in much smaller amounts than images, videos, or texts, and it is also more costly and challenging to annotate (Fortes Rey et al., 2022). Unlike vision data, which can be annotated with tools that build on available image or video datasets (Lin et al., 2019; Russell et al., 2008), wearable sensor data annotation is prohibitively expensive and time-consuming, which hinders the progress of wearable sensor-based HAR research and limits its practical applications. Moreover, time-series sensor data for HAR often lacks the rich semantic information that is abundant in computer vision data. At the same time, wearable sensor data offers advantages over visual information in many situations. Unlike visual data, which may capture identifiable features or personal information, wearable sensor-based HAR data typically consists of raw measurements or abstract data that are not easily linked to an individual’s identity. In contrast to video or images, the quality of wearable sensor data is not affected by lighting conditions or camera placement. Other benefits include a smaller data size and lower energy consumption. For these reasons, the popularity of Inertial Measurement Unit (IMU) sensors has been steadily increasing, and they are now found in mobile devices (e.g., smartwatches and smartphones), drones, robots, motion capture systems, etc.

Synthetic training data generation (and data set augmentation) is a solution to the lack of labeled data. Compared to Computer Vision and NLP, less work is focused on wearable sensor data generation. Furthermore, methods applied to Computer Vision and NLP are often not directly applicable to wearable sensor data. Consequently, in order to overcome the scarcity of annotated datasets for wearable sensor data, new effective methods are needed to generate labeled data.

To address this challenge, translating video data into IMU representations (Rey et al., 2019; Kwon et al., 2020) has been studied. These efforts have leveraged generative methods (Rey et al., 2019; Fortes Rey et al., 2021) and trajectory-based approaches (Kwon et al., 2020; Xiao et al., 2021) to extract IMU data from videos, expanding the applicability of sensor-based HAR. Despite the resulting improvements in HAR performance, limitations remain, particularly regarding input video quality and vision-specific problems such as camera ego-motion and object occlusion.

Conversely, traditional data augmentation techniques, common in various domains, have been adapted for wearable sensor data (Kim and Jeong, 2021; Jeong et al., 2021). While computer vision relies on simple affine transformations, accelerometer signals require alternative strategies, such as signal processing methods like jittering and scaling (Um et al., 2017; Ohashi et al., 2017; Mathur et al., 2018). In addition, conventional oversampling techniques, such as the Synthetic Minority Over-sampling Technique (SMOTE) (Chawla et al., 2002) and the Majority Weighted Minority Oversampling Technique (MWMOTE) (Barua et al., 2012), have been used to address data scarcity. However, since these methods are not specifically designed for data generation in HAR, they often overlook the temporal dependencies and statistical properties inherent in wearable sensor data. This can limit their effectiveness in capturing the true underlying distribution of the data. As a consequence, the generated synthetic data may fail to accurately represent the complexities of real-world wearable sensor data.

Moreover, generative adversarial network (GAN)-based methods, such as TimeGAN (Yoon et al., 2019), have been used to generate realistic time-series data. However, GAN-based methods require a significant amount of labeled data for training, which is often difficult to obtain in wearable sensor-based scenarios. Additionally, the generated data may suffer from mode collapse or lack of diversity, limiting their effectiveness in improving the performance of HAR models.

Furthermore, biomechanical simulation-based approaches have gained popularity for generating realistic sensor data. Sensor data can be generated by capturing real motion data and building a 3D motion model on a simulation platform. However, this method requires a substantial volume of real data to build an accurate model, and the multi-step process is time-consuming. In addition, the simulated movements may deviate from real-world movements because of limited model fidelity or inaccurate input settings.

To overcome these limitations, in this paper, we propose an unsupervised statistical feature-guided diffusion model for wearable sensor-based HAR to address the challenge of costly and hard-to-annotate wearable sensor data. Particularly, we leverage the abundance of unlabeled wearable sensor data that can be easily obtained in real-world scenarios. By utilizing unsupervised learning, we generate synthetic sensor data that can enhance the performance of HAR models without relying on labeled data for training.

The diffusion model is conditioned by statistical information such as mean, standard deviation, Z-score, and skewness. By capturing the statistical properties of real-world wearable sensor data, it can generate synthetic data that closely resembles the characteristics of real wearable sensor data. Unlike traditional generative models that require class labels, our approach does not depend on labeled data, making it highly applicable to scenarios where labeled data is scarce or unavailable. Our framework consists of two steps: the training of the unsupervised statistical feature-guided diffusion model on a large amount of unlabeled data, and the training of an independent human activity classifier using a combination of a small amount of labeled real data as well as the synthetic data generated by the pretrained diffusion model. This two-step process enables us to leverage the benefits of both unsupervised learning and supervised classification, leading to improved HAR performance. Moreover, since our model captures motion information from statistical features rather than labels, unlike other data synthesis methods (e.g., TimeGAN (Yoon et al., 2019)), there is no need to train separate generation models for each activity class. Data from various classes can be generated using a single trained diffusion model, significantly enhancing training efficiency.

To evaluate the effectiveness of the approach, we conduct experiments on three public openly-accessible datasets: MM-FIT (Strömbäck et al., 2020), PAMAP2 (Reiss and Stricker, 2012), and Opportunity (Chavarriaga et al., 2013). We compare the method with conventional oversampling techniques, such as SMOTE (Chawla et al., 2002) and SVM-SMOTE (Nguyen et al., 2011), and a GAN-based method, TimeGAN (Yoon et al., 2019). The experimental results demonstrate superiority both in terms of performance metrics for HAR and the diversity of synthetically generated data.

Our main contributions are:

  • We propose an unsupervised statistical feature-guided diffusion model (SF-DM) for wearable sensor data generation in HAR, capturing the features of time-series sensor data more effectively.

  • The diffusion model generates synthetic sensor data conditioned on statistical information, such as mean, standard deviation, Z-score, and skewness, without relying on class label information. Thus, a single trained diffusion model can be utilized to generate data from various classes.

  • Our approach is applicable to real-world scenarios where labeled data is scarce or unavailable and requires no post-processing of the generated data; it can be utilized directly to train a HAR model.

  • We present experimental results on public, openly accessible datasets that demonstrate improved HAR performance over conventional oversampling methods and GAN-based methods.

The rest of the paper is organized as follows. Section 2 introduces the related works. Section 3 provides the details of the proposed diffusion model. Section 4 presents quantitative experimental results on the three datasets. Section 5 discusses the potential of the proposed method and available applications in HAR. Finally, Section 6 concludes the paper.

2. Related Work

2.1. Wearable Sensor-based Human Activity Recognition

Sensor-based systems have dominated applications that monitor daily activities, given the privacy concerns associated with placing cameras in personal spaces (Gheid and Challal, 2016; Qi et al., 2018; Zhang et al., 2019), which highlights the importance of sensor-based HAR. Park et al. (Park et al., 2023) proposed a MultiCNN-FilterLSTM model to provide a resource-efficient method for HAR on smart devices. Dahou et al. (Dahou et al., 2023) developed a discrete wavelet transform method coupled with a multilayer residual convolutional neural network (MLCNNwav) for sensor-based HAR, increasing generalization and minimizing processing complexity. Ferrari et al. (Ferrari et al., 2023) proposed a personalization technique combined with machine learning to improve the generalization ability of the model.

Given the effectiveness and widespread adoption of the Transformer architecture (Vaswani et al., 2017), it has also been introduced to sensor-based HAR. Dirgová Luptáková et al. (Dirgová Luptáková et al., 2022) adapted the Transformer for time-series analysis and achieved a classification accuracy of 99.2% on a public smartphone motion sensor dataset that covers a wide range of activities. Xiao et al. (Xiao et al., 2022) proposed a self-attention-based Two-stream Transformer Network (TTN) to capture temporal and spatial information, respectively. Pramanik et al. (Pramanik et al., 2023) presented a deep reverse transformer-based attention mechanism that exploits top-down feature fusion. The reverse attention regularizes the attention modules and adjusts the learning rate adaptively. Feng et al. (Feng et al., 2024) introduced the Adversarial Time-Frequency Attention (ATFA) framework to address the data heterogeneity caused by increased sensor use and diverse user contexts in sensor-based HAR.

Due to the scarcity of labeled wearable sensor datasets, HAR systems are also trained with unlabeled data to improve performance and robustness. Semi-supervised and unsupervised learning have been introduced to tackle this challenge. Balabka (Balabka, 2019) trained an adversarial autoencoder with unlabeled data and validated the model with a small amount of labeled data. Oh et al. (Oh et al., 2021) combined active learning with semi-supervised learning and achieved strong performance with less labeled data. For unsupervised learning, Sheng et al. (Sheng and Huber, 2020) used K-means clustering and an autoencoder to classify fully unlabeled wearable sensor data. LLMIE-UHAR (Gao et al., 2024) leveraged large language models (LLMs) and Iterative Evolution (IE) to perform HAR in an unsupervised way.

To address the scarcity of annotated data and accommodate the diverse real-world settings in which HAR is conducted (sensor modalities, downstream tasks, etc.), transfer learning and contrastive learning are employed. RecycleML (** function, enabling automatic translation of signals from the source sensor domain to the target sensor domain, and vice versa. COCOA (Deldari et al., 2022) explored sensor-based cross-modal contrastive learning, achieving quality representations from multisensory data through cross-correlation computations across various data modalities and mitigating the similarity among irrelevant instances. FOCAL (Liu et al., 2024) proposed a novel contrastive learning framework for multimodal time-series sensing signals that introduces an orthogonality constraint and outperforms the state of the art in downstream tasks including HAR.

2.2. Sensor Data Generation for Human Activity Recognition

The scarcity of labeled data in wearable sensor applications presents a significant challenge, prompting the exploration of synthetic data generation as a viable solution. While extensive research has been conducted in computer vision and natural language processing domains, the field of wearable sensor data generation remains relatively understudied, necessitating tailored methods to accommodate the unique characteristics of such data. Three representative approaches have emerged to address the challenge of generating wearable sensor data: 1) traditional oversampling techniques, 2) virtual IMU data generation from alternative modalities, and 3) generating sensor data using generative adversarial networks (GAN).

Traditionally, data augmentation techniques have been employed to address data scarcity issues in various domains (Kim and Jeong, 2021; Jeong et al., 2021). While conventional approaches in computer vision often rely on simple affine transformations for data augmentation, such as translation and rotation (Shorten and Khoshgoftaar, 2019; Maharana et al., 2022), the temporal nature of accelerometer signals in sensor-based HAR necessitates alternative strategies. Signal-processing methods, including jittering, scaling, and random sampling, have been proposed to augment accelerometer signals, effectively enhancing the diversity of training datasets (Um et al., 2017; Ohashi et al., 2017; Mathur et al., 2018; Cheng et al., 2023). In addition to the signal processing methods, traditional oversampling techniques like SMOTE and MWMOTE have been employed to mitigate data scarcity. However, these methods, while employed in sensor-based HAR, may overlook temporal dependencies and statistical properties inherent in wearable sensor data, limiting their effectiveness in capturing the complexities of real-world data distributions (Alharbi et al., 2022).
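The signal-processing augmentations mentioned above can be illustrated with a minimal sketch. It assumes one common formulation, additive Gaussian noise for jittering and a single random gain per window for scaling; the function names and the `sigma` parameters are hypothetical and are not taken from the cited works.

```python
import random

def jitter(signal, sigma, rng):
    """Jittering: add independent zero-mean Gaussian noise to every sample."""
    return [v + rng.gauss(0.0, sigma) for v in signal]

def scale(signal, sigma, rng):
    """Scaling: multiply the whole window by one random gain drawn around 1."""
    factor = rng.gauss(1.0, sigma)
    return [v * factor for v in signal]

rng = random.Random(0)
window = [0.0, 0.5, 1.0, 0.5, 0.0]  # toy single-axis accelerometer window
augmented = scale(jitter(window, sigma=0.05, rng=rng), sigma=0.1, rng=rng)
```

Both operations act on each sample or window independently of the activity's temporal structure, which is one way to see why such methods may overlook temporal dependencies.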

Secondly, efforts to bridge the gap between video and Inertial Measurement Unit (IMU) data have gained traction, with recent works focusing on translating video data into IMU representations (Rey et al., 2019; Kwon et al., 2020; Santhalingam et al., 2023). These endeavors leverage generative methods (Rey et al., 2019; Fortes Rey et al., 2021; Santhalingam et al., 2023) and trajectory-based approaches (Kwon et al., 2020; Xiao et al., 2021; Banos et al., 2012) to extract IMU data from videos, expanding the applicability of sensor-based HAR beyond traditional IMU-equipped scenarios. Generative methods employ machine learning to derive IMU data directly from videos (Rey et al., 2019; Fortes Rey et al., 2021), whereas trajectory-based methods estimate joint orientations from 3D joint positions extracted from videos (Kwon et al., 2020; Xiao et al., 2021). These approaches have demonstrated success in generating synthetic IMU data for human activity recognition tasks, enhancing model performance and generalizability across diverse scenarios. Despite the efficacy of systems like IMUTube, challenges remain, particularly concerning the quality of input videos. The limitations of vision-based systems are evident when videos exhibit camera egomotion or include irrelevant scenes, requiring meticulous video selection.

Thirdly, GAN-based methods have shown promise in generating realistic time-series data, combining unsupervised and supervised training approaches (Yao et al., 2018; Wang et al., 2018; Yoon et al., 2019). SenseGAN (Yao et al., 2018) and SensoryGANs (Wang et al., 2018), for instance, have introduced frameworks for generating synthetic sensor data, effectively improving human activity recognition in resource-limited environments. TimeGAN (Yoon et al., 2019) and ActivityGAN (Li et al., 2020) have demonstrated superior performance in maintaining temporal dynamics and augmenting sensor-based HAR datasets, respectively. Furthermore, Balancing Sensor Data Generative Adversarial Networks (BSDGAN) (Hu, 2023) address imbalanced datasets in HAR, effectively enhancing recognition accuracy for activity recognition models. A time-series GAN (TS-GAN) (Yang et al., 2023) based on LSTM networks for augmenting sensor-based health data was proposed to improve the performance of deep learning-based classification models. TS-GAN utilized an LSTM-based generator and discriminator, incorporating a sequential-squeeze-and-excitation module and gradient penalty from Wasserstein GANs for stability. However, these GAN-based methods demand a substantial amount of labeled data for training, a challenge in wearable sensor-based scenarios. Mode collapse and a lack of diversity in generated data are additional concerns that may limit their efficacy in improving HAR model performance.

Lastly, biomechanical simulation-based approaches are attracting increasing interest from researchers. Jiang et al. (Jiang et al., 2021) utilized OpenSim, an open-source software system for biomechanical modeling, simulation, and analysis, to simulate movements performed by individuals with various physiological characteristics in order to augment the IMU dataset. Uhlenberg et al. (Uhlenberg et al., 2023) generated synthetic accelerations as well as angular velocities with a simulation framework to enable a comprehensive analysis of gait events. Tang et al. (Tang et al., 2024) utilized the simulation platform OpenSim and forward kinematic methods to generate a substantial volume of synthetic IMU data for fall detection. However, implementing this method is complicated and time-consuming because it involves several steps. First, multi-view recordings of actions performed by participants wearing IMU sensors need to be collected, followed by pose estimation. Then, coordinates must be calculated so that the simulation platform can accurately represent and simulate physical interactions and movements. Simulated and real-world movements may also diverge because of the limited fidelity of the simulation model or inaccuracies in the input parameters.

2.3. Diffusion Probabilistic Models

The idea of the diffusion model is inspired by non-equilibrium statistical physics: structure in a data distribution is gradually eliminated by an iterative forward diffusion process. Then, a reverse diffusion process that reinstates structure in the data is learned, producing an adaptable and manageable generative model (Sohl-Dickstein et al., 2015). Recently, Diffusion Probabilistic Models (DM) have outperformed GANs (Goodfellow et al., 2014) and achieved state-of-the-art results in image synthesis by ensuring high quality and diversity (Ho et al., 2020; Nichol and Dhariwal, 2021; Dhariwal and Nichol, 2021; Rombach et al., 2021; Ho et al., 2022; Gu et al., 2021). The authors of (Ho et al., 2020) represent the diffusion process (forward process) as a Markov chain that transforms the original data distribution into a Gaussian distribution, while the reverse process (denoising process) learns to generate samples by removing noise step by step with a deep learning model. The U-Net (Ronneberger et al., 2015) architecture is considered to be a powerful deep learning model for denoising (Ho et al., 2020, 2022; Jolicoeur-Martineau et al., 2020). To further improve the denoising performance, stacks of residual layers and attention mechanisms have been introduced to U-Net-like models (Rombach et al., 2021; Song et al., 2021; Kim et al., 2022). Applying a diffusion model often requires thousands of computation steps to obtain a high-quality new sample, which significantly limits its practical usability. Methods to increase the sampling speed of a diffusion model include discretization optimization (Dockhorn et al., 2022), the use of a non-Markovian process (Song et al., 2020), and partial sampling (Song et al., 2020).
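For reference, the forward process of (Ho et al., 2020) can be written as follows. The notation is that of Ho et al., not necessarily the notation used elsewhere in this paper: $\beta_t$ is the noise variance added at step $t$, and $\bar{\alpha}_t$ is the cumulative fraction of signal retained.

```latex
q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t \mathbf{I}\right), \qquad
q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t)\,\mathbf{I}\right), \qquad
\bar{\alpha}_t = \prod_{s=1}^{t} (1-\beta_s)
```

The closed form for $q(x_t \mid x_0)$ allows a noisy sample at an arbitrary step to be drawn directly from clean data as a weighted sum of data and Gaussian noise, without simulating every intermediate step.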

In addition to image synthesis, diffusion models have been applied to other tasks including image in-painting (Lugmayr et al., 2022), 3D shape generation (Zhou et al., 2021), text generation (Gong et al., 2022), audio synthesis (Kong et al., 2020), molecular conformation generation (Xu et al., 2022), etc. Shao et al. (Shao and Sanchez, 2023) utilize a diffusion model with redesigned UNet to generate synthetic sensor data for HAR by incorporating the label information. Huang et al. (Huang et al., 2023) propose an adaptive conditional diffusion model to improve the HAR performance based on channel state information (CSI) by augmenting CSI based on visualized spectrum information. In contrast, we propose the first unsupervised statistical feature-guided diffusion model for sensor data generation in HAR, which operates without the need for labeled information and can be directly applied to raw sensor data.

3. Method

Figure 1. Overview of the statistical feature-guided diffusion model (SF-DM). Step 1: pretrain SF-DM with unlabeled real data; Step 2: train the HAR classifier with synthetic data generated by SF-DM and finetune the classifier with real data.

We propose to improve the performance of HAR models through a two-step process, as depicted in Fig. 1. First, we pretrain the diffusion model using real unlabeled sensor data with statistical features. Second, we train the HAR classifier using synthetic sensor data generated by the well-trained diffusion model (SF-DM), followed by fine-tuning the classifier with real data.

3.1. Structure of SF-DM

The unsupervised statistical feature-guided diffusion model consists of two main components: the diffusion and denoising model, as well as the conditioner. As depicted in Fig. 2, we employ an encoder-decoder framework for SF-DM.

Figure 2. The architecture of the diffusion model. The diffusion model consists of the diffusion and denoising modules. The brief architecture of the denoising module is depicted on the right side. The denoising module receives statistical features as a conditioner and the architecture of the denoising module is based on the U-Net architecture.

In the diffusion stage, we first randomly sample a time step $t$ within the range $(0, T]$, where $T$ is the maximum diffusion step. The level of noise varies according to the current diffusion step $t$. With a larger step, the noise becomes more pronounced. The noisy data is created by a weighted sum of noise and data (see Eq. 1).

The denoising model consists of an encoder-decoder framework. The encoder of SF-DM, which learns features from the input, includes three convolutional blocks and a max-pooling layer. Each convolutional block has three different inputs: statistical features from the conditioner, noisy data $I_T$ from diffusion step $T$, and the diffusion step embedding. Before feeding the statistical features to the convolutional layer, we project them to match the shape of the noisy data. The convolutional output of the diffusion step embedding is added to that of $I_T$, and the result is concatenated with the convolutional output of the statistical features to provide additional information. We employ convolutional layers with a kernel size of $9 \times 9$ and a max-pooling layer with a kernel size of $2 \times 2$ and stride 2. The decoder incorporates an upsampling layer that restores the resolution to match that of the previous layer and deconvolutional layers (Zeiler et al., 2010) (with a kernel size of $9 \times 9$), which disseminate the information contained in one data point across multiple data points. An output projection layer follows to match the dimension of the output from the diffusion model with the input real data. A detailed structure of the proposed model is shown in Fig. 3.

Figure 3. Unsupervised statistical feature-guided diffusion model. $I_T$ refers to the sensor data after diffusion step $T$.

3.2. Statistical features selection

To guide the generation of synthetic sensor data, we leverage statistical features extracted by the conditioner from real data. Compared with only using labels, these statistical features provide richer information about the activity. The features include the mean, the standard deviation, the Z-score $\frac{x-\mu}{\sigma}$ (where $x$ is an observed value, $\mu$ represents the mean of all values, and $\sigma$ indicates the standard deviation of the sample), and the skewness $\gamma = \mathbb{E}\left[\left(\frac{x-\mu}{\sigma}\right)^{3}\right]$, which measures the asymmetry of the probability distribution. Importantly, label information is not required, which means that SF-DM can be widely applied to unsupervised training with unlabeled datasets. We concatenate these features before feeding them to the diffusion model.
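The feature computation performed by the conditioner can be sketched for a single-channel window as follows. This is a minimal illustration rather than the authors' implementation: the function name `statistical_features`, the use of population statistics, the small constant added to $\sigma$, and the choice to repeat the scalar features so that every row has the window length are all assumptions.

```python
import math

def statistical_features(window, eps=1e-8):
    """Return four rows (mean, std, Z-score, skewness), each as long as `window`.

    Assumption: scalar features are repeated along the time axis so that all
    rows can be concatenated with the per-sample Z-score.
    """
    n = len(window)
    mu = sum(window) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in window) / n)  # population std
    z = [(v - mu) / (sigma + eps) for v in window]  # eps avoids division by zero
    gamma = sum(zi ** 3 for zi in z) / n            # E[((x - mu) / sigma)^3]
    return [[mu] * n, [sigma] * n, z, [gamma] * n]

window = [1.0, 2.0, 3.0, 4.0, 10.0]      # toy window with one large outlier
features = statistical_features(window)  # conditioning input f, shape 4 x len(window)
```

For this toy window the skewness is positive, reflecting the long right tail introduced by the outlier. None of the four rows needs an activity label, which is what makes the conditioning unsupervised.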

3.3. HAR with SF-DM

Figure 4. The architecture of HAR classifier.

In this work, we employ a simple convolutional neural network (CNN) for HAR, as illustrated in Fig. 4. The architecture comprises three convolutional layers (conv1, conv2, conv3) with a stride of 2, each layer followed by a ReLU activation function, along with a max-pooling layer with dilation 1. Additionally, there are five fully connected layers (FC1, FC2, FC3, FC4, FC5), each followed by a ReLU activation function, and an output layer. We use the same structure for both pretraining and fine-tuning.

As depicted in Fig. 1, the training procedure of the HAR classifier with SF-DM includes two key steps:

  (1) SF-DM training: As SF-DM does not require label information, we first train SF-DM on a large volume of unlabeled real data.

  (2) HAR classifier training: We first pretrain the HAR classifier using synthetic sensor data generated by the pretrained SF-DM and then fine-tune the classifier with real data (Algorithm 1).

For SF-DM training, in each training iteration, we sample a batch of real data and calculate the corresponding statistical features, including mean $\mu$, standard deviation $\sigma$, Z-score $z$, and skewness $\gamma$, which have the same length as the input data sequences, and concatenate them as $f$. We then generate noisy data $\tilde{x}$ from the sampled real sensor data as follows:

(1) $\tilde{x} = x \times \sqrt{\beta[t]} + \epsilon \times \sqrt{1 - \beta[t]}$

In Eq. 1, $x$ denotes the real sensor data, $\beta$ indicates the noise level, $t$ represents the diffusion step, and $\epsilon$ denotes the random noise that has the same shape as the input real data. The noisy data $\tilde{x}$, together with the statistical features, is then fed into the SF-DM.
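The noising step of Eq. 1 can be sketched as follows. The function `add_noise` follows the equation term by term. The linear schedule `beta` is purely hypothetical, since the schedule is not specified here; it is chosen only so that $\beta[t]$ decreases with $t$ and the noise therefore becomes more pronounced at larger steps.

```python
import math
import random

def add_noise(x, beta_t, rng):
    """Eq. 1: x_tilde = x * sqrt(beta_t) + eps * sqrt(1 - beta_t), eps ~ N(0, 1)."""
    eps = [rng.gauss(0.0, 1.0) for _ in x]
    x_tilde = [xi * math.sqrt(beta_t) + ei * math.sqrt(1.0 - beta_t)
               for xi, ei in zip(x, eps)]
    return x_tilde, eps

T = 50                                      # maximum diffusion step (assumed value)
beta = [1.0 - t / T for t in range(T + 1)]  # hypothetical schedule, indexed by t
rng = random.Random(0)
t = rng.randint(1, T)                       # sample a step from (0, T]
x = [0.1, 0.4, -0.2, 0.3]                   # toy real sensor window
x_tilde, eps = add_noise(x, beta[t], rng)
```

With this convention, $\beta[t] = 1$ returns the clean data unchanged and $\beta[t] = 0$ returns pure Gaussian noise.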

The diffusion model generates synthetic data by removing noise, and thus, the SF-DM is trained by minimizing a reconstruction loss between the original real wearable sensor data and the generated data.

(2) $L_{\text{rec}}(x, \tilde{x}, f; \theta_E) = \frac{\sum_{l=1}^{n} \left| D(\tilde{x}_l, f_l) - x_l \right|}{n}$

In Eq. 2, $x$ represents the unlabeled real sensor data, $\tilde{x}$ is the input noisy data, $f$ denotes the statistical features of $x$ (mean, standard deviation, Z-score, and skewness), $D$ indicates the diffusion model with the encoder and decoder, parameterized by $\theta_E$, and $n$ is the number of data samples. With this procedure, the SF-DM is trained in an unsupervised, statistical feature-guided way.
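The reconstruction loss of Eq. 2 can be sketched as a mean absolute error over a batch. Two assumptions are made for illustration: $|\cdot|$ is read as the sum of element-wise absolute differences within one sample, and `identity_denoiser` is a hypothetical stand-in for the trained model $D$ that ignores the features and returns its noisy input.

```python
def reconstruction_loss(denoiser, x_batch, x_tilde_batch, f_batch):
    """Eq. 2: average over n samples of |D(x_tilde_l, f_l) - x_l|."""
    n = len(x_batch)
    total = 0.0
    for x, x_tilde, f in zip(x_batch, x_tilde_batch, f_batch):
        x_hat = denoiser(x_tilde, f)  # reconstruction from noisy data and features
        total += sum(abs(a - b) for a, b in zip(x_hat, x))
    return total / n

def identity_denoiser(x_tilde, f):
    """Hypothetical placeholder for D; a real model would use f as conditioning."""
    return list(x_tilde)

x_batch = [[1.0, 2.0], [0.0, 0.0]]        # clean samples
x_tilde_batch = [[1.5, 2.0], [0.0, -1.0]] # their noisy versions
f_batch = [None, None]                    # features unused by the placeholder
loss = reconstruction_loss(identity_denoiser, x_batch, x_tilde_batch, f_batch)
```

Because the target is the clean data itself, the loss needs no activity labels; it is zero only when the denoiser recovers every sample exactly.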

Algorithm 1 Training procedure for unsupervised statistical feature-guided diffusion model (SF-DM) and human activity classifier with generated data
1:  Step 1: SF-DM training
Require:  Batch size $m$, Adam hyperparameter $\eta_{E}$
2:  for number of training iterations for Step 1 do
3:     Sample a batch $x$ from the training dataset $X$.
4:     Calculate the statistical features: $\mu$, $\sigma$, $Z$, $\gamma$
5:     $f \leftarrow \mathrm{concat}(\mu,\sigma,Z,\gamma)$
6:     Generate noise data $\epsilon \sim N(0,1)$ from real data with the noise scale $\beta$ and the diffusion step $t$: $\tilde{x} \leftarrow x\sqrt{\beta[t]}+\epsilon\sqrt{1-\beta[t]}$
7:     $\theta_{E} \leftarrow \theta_{E}-\eta_{E}\nabla_{\theta_{E}}L_{\text{rec}}(x,\tilde{x},f;\theta_{E})$  ▷ Eq. 2
8:
9:  Step 2: HAR classifier training
Require:  Batch size $m$, Adam hyperparameter $\eta_{C}$
10:  for number of training iterations do
11:     Sample a batch $(x,y)$ from the training dataset $X$ and corresponding activity label $Y$.
12:     Calculate the statistical features: $\mu$, $\sigma$, $Z$, $\gamma$
13:     $f \leftarrow \mathrm{concat}(\mu,\sigma,Z,\gamma)$
14:     Make a random noise vector: $\omega \sim N(0,1)$
15:     $\theta_{C} \leftarrow \theta_{C}-\eta_{C}\nabla_{\theta_{C}}L_{\text{syn}}(\omega,f,y;\theta_{C})$  ▷ Eq. 3
16:  for number of training iterations do
17:     Sample a batch $(x,y)$ from the training dataset $X$ and corresponding activity label $Y$.
18:     $\theta_{C} \leftarrow \theta_{C}-\eta_{C}\nabla_{\theta_{C}}L_{\text{real}}(x,y;\theta_{C})$  ▷ Eq. 4
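The feature computation in steps 4 and 5 of Algorithm 1 can be sketched as follows. This is an illustration under stated assumptions: the population standard deviation is used, skewness is taken as the third standardized moment, and the Z-score is computed per data point, so $f$ grows with the window length. The paper does not spell out these choices, and the name `statistical_features` is hypothetical.

```python
import math


def statistical_features(window):
    """Return concat(mu, sigma, Z, gamma) for one window of sensor values."""
    n = len(window)
    mu = sum(window) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in window) / n)
    safe_sigma = sigma if sigma > 0 else 1.0  # avoid division by zero on flat windows
    z = [(v - mu) / safe_sigma for v in window]  # per-point Z-score
    gamma = sum(zi ** 3 for zi in z) / n  # skewness (third standardized moment)
    return [mu, sigma] + z + [gamma]


f_symmetric = statistical_features([1.0, 2.0, 3.0])
f_skewed = statistical_features([0.0, 0.0, 4.0])
```

A symmetric window yields a skewness of zero, while a window with one large value yields a positive skewness, which is the kind of shape information the conditioning vector carries in place of a class label.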

Next, the HAR classifier is pretrained on synthetic data from SF-DM. From each batch of real data, we first calculate the statistical features and initialize random noise with a shape that matches the input of the SF-DM. Combining random noise and statistical features, SF-DM is able to produce synthetic sensor data sequences and patterns that resemble real data generated by the same type of sensor. The HAR classifier is trained with the synthetic sensor data and the corresponding activity label.

(3) $L_{\text{syn}}(\omega,f,y;\theta_{C})=-\sum_{l=1}^{n_{c}}y_{l}\log C(E(\omega_{l},f_{l}))$

where $x$ represents the input data, $y$ is the corresponding class label, $E$ denotes the pretrained diffusion model, $C$ is the activity classifier, $\omega$ is the random noise for an input of the diffusion model, $f$ denotes the statistical features of $x$, and $n_{c}$ indicates the number of classes.

To further improve the HAR performance, we fine-tune the HAR classifier with real sensor data.

(4) $L_{\text{real}}(x,y;\theta_{C})=-\sum_{l=1}^{n_{c}}y_{l}\log C(x_{l})$
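Eq. 3 and Eq. 4 share the same cross-entropy form and differ only in what is fed to the classifier (synthetic data in Eq. 3, real data in Eq. 4). The following sketch evaluates that form for a single sample, assuming a one-hot label and a classifier output that is already a probability vector; the clipping constant `eps` is an added numerical safeguard, not part of the paper.

```python
import math


def cross_entropy(one_hot_label, class_probs, eps=1e-12):
    """-sum_l y_l * log(p_l) for one sample, with probabilities clipped at eps."""
    return -sum(
        y * math.log(max(p, eps)) for y, p in zip(one_hot_label, class_probs)
    )


label = [0, 1, 0, 0]               # hypothetical 4-class one-hot label
confident = [0.1, 0.7, 0.1, 0.1]   # hypothetical classifier output
uniform = [0.25, 0.25, 0.25, 0.25]

loss_confident = cross_entropy(label, confident)
loss_uniform = cross_entropy(label, uniform)
```

The loss is lower when the classifier places more probability on the true class, so `loss_confident` is smaller than `loss_uniform`.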

In summary, the unsupervised statistical feature-guided diffusion model (SF-DM) and HAR classifier are trained in a two-step process to improve wearable sensor-based human activity recognition. The SF-DM is trained in an unsupervised statistical feature-guided way, while the HAR classifier is pretrained with synthetic sensor data and then fine-tuned with real sensor data. The training details for the unsupervised statistical feature-guided diffusion model (SF-DM) and HAR classifier are provided in Algorithm 1.

4. Experimental Results

Table 1. Comparison of MM-Fit, PAMAP2, and Opportunity (locomotion) datasets

| | MM-Fit | PAMAP2 | Opportunity |
| --- | --- | --- | --- |
| Num. of activities | 10 | 18 | 4 |
| Num. of subjects | 21 | 9 | 4 |
| Sensor | Accelerometer | Accelerometer | Accelerometer |
| Frequency (Hz) | 100 | 100 | 30 |
| Total duration (s) | 48540 | 27248.27 | 20240.4 |
| Position | wrist | wrist | knee |

4.1. Datasets

In this section, we conducted comprehensive evaluations of SF-DM on three different public datasets: MM-Fit (Strömbäck et al., 2020), PAMAP2 (Reiss and Stricker, 2012), and Opportunity (Chavarriaga et al., 2013). The detailed information of the datasets is summarized in Table 1.

MM-Fit comprises wearable sensor data collected from diverse time-synchronized devices during ten full-body exercises by multiple subjects. The data is collected from different devices, including depth cameras, smartphones, smartwatches, and earbuds, capturing accelerometer, gyroscope, and magnetometer modalities. For our experiment, we utilized the accelerometer data of the smartwatch worn on the left hand with a sampling frequency of 100 Hz.

PAMAP2 contains data from nine subjects (one female, eight right-handed) wearing three inertial measurement units and a heart rate monitor while engaging in 18 different physical activities. In our experiment, we used the protocol set. We chose the accelerometer sensor data from the IMU sensor placed on the dominant hand, sampled at 100 Hz. Subject 109 was excluded as they performed only a few motions. We restricted our analysis to those motions executed by all the selected subjects (ironing, lying, sitting, standing, walking, ascending stairs, descending stairs, vacuum cleaning, and non-activity). Data from subjects 101-106 were used for training, data from subject 107 was chosen for validation, and subject 108’s data was used for testing.

The Opportunity dataset captures on-body sensor data during naturalistic human activities, categorized into Drill (sequential pre-defined activities) and ADL (a high-level task with a flexible sequence of atomic activities) sessions. A challenge of the Opportunity dataset is that it comprises recordings of 4 participants using solely on-body sensors, with five unsegmented recordings provided for each subject. We chose locomotion as the target activity for classification and used only the data from the accelerometer RKN_ (placed on the right leg below the knee), because a large amount of data is missing from the sensor that was placed on the wrist. During the implementation, we followed the recommendations of the paper that proposed the sets. Only data from subjects 1, 2, and 3 were used. We kept ADL4 and ADL5 from subjects 2 and 3 for testing, ADL5 from subject 1 for validation, and the remaining data for training.

To ensure fairness, we conducted a five-fold cross-validation, separating data based on subjects. For example, for the MM-Fit dataset, the training data include subjects '01', '02', '03', '04', '06', '07', '08', '14', '15', '16', '17', and '18', with subject '19' used for validation and subjects '09', '10', and '11' used for testing. The window size (number of data points per window) for MM-Fit, PAMAP2, and Opportunity is 400, 200, and 200, respectively. Each window is labeled with its majority class, i.e., the class that has the highest number of data points within the window. Across all datasets, we use the Euclidean norm ($\sqrt{x^{2}+y^{2}+z^{2}}$) of the accelerometer data collected from the three axes (x, y, z) as input.
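The preprocessing described here (Euclidean norm of the three axes, fixed-size windows, majority-class labels) can be sketched as below. The stride between windows is not stated in the text, so `stride` is a hypothetical parameter, and the function names are illustrative only.

```python
import math
from collections import Counter


def acceleration_norm(samples):
    """Euclidean norm sqrt(x^2 + y^2 + z^2) for each (x, y, z) sample."""
    return [math.sqrt(x * x + y * y + z * z) for x, y, z in samples]


def make_windows(signal, labels, window_size, stride):
    """Cut the signal into windows and label each with its majority class."""
    windows = []
    for start in range(0, len(signal) - window_size + 1, stride):
        segment = signal[start:start + window_size]
        segment_labels = labels[start:start + window_size]
        majority = Counter(segment_labels).most_common(1)[0][0]
        windows.append((segment, majority))
    return windows


# Tiny hypothetical recording: 6 samples, window size 3, non-overlapping windows.
xyz = [(3.0, 4.0, 0.0), (0.0, 0.0, 1.0), (1.0, 0.0, 0.0),
       (0.0, 2.0, 0.0), (0.0, 0.0, 2.0), (2.0, 0.0, 0.0)]
activity = ["walk", "walk", "sit", "sit", "sit", "sit"]
norm = acceleration_norm(xyz)
windows = make_windows(norm, activity, window_size=3, stride=3)
```

In the paper's setting, `window_size` would be 400 for MM-Fit and 200 for PAMAP2 and Opportunity; the small values above only keep the example readable.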

4.2. Implementation Details

All experiments were conducted in Python using the PyTorch framework on a Linux system with a Tesla P100 GPU. We chose the Adam optimizer (Kingma and Ba, 2014) and initialized the learning rate with a value of 0.0002. The step size of the forward diffusion process is controlled by a variance schedule $\{\beta_{t}\in(0.0001,0.05)\}$, where $t$ ranges from 1 to $T$. The maximum diffusion step $T$ is set to 50. The batch size was 128, and training spanned 200 epochs with early stopping using a patience of 20 epochs to counter overfitting.
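The hyperparameters above can be made concrete with a short sketch. Two points are assumptions rather than statements from the paper: the variance schedule is taken to be linearly spaced between 0.0001 and 0.05 (the text gives only the range), and early stopping is taken to monitor a validation loss. The names `linear_beta_schedule` and `EarlyStopping` are hypothetical.

```python
def linear_beta_schedule(beta_start=0.0001, beta_end=0.05, num_steps=50):
    """Linearly spaced variance schedule (assumed spacing; range from the text)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


class EarlyStopping:
    """Stop when the monitored loss has not improved for `patience` epochs."""

    def __init__(self, patience=20):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def should_stop(self, val_loss):
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


betas = linear_beta_schedule()

# Small patience only to keep the demonstration short.
stopper = EarlyStopping(patience=2)
decisions = [stopper.should_stop(loss) for loss in [1.0, 0.9, 0.95, 0.97]]
```

In a training loop, `should_stop` would be called once per epoch and the loop would break when it returns `True`, capped at the 200 epochs mentioned above.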

4.3. Comparison Baseline

To evaluate the classification performance, we compare our model with conventional oversampling techniques (SMOTE (Chawla et al., 2002) and SVM-SMOTE (Nguyen et al., 2011; Balaha and Hassan, 2023)) and a GAN-based approach (TimeGAN (Yoon et al., 2019)). Existing GAN-based models (Yao et al., 2018; Wang et al., 2018; Li et al., 2020; Hu, 2023; Yang et al., 2023) produce generated data of limited diversity and are difficult to train due to mode collapse and unstable convergence. Only TimeGAN could be trained stably and successfully generate wearable sensor data for HAR. Hence, we selected TimeGAN as the representative GAN-based model for comparison. Furthermore, we introduce a class conditional diffusion model (CC-DM, see Fig. 5), which is guided solely by label information. Because CC-DM uses the encoder-decoder structure of our proposed diffusion model, which is similar to the U-Net architecture, and is conditioned only on label information, we consider it comparable to the method presented by Shao and Sanchez (Shao and Sanchez, 2023).

To ensure fairness, all models followed the same two-step training procedure: (1) pretraining the data generation model, (2) training the HAR classifier on synthetic data followed by fine-tuning with 20% of real data. It is worth mentioning that we did not train separate models for each class with SF-DM. The concept is to generate IMU data based on specific statistical information rather than using labeled data. In contrast, for TimeGAN, we trained a separate model for each class in all three datasets.

We repeated each experiment ten times and averaged results for classification accuracy and Macro F1 score.
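Since both accuracy and Macro F1 are reported, it may help to see how the two can diverge on imbalanced data. The sketch below implements both metrics directly; the labels are hypothetical and the example is not drawn from the paper's datasets.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# A classifier that always predicts the majority class.
y_true = ["rest"] * 8 + ["squat"] * 2
y_pred = ["rest"] * 10

acc = accuracy(y_true, y_pred)
mf1 = macro_f1(y_true, y_pred)
```

Here the accuracy is 0.8 while the Macro F1 is about 0.44, because the minority class contributes an F1 of zero. This is one reason a Macro F1 score can sit well below the accuracy on datasets with a dominant class.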

Figure 5. HAR with Class Conditional Diffusion Model.

To demonstrate the impact of SF-DM on HAR performance when data is scarce, we additionally train the diffusion model with varying quantities of real data. We compare the Macro F1 score of two models: a baseline trained solely on a given proportion of the real data, and a model that is pretrained on synthetic sensor data from an SF-DM trained on that same proportion and then fine-tuned on the same real data (SF-DM[Corresp.P.]). In a further setting, SF-DM[Proportion: 1], we train SF-DM on the full dataset while the HAR classifier is trained on a given proportion of the real data. The proportions of real data used are [0.2, 0.3, 0.4, 0.5, 1]. The same ablation is conducted for CC-DM: we train CC-DMs with the corresponding proportion of real data and evaluate their classification performance. Because the direct evaluation of generative models is subjective, we assess the diffusion models by accuracy and Macro F1 score on the HAR task.

Figure 6. Examples of generated sensor data (Y-axis: norm acceleration of three axes (x, y, z)). The data in red indicates the synthetic sensor data from the diffusion model, while the blue plot represents the real sensor data. Panels: (a) MM-Fit dataset, (b) Squats, (c) Lunges, (d) Bicep curls, (e) Situps, (f) Dumbbell rows, (g) Dumbbell shoulder press, (h) Tricep extensions, (i) Lateral shoulder raises.
Figure 7. Examples of generated sensor data (Y-axis: norm acceleration of three axes (x, y, z)). The data in red indicates the synthetic sensor data from the diffusion model, while the blue plot represents the real sensor data. Panels: (a) PAMAP2 dataset, (b) lying, (c) sitting, (d) standing, (e) walking, (f) ascending stairs, (g) descending stairs, (h) ironing, (i) vacuum cleaning.
Figure 8. Examples of generated sensor data (Y-axis: norm acceleration of three axes (x, y, z)). The data in red indicates the synthetic sensor data from the diffusion model, while the blue plot represents the real sensor data. Panels: (a) Opportunity dataset, (b) Lie, (c) Sit, (d) Stand, (e) Walk.

4.4. Comparison Results

Visual examples of generated synthetic sensor data from MM-Fit, PAMAP2 and Opportunity datasets are illustrated in Fig. 6, Fig. 7 and Fig. 8. Fig. 6(b) to Fig. 6(i) depict the sample results from different classes in MM-Fit dataset. It is worth mentioning that the synthetic sensor data from the model are able to capture the general trends of real data while exhibiting slight variations in finer details across all classes. In the generated results from PAMAP2 and Opportunity datasets (shown in Fig. 7 and Fig. 8), although the synthetic data appears to have a higher frequency, it still captures the signal tendencies quite effectively across different classes.

Table 2. Comparison of human activity recognition results on different datasets with 20% of the labeled real data

| Dataset | Metric | Baseline | SMOTE (Chawla et al., 2002) | SVM-SMOTE (Nguyen et al., 2011; Balaha and Hassan, 2023) | TimeGAN (Yoon et al., 2019) | CC-DM (Shao and Sanchez, 2023) | SF-DM |
| --- | --- | --- | --- | --- | --- | --- | --- |
| MM-Fit | Accuracy | 0.831 ± 0.0055 | 0.830 ± 0.0065 | 0.836 ± 0.0051 | 0.828 ± 0.0068 | 0.834 ± 0.0072 | 0.849 ± 0.0072 |
| MM-Fit | Macro F1 | 0.320 ± 0.0394 | 0.271 ± 0.0463 | 0.366 ± 0.0169 | 0.249 ± 0.0516 | 0.347 ± 0.0410 | 0.386 ± 0.0495 |
| PAMAP2 | Accuracy | 0.471 ± 0.0182 | 0.464 ± 0.0054 | 0.481 ± 0.0086 | 0.482 ± 0.0081 | 0.464 ± 0.0085 | 0.494 ± 0.0081 |
| PAMAP2 | Macro F1 | 0.384 ± 0.0253 | 0.376 ± 0.0073 | 0.392 ± 0.0109 | 0.413 ± 0.0122 | 0.384 ± 0.0135 | 0.413 ± 0.0078 |
| Oppo. | Accuracy | 0.408 ± 0.01951 | 0.495 ± 0.0165 | 0.482 ± 0.0244 | 0.449 ± 0.0301 | 0.430 ± 0.0234 | 0.509 ± 0.0264 |
| Oppo. | Macro F1 | 0.258 ± 0.0303 | 0.364 ± 0.0191 | 0.340 ± 0.0369 | 0.329 ± 0.0436 | 0.276 ± 0.0433 | 0.386 ± 0.0437 |

Table 2 presents the HAR comparison results in terms of accuracy and macro F1 score for several methods on different datasets when only 20% of the labeled sensor data is available for training. Compared to the baseline, the classifier pretrained using our diffusion approach SF-DM consistently yields significant performance improvements in accuracy and macro F1 score across all datasets: the relative accuracy improvements range from 2.1% to 24.8%, and the relative macro F1 improvements range from 7.5% to 49.6%. The largest gains, 24.8% in accuracy and 49.6% in macro F1 score, are obtained on the Opportunity dataset. Moreover, on these datasets, SF-DM also outperforms the oversampling approaches and even TimeGAN by a relative margin ranging from 1.5% to 13.4% in accuracy and by up to 55.0% in macro F1 score.
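The percentages quoted here appear to be relative gains over the baseline means in Table 2. The sketch below recomputes them from the tabulated means; the results agree with the quoted figures to within about 0.1 percentage points, and the small differences presumably come from rounding of the table entries. The helper name `relative_gain` is hypothetical.

```python
def relative_gain(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return 100.0 * (new - old) / old


# (baseline mean, SF-DM mean) copied from Table 2.
acc_table = {
    "MM-Fit": (0.831, 0.849),
    "PAMAP2": (0.471, 0.494),
    "Opportunity": (0.408, 0.509),
}
f1_table = {
    "MM-Fit": (0.320, 0.386),
    "PAMAP2": (0.384, 0.413),
    "Opportunity": (0.258, 0.386),
}

acc_gain = {name: relative_gain(sf, base) for name, (base, sf) in acc_table.items()}
f1_gain = {name: relative_gain(sf, base) for name, (base, sf) in f1_table.items()}

for name in acc_table:
    print(f"{name}: accuracy +{acc_gain[name]:.1f}%, macro F1 +{f1_gain[name]:.1f}%")
```

The same helper applied to the TimeGAN and SF-DM macro F1 means on MM-Fit (0.249 and 0.386) gives roughly 55%, matching the maximum margin quoted against the other methods.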

On the Opportunity dataset, our method significantly outperforms the baseline, SMOTE, TimeGAN, and CC-DM in both accuracy and macro F1 score.

We also highlight the efficiency of training with SF-DM compared to TimeGAN. Since the proposed model is trained in an unsupervised way, there is no need to train separate models for each activity, and the quality of the generated data is still maintained. With models like TimeGAN, however, a separate model must be trained for each class. Taking the MM-Fit dataset as an example, TimeGAN took around 2 hours 25 minutes to train a model for a single activity with 20% of the data, while SF-DM took 5 hours 45 minutes to train for all activities on the same platform. Since the MM-Fit dataset includes 10 activities, the total training time for TimeGAN would be around 24 hours, which is more than 4 times longer than for SF-DM.
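The training-time comparison is simple arithmetic and can be checked with the standard library, using only the durations stated in the text.

```python
from datetime import timedelta

timegan_per_activity = timedelta(hours=2, minutes=25)  # one MM-Fit activity
sfdm_all_activities = timedelta(hours=5, minutes=45)   # all MM-Fit activities
num_activities = 10

# TimeGAN needs one model per activity, so the per-activity time is multiplied.
timegan_total = timegan_per_activity * num_activities
ratio = timegan_total / sfdm_all_activities

print(timegan_total, round(ratio, 2))
```

The total for TimeGAN comes to 24 hours 10 minutes, and the ratio is a little over 4.2, consistent with the "more than 4 times" statement.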

4.5. Ablation Study

Table 3. Ablation study for the diffusion model on MM-Fit, PAMAP2, and Opportunity datasets by changing the proportion of labeled real sensor data.

| Proportion | Model | MM-Fit Acc. | MM-Fit M.F1 | PAMAP2 Acc. | PAMAP2 M.F1 | Opportunity Acc. | Opportunity M.F1 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0.2 | baseline | 0.830 ± 0.0050 | 0.322 ± 0.0498 | 0.478 ± 0.0070 | 0.391 ± 0.0162 | 0.421 ± 0.0276 | 0.272 ± 0.0406 |
| 0.2 | CC-DM | 0.828 ± 0.0059 | 0.321 ± 0.0593 | 0.480 ± 0.0172 | 0.396 ± 0.0180 | 0.391 ± 0.0229 | 0.201 ± 0.0378 |
| 0.2 | SF-DM[Corresp.P.] | 0.830 ± 0.0080 | 0.353 ± 0.0641 | 0.490 ± 0.0120 | 0.409 ± 0.0262 | 0.467 ± 0.0313 | 0.305 ± 0.0464 |
| 0.2 | SF-DM[Proportion: 1] | 0.834 ± 0.0026 | 0.335 ± 0.0212 | 0.492 ± 0.0141 | 0.410 ± 0.0131 | 0.507 ± 0.0422 | 0.416 ± 0.0771 |
| 0.3 | baseline | 0.853 ± 0.0135 | 0.406 ± 0.0907 | 0.499 ± 0.0085 | 0.444 ± 0.0081 | 0.391 ± 0.0239 | 0.215 ± 0.0442 |
| 0.3 | CC-DM | 0.867 ± 0.0142 | 0.511 ± 0.0646 | 0.510 ± 0.0122 | 0.462 ± 0.0202 | 0.482 ± 0.0358 | 0.362 ± 0.0514 |
| 0.3 | SF-DM[Corresp.P.] | 0.852 ± 0.0040 | 0.446 ± 0.0232 | 0.514 ± 0.0095 | 0.452 ± 0.0081 | 0.546 ± 0.0183 | 0.435 ± 0.0345 |
| 0.3 | SF-DM[Proportion: 1] | 0.874 ± 0.0067 | 0.517 ± 0.0210 | 0.517 ± 0.0096 | 0.453 ± 0.0141 | 0.585 ± 0.0153 | 0.537 ± 0.0163 |
| 0.4 | baseline | 0.874 ± 0.0091 | 0.530 ± 0.0559 | 0.523 ± 0.0174 | 0.467 ± 0.0239 | 0.403 ± 0.0346 | 0.238 ± 0.0590 |
| 0.4 | CC-DM | 0.871 ± 0.0125 | 0.524 ± 0.0489 | 0.525 ± 0.0138 | 0.436 ± 0.0174 | 0.496 ± 0.0392 | 0.360 ± 0.0652 |
| 0.4 | SF-DM[Corresp.P.] | 0.882 ± 0.0071 | 0.643 ± 0.0184 | 0.507 ± 0.0068 | 0.453 ± 0.0113 | 0.560 ± 0.0269 | 0.471 ± 0.0477 |
| 0.4 | SF-DM[Proportion: 1] | 0.879 ± 0.0014 | 0.565 ± 0.0078 | 0.525 ± 0.0128 | 0.453 ± 0.0142 | 0.568 ± 0.0446 | 0.515 ± 0.0492 |
| 0.5 | baseline | 0.857 ± 0.0186 | 0.451 ± 0.0957 | 0.520 ± 0.0057 | 0.453 ± 0.0057 | 0.456 ± 0.0459 | 0.302 ± 0.0751 |
| 0.5 | CC-DM | 0.883 ± 0.0044 | 0.567 ± 0.0190 | 0.495 ± 0.0135 | 0.424 ± 0.0139 | 0.593 ± 0.0244 | 0.511 ± 0.0386 |
| 0.5 | SF-DM[Corresp.P.] | 0.877 ± 0.0114 | 0.530 ± 0.0605 | 0.501 ± 0.0095 | 0.445 ± 0.0146 | 0.586 ± 0.0292 | 0.509 ± 0.0470 |
| 0.5 | SF-DM[Proportion: 1] | 0.884 ± 0.0062 | 0.567 ± 0.0121 | 0.500 ± 0.0116 | 0.444 ± 0.0170 | 0.627 ± 0.0073 | 0.573 ± 0.0164 |
| 1 | baseline | 0.907 ± 0.0039 | 0.643 ± 0.0164 | 0.551 ± 0.0094 | 0.497 ± 0.0130 | 0.516 ± 0.0201 | 0.408 ± 0.0346 |
| 1 | CC-DM | 0.901 ± 0.0040 | 0.653 ± 0.0132 | 0.556 ± 0.0137 | 0.504 ± 0.0168 | 0.552 ± 0.0247 | 0.453 ± 0.0409 |
| 1 | SF-DM[Corresp.P.] | 0.882 ± 0.0062 | 0.534 ± 0.0296 | 0.554 ± 0.0165 | 0.471 ± 0.0217 | 0.636 ± 0.0433 | 0.562 ± 0.0029 |
| 1 | SF-DM[Proportion: 1] | 0.906 ± 0.0085 | 0.659 ± 0.0308 | 0.554 ± 0.0140 | 0.507 ± 0.0172 | 0.647 ± 0.0169 | 0.598 ± 0.0115 |

The ablation study results in Table 3 show the impact of varying the proportion of labeled real data used for training on the three datasets. In terms of Macro F1 score, SF-DM[Proportion: 1] outperforms the baseline and the other methods in all cases on the Opportunity dataset and in most cases on the other datasets, which highlights the adequacy and effectiveness of the statistical features.

When reducing the amount of unlabeled data available to the diffusion models, we observe decreased improvements and, in some cases, even performance degradation compared to SF-DM [Proportion: 1] or below the baseline. This could be attributed to the instability associated with training diffusion models with limited data. When faced with a scarcity of training samples, the model’s ability to generalize effectively can be compromised, leading to performance fluctuations and difficulty in accurately capturing the underlying data distribution.

We also noticed that when the proportion of real data increases from 0.2-0.3 to 0.4-0.5, the performance of the method slightly falls behind the baseline, especially on PAMAP2 dataset. One possible explanation is that the addition of real data may introduce noise or irrelevant information that adversely affects the model’s ability to generalize effectively. As the proportion of real data increases, there is a higher likelihood of encountering outliers or instances that do not match the underlying patterns captured by the model, leading to performance degradation. In this case, we recommend utilizing the method when a small amount of data is available.

Interestingly, in cases where SF-DM shows clearer improvements, such as on Opportunity, using only partial data with our statistical information (SF-DM[Corresp.P.]) is overall better than using the label information for diffusion (CC-DM), which can often fall below the baseline results. This indicates the potential of our approach for scenarios with limited labeled data.

Besides, for most of the activities, the model pretrained with synthetic data from SF-DM has a better recognition rate. For data in some categories that have similar signals, for instance, lying, sitting, and standing, synthetic data may not fully capture the subtle differences between these activities, leading to a degradation in performance when the model encounters such instances during testing.

The quantitative evaluation results of the ablation study in Table 3, in terms of Macro F1 score on the three datasets, are also shown visually in Fig. 9. Increasing the data ratio from 0.2 to 0.3 results in a significant increase in the Macro F1 score for most of the models across all datasets. This indicates that a higher proportion of data contributes positively to the model's performance, potentially allowing it to capture more diverse patterns and improve its ability to generalize across different instances. However, this trend is not consistently observed when the ratio is increased further, particularly from 0.3 to 0.5. In this interval, the performance of the models varies, indicating that simply increasing the data ratio does not guarantee continued improvement in model performance. This variability suggests the impact of other factors, such as the quality and diversity of the data.

To summarize, the improved performance of the proposed model indicates the capability of SF-DM to capture valuable features from sensor data, guided by selected statistical information. By eliminating the need for labeled data during diffusion model training, we improve the generalizability of the model. As a result, trained SF-DM can be used to generate large amounts of synthetic sensor data that contain the general trends of the real data with slight variations. We also checked the confusion matrices from both the baseline classifier and the classifier pretrained with synthetic data from SF-DM. We noticed that, for most of the categories, the model pretrained with synthetic data from SF-DM has a better recognition rate. For data in some categories that have similar signals, for instance, lie and sit, the classifier pretrained with synthetic data had a lower recognition rate in some trials.

Figure 9. Ablation study for the proposed diffusion model on MM-Fit, PAMAP2, and Opportunity datasets by changing the proportion of labeled real sensor data. X-axis: proportion of labeled real data usage; Y-axis: Macro F1 score. Panels: (a) MM-Fit, (b) PAMAP2, (c) Opportunity.

5. Discussion

To further improve the performance of sensor data generation with the diffusion model and its application in HAR, several approaches can be considered.

Firstly, the potential for feature diversification is significant. While the accelerometer sensor data has been employed for feature extraction, exploiting information from other modalities could provide richer motion insights. Incorporating RGB and depth data from cameras or other sensor types could offer a more comprehensive view of motion dynamics. This augmentation can capture intricate movement subtleties and improve the quality of synthetic sensor data. Furthermore, the scope of the diffusion model is not confined to single-modality data. Extending the model's capability to generate multi-modal data is a promising step towards robust HAR enhancement. Integrating data from diverse sources like audio, video, or even textual descriptions has the potential to significantly enhance the quality and realism of synthetic sensor data, improve the understanding of complex human activities, and facilitate more accurate and robust analysis in applications such as human activity recognition and motion tracking.

Secondly, the current model SF-DM can be extended to generate latent representations rather than raw sensor data. This strategy leverages the diffusion model’s ability to learn shared and relevant features across multiple modalities. By training different autoencoders, the latent representation obtained from the diffusion model can be effectively translated into various data modalities by simply providing modality-specific information. This extension enriches the diversity and effectiveness of learned representations, potentially leading to more powerful models for HAR.

Furthermore, unlike vision-based data, there are fewer useful metrics available for evaluating the quality of sensor data. Due to the nature of sensor data, factors such as temporal dynamics, signal noise, and sensor fusion should be considered. Novel evaluation metrics can be devised to capture the nuanced characteristics of HAR datasets (e.g., data from lying, sitting, and standing). The development and utilization of appropriate evaluation metrics specifically tailored to assess the quality and validity of sensor data generated in the context of HAR enable better assurance of measurement accuracy and validation of performance improvements.

6. Conclusion

In this paper, we proposed a novel unsupervised statistical feature-guided diffusion model for sensor-based HAR. Operating within an encoder-decoder framework, our diffusion model employs statistical features, including mean, standard deviation, Z-score, and skewness, to generate high-quality sensor data without relying on class label information. Quantitative evaluations show that the model consistently outperforms traditional oversampling methods and TimeGAN, and that it improves on the baseline classifier by 2.1% to 24.8% in accuracy and by 7.5% to 49.6% in macro F1 score. Visual inspection shows that the synthetic sensor data generated by our model faithfully capture the essence of the real data, albeit with subtle variations in detail across different classes.

The significance of our work extends beyond HAR, as our approach has the potential to be applied in various domains that require time-series data, especially in scenarios with limited labeled data. In future work, we aim to optimize the training procedure of our framework and utilize class distribution information. In addition, we plan to extend it to sensor data generation from other modalities, such as video data, and explore additional statistical features from both the time and frequency domains.

References

  • Alharbi et al. (2022) Fayez Alharbi, Lahcen Ouarbya, and Jamie A Ward. 2022. Comparing sampling strategies for tackling imbalanced data in human activity recognition. Sensors 22, 4 (2022), 1373.
  • Bachlin et al. (2009) Marc Bachlin, Meir Plotnik, Daniel Roggen, Inbal Maidan, Jeffrey M Hausdorff, Nir Giladi, and Gerhard Troster. 2009. Wearable assistant for Parkinson’s disease patients with the freezing of gait symptom. IEEE Transactions on Information Technology in Biomedicine 14, 2 (2009), 436–446.
  • Balabka (2019) Dmitrijs Balabka. 2019. Semi-supervised learning for human activity recognition using adversarial autoencoders. In Adjunct proceedings of the 2019 ACM international joint conference on pervasive and ubiquitous computing and proceedings of the 2019 ACM international symposium on wearable computers. 685–688.
  • Balaha and Hassan (2023) Hossam Magdy Balaha and Asmaa El-Sayed Hassan. 2023. Comprehensive machine and deep learning analysis of sensor-based human activity recognition. Neural Computing and Applications 35, 17 (2023), 12793–12831.
  • Banos et al. (2021) Oresti Banos, Alberto Calatroni, Miguel Damas, Hector Pomares, Daniel Roggen, Ignacio Rojas, and Claudia Villalonga. 2021. Opportunistic activity recognition in IoT sensor ecosystems via multimodal transfer learning. Neural Processing Letters (2021), 1–29.
  • Banos et al. (2012) Oresti Banos, Alberto Calatroni, Miguel Damas, Héctor Pomares, Ignacio Rojas, Hesam Sagha, Jose del R Mill, Gerhard Troster, Ricardo Chavarriaga, Daniel Roggen, et al. 2012. Kinect= imu? learning mimo signal map**s to automatically translate activity recognition systems across sensor modalities. In 2012 16th International Symposium on Wearable Computers. IEEE, 92–99.
  • Barua et al. (2012) Sukarna Barua, Md Monirul Islam, Xin Yao, and Kazuyuki Murase. 2012. MWMOTE–majority weighted minority oversampling technique for imbalanced data set learning. IEEE Transactions on knowledge and data engineering 26, 2 (2012), 405–425.
  • Chavarriaga et al. (2013) Ricardo Chavarriaga, Hesam Sagha, Alberto Calatroni, Sundara Tejaswi Digumarti, Gerhard Tröster, José del R Millán, and Daniel Roggen. 2013. The Opportunity challenge: A benchmark database for on-body sensor-based activity recognition. Pattern Recognition Letters 34, 15 (2013), 2033–2042.
  • Chawla et al. (2002) Nitesh V Chawla, Kevin W Bowyer, Lawrence O Hall, and W Philip Kegelmeyer. 2002. SMOTE: synthetic minority over-sampling technique. Journal of artificial intelligence research 16 (2002), 321–357.
  • Cheng et al. (2023) Dongzhou Cheng, Lei Zhang, Can Bu, Hao Wu, and Aiguo Song. 2023. Learning hierarchical time series data augmentation invariances via contrastive supervision for human activity recognition. Knowledge-Based Systems (2023), 110789.
  • Dahou et al. (2023) Abdelghani Dahou, Mohammed AA Al-qaness, Mohamed Abd Elaziz, and Ahmed M Helmi. 2023. MLCNNwav: Multi-level Convolutional Neural Network with Wavelet Transformations for Sensor-based Human Activity Recognition. IEEE Internet of Things Journal (2023).
  • Deldari et al. (2022) Shohreh Deldari, Hao Xue, Aaqib Saeed, Daniel V Smith, and Flora D Salim. 2022. Cocoa: Cross modality contrastive learning for sensor data. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 6, 3 (2022), 1–28.
  • Derawi et al. (2010) Mohammad Omar Derawi, Claudia Nickel, Patrick Bours, and Christoph Busch. 2010. Unobtrusive user-authentication on mobile phones using biometric gait recognition. In 2010 Sixth International Conference on Intelligent Information Hiding and Multimedia Signal Processing. IEEE, 306–311.
  • Dhariwal and Nichol (2021) Prafulla Dhariwal and Alexander Nichol. 2021. Diffusion Models Beat GANs on Image Synthesis. In Advances in Neural Information Processing Systems, M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan (Eds.), Vol. 34. Curran Associates, Inc., 8780–8794. https://proceedings.neurips.cc/paper_files/paper/2021/file/49ad23d1ec9fa4bd8d77d02681df5cfa-Paper.pdf
  • Dirgová Luptáková et al. (2022) Iveta Dirgová Luptáková, Martin Kubovčík, and Jiří Pospíchal. 2022. Wearable sensor-based human activity recognition with transformer model. Sensors 22, 5 (2022), 1911.
  • Dockhorn et al. (2022) Tim Dockhorn, Arash Vahdat, and Karsten Kreis. 2022. GENIE: Higher-order denoising diffusion solvers. arXiv preprint arXiv:2210.05475 (2022).
  • Feng et al. (2024) Haotian Feng, Qiang Shen, Rui Song, Lida Shi, and Hao Xu. 2024. ATFA: Adversarial Time–Frequency Attention network for sensor-based multimodal human activity recognition. Expert Systems with Applications 236 (2024), 121296.
  • Ferrari et al. (2023) Anna Ferrari et al. 2023. Deep learning and model personalization in sensor-based human activity recognition. Journal of Reliable Intelligent Environments 9, 1 (2023), 27–39.
  • Fischer et al. (2021) Hannah Friederike Fischer, Daniela Wittmann, Alejandro Baucells Costa, Bo Zhou, Gesche Joost, and Paul Lukowicz. 2021. Masquare: A Functional Smart Mask Design for Health Monitoring. In 2021 International Symposium on Wearable Computers. 175–178.
  • Fortes Rey et al. (2021) Vitor Fortes Rey, Kamalveer Kaur Garewal, and Paul Lukowicz. 2021. Translating videos into synthetic training data for wearable sensor-based activity recognition systems using residual deep convolutional networks. Applied Sciences 11, 7 (2021), 3094.
  • Fortes Rey et al. (2022) Vitor Fortes Rey, Sungho Suh, and Paul Lukowicz. 2022. Learning from the Best: Contrastive Representations Learning Across Sensor Locations for Wearable Activity Recognition. In Proceedings of the 2022 ACM International Symposium on Wearable Computers. 28–32.
  • Gao et al. (2024) Jiayuan Gao, Yingwei Zhang, Yiqiang Chen, Tengxiang Zhang, Boshi Tang, and Xiaoyu Wang. 2024. Unsupervised Human Activity Recognition Via Large Language Models and Iterative Evolution. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 91–95.
  • Gheid and Challal (2016) Zakaria Gheid and Yacine Challal. 2016. Novel Efficient and Privacy-Preserving Protocols For Sensor-Based Human Activity Recognition. In 13th International Conference on Ubiquitous Intelligence and Computing (UIC 2016). IEEE, 301–308.
  • Gong et al. (2022) Shansan Gong, Mukai Li, Jiangtao Feng, Zhiyong Wu, and LingPeng Kong. 2022. Diffuseq: Sequence to sequence text generation with diffusion models. arXiv preprint arXiv:2210.08933 (2022).
  • Goodfellow et al. (2014) Ian J. Goodfellow, Jean Pouget-Abadie, M. Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. 2014. Generative Adversarial Nets. In NIPS.
  • Gu et al. (2021) Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo Zhang, Dongdong Chen, Lu Yuan, and Baining Guo. 2021. Vector Quantized Diffusion Model for Text-to-Image Synthesis. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2021), 10686–10696.
  • Ho et al. (2020) Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising Diffusion Probabilistic Models. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (Eds.), Vol. 33. Curran Associates, Inc., 6840–6851. https://proceedings.neurips.cc/paper_files/paper/2020/file/4c5bcfec8584af0d967f1ab10179ca4b-Paper.pdf
  • Ho et al. (2022) Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J. Fleet. 2022. Video Diffusion Models. ArXiv abs/2204.03458 (2022).
  • Hu (2023) Yifan Hu. 2023. BSDGAN: Balancing Sensor Data Generative Adversarial Networks for Human Activity Recognition. In 2023 International Joint Conference on Neural Networks (IJCNN). IEEE, 1–8.
  • Huang et al. (2023) Shuokang Huang, Po-Yu Chen, and Julie McCann. 2023. DiffAR: adaptive conditional diffusion model for temporal-augmented human activity recognition. In Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence. 3812–3820.
  • Jeong et al. (2021) Chi Yoon Jeong, Hyung Cheol Shin, and Mooseop Kim. 2021. Sensor-data augmentation for human activity recognition with time-war** and data masking. Multimedia Tools and Applications 80 (2021), 20991–21009.
  • Jiang et al. (2021) Yanran Jiang, Peter Malliaras, Bernard Chen, and Dana Kulić. 2021. Model-based data augmentation for user-independent fatigue estimation. Computers in Biology and Medicine 137 (2021), 104839.
  • Jolicoeur-Martineau et al. (2020) Alexia Jolicoeur-Martineau, Remi Piche-Taillefer, Rémi Tachet des Combes, and Ioannis Mitliagkas. 2020. Adversarial score matching and improved sampling for image generation. ArXiv abs/2009.05475 (2020).
  • Kim et al. (2022) Jihoon Kim, Jiseob Kim, and Sungjoon Choi. 2022. Flame: Free-form language-based motion synthesis & editing. arXiv preprint arXiv:2209.00349 (2022).
  • Kim and Jeong (2021) Mooseop Kim and Chi Yoon Jeong. 2021. Label-preserving data augmentation for mobile sensor data. Multidimensional Systems and Signal Processing 32, 1 (2021), 115–129.
  • Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
  • Kong et al. (2020) Zhifeng Kong, Wei **, Jiaji Huang, Kexin Zhao, and Bryan Catanzaro. 2020. Diffwave: A versatile diffusion model for audio synthesis. arXiv preprint arXiv:2009.09761 (2020).
  • Kwon et al. (2020) Hyeokhyen Kwon, Catherine Tong, Harish Haresamudram, Yan Gao, Gregory D Abowd, Nicholas D Lane, and Thomas Ploetz. 2020. Imutube: Automatic extraction of virtual on-body accelerometry from video for human activity recognition. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 4, 3 (2020), 1–29.
  • Li et al. (2020) Xi’ang Li, **qi Luo, and Rabih Younes. 2020. ActivityGAN: Generative adversarial networks for data augmentation in sensor-based human activity recognition. In Adjunct proceedings of the 2020 ACM international joint conference on pervasive and ubiquitous computing and proceedings of the 2020 ACM international symposium on wearable computers. 249–254.
  • Lin et al. (2019) Hubert Lin, Paul Upchurch, and Kavita Bala. 2019. Block annotation: Better image annotation with sub-image decomposition. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 5290–5300.
  • Liu et al. (2024) Shengzhong Liu, Tomoyoshi Kimura, Dongxin Liu, Ruijie Wang, **yang Li, Suhas Diggavi, Mani Srivastava, and Tarek Abdelzaher. 2024. FOCAL: Contrastive learning for multimodal time-series sensing signals in factorized orthogonal latent space. Advances in Neural Information Processing Systems 36 (2024).
  • Lugmayr et al. (2022) Andreas Lugmayr, Martin Danelljan, Andres Romero, Fisher Yu, Radu Timofte, and Luc Van Gool. 2022. Repaint: Inpainting using denoising diffusion probabilistic models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 11461–11471.
  • Maharana et al. (2022) Kiran Maharana, Surajit Mondal, and Bhushankumar Nemade. 2022. A review: Data pre-processing and data augmentation techniques. Global Transitions Proceedings 3, 1 (2022), 91–99.
  • Mathur et al. (2018) Akhil Mathur, Tianlin Zhang, Sourav Bhattacharya, Petar Velickovic, Leonid Joffe, Nicholas D Lane, Fahim Kawsar, and Pietro Lió. 2018. Using deep data augmentation training to address software and hardware heterogeneities in wearable and smartphone sensing devices. In 2018 17th ACM/IEEE International Conference on Information Processing in Sensor Networks (IPSN). IEEE, 200–211.
  • Nguyen et al. (2011) Hien M Nguyen, Eric W Cooper, and Katsuari Kamei. 2011. Borderline over-sampling for imbalanced data classification. International Journal of Knowledge Engineering and Soft Data Paradigms 3, 1 (2011), 4–21.
  • Nichol and Dhariwal (2021) Alexander Quinn Nichol and Prafulla Dhariwal. 2021. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning. PMLR, 8162–8171.
  • Oh et al. (2021) Seungmin Oh, Akm Ashiquzzaman, Dongsu Lee, Yeonggwang Kim, and **sul Kim. 2021. Study on human activity recognition using semi-supervised active transfer learning. Sensors 21, 8 (2021), 2760.
  • Ohashi et al. (2017) Hiroki Ohashi, M Al-Nasser, Sheraz Ahmed, Takayuki Akiyama, Takuto Sato, Phong Nguyen, Katsuyuki Nakamura, and Andreas Dengel. 2017. Augmenting wearable sensor data with physical constraint for DNN-based human-action recognition. In ICML 2017 times series workshop. 6–11.
  • Park et al. (2023) Hyunseo Park, Nakyoung Kim, Gyeong Ho Lee, and Jun Kyun Choi. 2023. MultiCNN-FilterLSTM: Resource-efficient sensor-based human activity recognition in IoT applications. Future Generation Computer Systems 139 (2023), 196–209.
  • Pramanik et al. (2023) Rishav Pramanik, Ritodeep Sikdar, and Ram Sarkar. 2023. Transformer-based deep reverse attention network for multi-sensory human activity recognition. Engineering Applications of Artificial Intelligence 122 (2023), 106150.
  • Qi et al. (2018) Jun Qi, Po Yang, Martin Hanneghan, Stephen Tang, and Bo Zhou. 2018. A hybrid hierarchical framework for gym physical activity recognition and measurement using wearable sensors. IEEE Internet of Things Journal 6, 2 (2018), 1384–1393.
  • Reiss and Stricker (2012) Attila Reiss and Didier Stricker. 2012. Introducing a new benchmarked dataset for activity monitoring. In 2012 16th international symposium on wearable computers. IEEE, 108–109.
  • Rey et al. (2019) Vitor Fortes Rey, Peter Hevesi, Onorina Kovalenko, and Paul Lukowicz. 2019. Let there be IMU data: generating training data for wearable, motion sensor based activity recognition from monocular RGB videos. In Adjunct proceedings of the 2019 ACM international joint conference on pervasive and ubiquitous computing and proceedings of the 2019 ACM international symposium on wearable computers. 699–708.
  • Rombach et al. (2021) Robin Rombach, A. Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2021. High-Resolution Image Synthesis with Latent Diffusion Models. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2021), 10674–10685.
  • Ronneberger et al. (2015) Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18. Springer, 234–241.
  • Russell et al. (2008) Bryan C Russell, Antonio Torralba, Kevin P Murphy, and William T Freeman. 2008. LabelMe: a database and web-based tool for image annotation. International journal of computer vision 77 (2008), 157–173.
  • Santhalingam et al. (2023) Panneer Selvam Santhalingam, Parth Pathak, Huzefa Rangwala, and Jana Kosecka. 2023. Synthetic smartwatch imu data generation from in-the-wild asl videos. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 7, 2 (2023), 1–34.
  • Shao and Sanchez (2023) Shuai Shao and Victor Sanchez. 2023. A study on diffusion modelling for sensor-based human activity recognition. In 2023 11th International Workshop on Biometrics and Forensics (IWBF). IEEE, 1–7.
  • Sheng and Huber (2020) Taoran Sheng and Manfred Huber. 2020. Unsupervised embedding learning for human activity recognition using wearable sensor data. In The Thirty-Third International Flairs Conference.
  • Shorten and Khoshgoftaar (2019) Connor Shorten and Taghi M Khoshgoftaar. 2019. A survey on image data augmentation for deep learning. Journal of big data 6, 1 (2019), 1–48.
  • Sohl-Dickstein et al. (2015) Jascha Narain Sohl-Dickstein, Eric A. Weiss, Niru Maheswaranathan, and Surya Ganguli. 2015. Deep Unsupervised Learning using Nonequilibrium Thermodynamics. ArXiv abs/1503.03585 (2015).
  • Song et al. (2020) Jiaming Song, Chenlin Meng, and Stefano Ermon. 2020. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502 (2020).
  • Song et al. (2021) Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. 2021. Score-Based Generative Modeling through Stochastic Differential Equations. In International Conference on Learning Representations. https://openreview.net/forum?id=PxTIG12RRHS
  • Strömbäck et al. (2020) David Strömbäck, Sangxia Huang, and Valentin Radu. 2020. Mm-fit: Multimodal deep learning for automatic exercise logging across sensing devices. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 4, 4 (2020), 1–22.
  • Sundholm et al. (2014) Mathias Sundholm, **gyuan Cheng, Bo Zhou, Akash Sethi, and Paul Lukowicz. 2014. Smart-mat: Recognizing and counting gym exercises with low-cost resistive pressure sensing matrix. In Proceedings of the 2014 ACM international joint conference on pervasive and ubiquitous computing. 373–382.
  • Tang et al. (2024) Jie Tang, Bin He, Junkai Xu, Tian Tan, Zhipeng Wang, Yanmin Zhou, and Shuo Jiang. 2024. Synthetic IMU datasets and protocols can simplify fall detection experiments and optimize sensor configuration. IEEE transactions on neural systems and rehabilitation engineering (2024).
  • Uhlenberg et al. (2023) Lena Uhlenberg, Adrian Derungs, and Oliver Amft. 2023. Co-simulation of human digital twins and wearable inertial sensors to analyse gait event estimation. Frontiers in Bioengineering and Biotechnology 11 (2023), 1104000.
  • Um et al. (2017) Terry T Um, Franz MJ Pfister, Daniel Pichler, Satoshi Endo, Muriel Lang, Sandra Hirche, Urban Fietzek, and Dana Kulić. 2017. Data augmentation of wearable sensor data for parkinson’s disease monitoring using convolutional neural networks. In Proceedings of the 19th ACM international conference on multimodal interaction. 216–220.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems 30 (2017).
  • Wang et al. (2018) Jiwei Wang, Yiqiang Chen, Yang Gu, Yunlong Xiao, and Haonan Pan. 2018. Sensorygans: An effective generative adversarial framework for sensor-based human activity recognition. In 2018 International Joint Conference on Neural Networks (IJCNN). IEEE, 1–8.
  • Xiao et al. (2021) Fanyi Xiao, Ling Pei, Lei Chu, Dan** Zou, Wenxian Yu, Yifan Zhu, and Tao Li. 2021. A deep learning method for complex human activity recognition using virtual wearable sensors. In Spatial Data and Intelligence: First International Conference, SpatialDI 2020, Virtual Event, May 8–9, 2020, Proceedings 1. Springer, 261–270.
  • Xiao et al. (2022) Shuo Xiao, Shengzhi Wang, Zhenzhen Huang, Yu Wang, and Haifeng Jiang. 2022. Two-stream transformer network for sensor-based human activity recognition. Neurocomputing 512 (2022), 253–268.
  • Xing et al. (2018) Tianwei Xing, Sandeep Singh Sandha, Bharathan Balaji, Supriyo Chakraborty, and Mani Srivastava. 2018. Enabling edge devices that learn from each other: Cross modal training for activity recognition. In Proceedings of the 1st International Workshop on Edge Systems, Analytics and Networking. 37–42.
  • Xu et al. (2022) Minkai Xu, Lantao Yu, Yang Song, Chence Shi, Stefano Ermon, and Jian Tang. 2022. Geodiff: A geometric diffusion model for molecular conformation generation. arXiv preprint arXiv:2203.02923 (2022).
  • Yang et al. (2023) Zhenyu Yang, Yantao Li, and Gang Zhou. 2023. Ts-gan: Time-series gan for sensor-based health data augmentation. ACM Transactions on Computing for Healthcare 4, 2 (2023), 1–21.
  • Yao et al. (2018) Shuochao Yao, Yiran Zhao, Huajie Shao, Chao Zhang, Aston Zhang, Shaohan Hu, Dongxin Liu, Shengzhong Liu, Lu Su, and Tarek Abdelzaher. 2018. Sensegan: Enabling deep learning for internet of things with a semi-supervised framework. Proceedings of the ACM on interactive, mobile, wearable and ubiquitous technologies 2, 3 (2018), 1–21.
  • Yoon et al. (2019) **sung Yoon, Daniel Jarrett, and Mihaela Van der Schaar. 2019. Time-series generative adversarial networks. Advances in neural information processing systems 32 (2019).
  • Zeiler et al. (2010) Matthew D Zeiler, Dilip Krishnan, Graham W Taylor, and Rob Fergus. 2010. Deconvolutional networks. In 2010 IEEE Computer Society Conference on computer vision and pattern recognition. IEEE, 2528–2535.
  • Zhang et al. (2019) Dalin Zhang, Lina Yao, Kaixuan Chen, Guodong Long, and Sen Wang. 2019. Collective protection: Preventing sensitive inferences via integrative transformation. In 2019 IEEE international conference on data mining (ICDM). IEEE, 1498–1503.
  • Zhou et al. (2022) Bo Zhou, Sungho Suh, Vitor Fortes Rey, Carlos Andres Velez Altamirano, and Paul Lukowicz. 2022. Quali-Mat: Evaluating the Quality of Execution in Body-Weight Exercises with a Pressure Sensitive Sports Mat. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 6, 2 (2022), 1–45.
  • Zhou et al. (2021) Linqi Zhou, Yilun Du, and Jiajun Wu. 2021. 3d shape generation and completion through point-voxel diffusion. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 5826–5835.