Emotion Loss Attacking: Adversarial Attack Perception for Skeleton based on Multi-dimensional Features

Feng Liu
East China Normal University
[email protected]
Qing Xu
Beijing University of Posts and Telecommunications
[email protected]
Qijian Zheng
East China Normal University
[email protected]

Abstract

Adversarial attacks on skeletal motion are an active research topic. However, existing work considers only part of the dynamic features when measuring the distance between skeleton graph sequences, which results in poor imperceptibility. To this end, we propose a novel adversarial attack method against action recognizers for skeletal motions. First, our method defines a dynamic distance function to measure the difference between skeletal motions. We also introduce emotional features as complementary information. In addition, we use the Alternating Direction Method of Multipliers (ADMM) to solve the constrained optimization problem, which generates adversarial samples with better imperceptibility to deceive the classifiers. Experiments show that our method is effective on multiple action classifiers and datasets. When the perturbation magnitude measured by $l$ norms is the same, the dynamic perturbations generated by our method are much lower than those of other methods. Moreover, we are the first to demonstrate the effectiveness of emotional features, and we provide a new idea for measuring the distance between skeletal motions.

keywords:

adversarial attack, skeleton, optimization, emotion loss attacking, computational perception

1 Introduction

Affective computing[1] is one of the most active areas of current AI research. It includes, but is not limited to, facial emotion recognition[2], speech emotion recognition[3][4], gesture emotion recognition[5], multimodal emotion recognition[6][7], personality recognition[8] based on dynamic expression recognition[9], and other related technologies[10]. As skeleton-based action recognition becomes increasingly popular, its robustness in practical application scenarios has attracted extensive attention and exploration[11, 12, 13]. Research on adversarial attacks has revealed that deep learning methods are vulnerable to carefully devised data perturbations. Adversarial attacks on static data such as images and text have been widely studied[14], but research on time-series data, especially skeleton data, remains relatively scarce and immature[15, 16, 17]. Adversarial attacks on skeleton sequences face two main challenges, low redundancy and perceptual sensitivity, which distinguish them from attacks on static data and other time-series data. A skeletal motion is composed of joints and the physical relationships among them. The action domain of skeletal motions is therefore limited, and the imperceptibility requirements for adversarial samples are more stringent. A skeletal motion usually has fewer than 100 degrees of freedom (DoFs), far fewer than images or meshes. Moreover, any sparsity-based perturbation (on a single joint or a single frame) greatly affects the dynamics, leading to jitter or bone-length violations, and is very obvious to an observer.

Typical attack methods for classification networks can be divided into gradient-based and optimization-based methods. Most of them target images[18], which have a Euclidean structure. Skeletal motion, however, has a non-Euclidean structure, so the distance between skeletal motions cannot be measured simply by $l_0$, $l_2$ or similar norms. Existing studies[19, 20] do not take this into account and use only the perturbation magnitude as the metric for imperceptibility[20], which is unreasonable for skeletal motions. Therefore, we comprehensively take the unique dynamics of skeletal motion into account when generating adversarial samples.

Research on emotion recognition is also becoming increasingly popular, including speech-based[21], text-based[22], multimodal[23, 24] and action-based emotion recognition[25]. Previous studies have revealed that emotion is reflected in action. Moreover, while the dynamic features mentioned earlier measure the visual difference between two samples, emotional features can reveal the logical relationship between skeleton joints. Therefore, we introduce emotional features as one of the indicators for measuring the difference between samples and explore their impact.

To this end, we propose an adversarial attack method for skeletal motions. First, we define a novel distance function based on multi-dimensional features that considers dynamics and, for the first time, introduces emotional features. The distance considers spatial dynamics, such as bone length and bone angle, and temporal dynamics, such as speed. In addition, we use the Alternating Direction Method of Multipliers (ADMM) to ensure imperceptibility. We evaluate the proposed attack on three state-of-the-art models. Extensive evaluations show that it can achieve a 100% success rate with almost no violation of the constraints mentioned above. To summarize, the contributions of this paper are as follows:

(1) We define a novel distance function based on multi-dimensional features to measure skeletal motions, which includes a dynamic distance and incorporates emotional features as complementary information.

(2) We propose an effective optimization algorithm based on the Alternating Direction Method of Multipliers (ADMM) to solve the primal constrained problem, which generates adversarial skeletal motion samples with as little perturbation as possible.

(3) We evaluate our method on multiple state-of-the-art models and multiple datasets and verify that the adversarial samples it generates successfully deceive the models with smaller perturbations and better imperceptibility.

Figure 1: Visual comparison. The green joints represent the original sample and the red ones represent the adversarial sample. (a) shows the original sample, (b) shows attack results of C&W, (c) shows attack results of SMART, (d) shows attack results of our method.

2 Methodology

Given a skeleton sample $x$, let $l$ be the label predicted for $x$ by a trained classifier. We denote by $\Theta(x)$ the per-class scores before the softmax layer and by $F$ the softmax function. We aim to find the minimum perturbation that, added to the original sample, yields an adversarial sample $x^{\prime}$ with $F(\Theta(x))\neq F(\Theta(x^{\prime}))$. The problem can be formulated as:

\begin{split}\min\quad&D(x,x^{\prime})\\ \text{subject to}\quad&F(\Theta(x^{\prime}))=l^{\prime},\;x^{\prime}\in[0,1]^{n}\end{split}

(1)

where $D$ is the distance function between the original sample and the adversarial sample, $l^{\prime}$ is the predicted label of the adversarial sample $x^{\prime}$, and $l^{\prime}\neq l$ is a hard constraint. The hard classification constraint is defined according to the attack mode. In addition, we use emotional features as a supplementary representation of skeletal motion.

2.1 Skeleton-based dynamic distance

For a motion $h\{m\}=\{m_{0},m_{1},\cdots,m_{t}\}$, the frame $m_{t}$ at time $t$ consists not only of the 3D coordinates of the joints but also of the topology that connects them. A skeleton carries dynamic information, and we represent its dynamics from spatial and temporal perspectives.

From the spatial perspective, we need both static and dynamic information. To ensure imperceptibility, bone lengths should remain the same in adversarial samples, so we use bone length as the static spatial constraint. A bone connects two joints, so it can be represented as a vector $B=(x_{s}-x_{t},\,y_{s}-y_{t},\,z_{s}-z_{t})$, where $(x_{s},y_{s},z_{s})$ are the coordinates of the source joint and $(x_{t},y_{t},z_{t})$ those of the target joint. The length of the $i$-th bone at the $t$-th frame is then $B_{i}^{t}=\sqrt{(x_{s}-x_{t})^{2}+(y_{s}-y_{t})^{2}+(z_{s}-z_{t})^{2}}$. Ideally, each bone has the same length in the original sample and the adversarial sample. The deviation is measured as $b(x,x^{\prime})=\lvert B_{i}^{t}-B_{i}^{\prime t}\rvert/B_{i}^{t}$, where $B_{i}^{t}$ and $B_{i}^{\prime t}$ are the bone lengths in the original sample and the adversarial sample, respectively.
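The bone-length term can be illustrated with a minimal NumPy sketch. The array layout (frames, joints, 3), the four-joint chain `BONES`, the averaging over frames and bones, and the small `eps` guard are assumptions made for illustration; the paper defines the per-bone ratio only.

```python
import numpy as np

# Hypothetical 4-joint chain; each bone is a (source, target) joint index pair.
BONES = [(0, 1), (1, 2), (2, 3)]

def bone_lengths(motion, bones=BONES):
    """Bone lengths with shape (frames, bones) for a (frames, joints, 3) array."""
    src = motion[:, [s for s, _ in bones], :]
    tgt = motion[:, [t for _, t in bones], :]
    return np.linalg.norm(src - tgt, axis=-1)

def bone_length_deviation(x, x_adv, bones=BONES, eps=1e-8):
    """Mean of |B - B'| / B over all frames and bones (averaging is an assumption)."""
    b = bone_lengths(x, bones)
    b_adv = bone_lengths(x_adv, bones)
    return float(np.mean(np.abs(b - b_adv) / (b + eps)))

# Toy motion: a straight chain with unit-length bones, repeated for 2 frames.
frame = np.array([[0.0, 0, 0], [1.0, 0, 0], [2.0, 0, 0], [3.0, 0, 0]])
x = np.stack([frame, frame])
x_adv = x.copy()
x_adv[:, 3, 0] += 0.1  # stretch only the last bone by 10%

bone_dev = bone_length_deviation(x, x_adv)
print(bone_dev)
```

Only one of the three bones changes, by 10%, so the averaged deviation is one third of 0.1.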

We use the angle between bones to measure how the skeleton changes. Every two connected bones form an angle, and a change in that angle reflects the extent of bone rotation, so we introduce an angle constraint into the skeleton dynamic distance. To avoid gradient explosion, we compute the change of the $i$-th angle at frame $t$, $A_{i}^{t}$, following [12]. As with bone length, we use a function $a(x,x^{\prime})$ to measure changes in angles.
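A rough sketch of an angle-deviation measure is given below. It uses a plain clipped arccos rather than the gradient-safe formulation of [12], which is not reproduced here; the joint triples, the relative-change form (chosen by analogy with the bone-length term) and the `eps` guards are all assumptions.

```python
import numpy as np

def joint_angles(motion, triples):
    """Angle in radians at the middle joint of each (a, b, c) triple, per frame."""
    a, b, c = (motion[:, [t[k] for t in triples], :] for k in range(3))
    u, v = a - b, c - b  # the two bones meeting at joint b
    cos = np.sum(u * v, axis=-1) / (
        np.linalg.norm(u, axis=-1) * np.linalg.norm(v, axis=-1) + 1e-8
    )
    # Clipping keeps arccos defined when rounding pushes |cos| above 1.
    return np.arccos(np.clip(cos, -1.0, 1.0))

def angle_deviation(x, x_adv, triples, eps=1e-8):
    """Mean relative change of the angles (assumed form, mirroring b(x, x'))."""
    ang = joint_angles(x, triples)
    ang_adv = joint_angles(x_adv, triples)
    return float(np.mean(np.abs(ang - ang_adv) / (ang + eps)))

# One frame, three joints: a right angle at joint 1.
x = np.array([[[1.0, 0, 0], [0.0, 0, 0], [0.0, 1.0, 0]]])
x_adv = x.copy()
x_adv[0, 2] = [1.0, 1.0, 0.0]  # the angle at joint 1 shrinks to 45 degrees
triples = [(0, 1, 2)]

angle_dev = angle_deviation(x, x_adv, triples)
print(angle_dev)
```

Halving the angle from 90 to 45 degrees gives a relative deviation of about 0.5.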

From the temporal perspective, adversarial samples must remain temporally smooth, so we introduce joint speed as a temporal measure. We estimate the speed of the $i$-th joint at frame $t$ by the Euclidean distance between its positions in two consecutive frames: $S_{i}^{t}=\sqrt{(x_{i}^{t+1}-x_{i}^{t})^{2}+(y_{i}^{t+1}-y_{i}^{t})^{2}+(z_{i}^{t+1}-z_{i}^{t})^{2}}$. The deviation is measured as $s(x,x^{\prime})=\lvert S_{i}^{t}-S_{i}^{\prime t}\rvert/S_{i}^{t}$, where $S_{i}^{t}$ and $S_{i}^{\prime t}$ are the joint speeds in the original sample and the adversarial sample, respectively. As with the spatial constraints, $\varepsilon_{s}$ denotes the maximum allowed change.
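The speed term can be sketched in the same way. The array layout (frames, joints, 3), the averaging over frames and joints, and the `eps` guard for joints that do not move are assumptions for illustration.

```python
import numpy as np

def joint_speeds(motion):
    """Per-joint displacement between consecutive frames, shape (frames - 1, joints)."""
    return np.linalg.norm(motion[1:] - motion[:-1], axis=-1)

def speed_deviation(x, x_adv, eps=1e-8):
    """Mean of |S - S'| / S over frames and joints (averaging is an assumption)."""
    s = joint_speeds(x)
    s_adv = joint_speeds(x_adv)
    return float(np.mean(np.abs(s - s_adv) / (s + eps)))

# One joint moving one unit per frame along the x axis for three frames.
x = np.array([[[0.0, 0, 0]], [[1.0, 0, 0]], [[2.0, 0, 0]]])
x_adv = x.copy()
x_adv[1, 0, 0] = 1.2  # the middle frame is nudged forward

speed_dev = speed_deviation(x, x_adv)
print(speed_dev)
```

The nudge makes the first step 20% faster and the second 20% slower, so a perturbation of a single frame changes the speed at two time steps.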

Algorithm 1 Generating adversarial samples

1: Input: original sample $x$, maximum number of iterations $I$, classification loss function $C$, dynamic distance function $D_{d}$, emotion distance function $D_{e}$
2: Initialization: $x^{\prime}_{0}\leftarrow x$, Lagrangian variable $\lambda$
3: while $i\leq I-1$ do
4: $x^{\prime}(i+1)\leftarrow \arg\min_{x}L(x^{\prime}(i),\lambda(i))$;
5: $l_{d},l_{e},l_{c}\leftarrow D_{d}(x^{\prime}(i+1)),D_{e}(x^{\prime}(i+1)),C(x^{\prime}(i+1))$;
6: $l_{c}\leftarrow\lambda\times l_{c}+\frac{\gamma}{2}(\lvert\lvert l_{c}\rvert\rvert^{2}_{2})$;
7: $loss\leftarrow l_{d}+l_{e}+l_{c}$;
8: Backward($loss$);
9: $\lambda(i+1)\leftarrow\lambda(i)+\gamma C(x^{\prime}(i+1))$;
10: end while

2.2 Classification loss

Untargeted Setting. In the untargeted attack mode, $F(\Theta(x^{\prime}))\neq l$ means that the label predicted by the classifier can be any label other than the ground truth $l$. The class with the maximum score must therefore not be $l$, which can be written as $\max(\Theta(x^{\prime}))>\Theta_{l}(x^{\prime})$. Based on this, we express the classification constraint as $\max(\Theta(x^{\prime}))-\Theta_{l}(x^{\prime})>conf$, where $conf$ is the expected confidence margin of the classifier's wrong prediction. Note that inequality constraints in the original problem impose inequality constraints on the corresponding Lagrangian variables in the dual problem. We therefore convert the inequality constraint to an equality constraint:

\max\big(\Theta_{l}(x^{\prime})-\max(\Theta(x^{\prime}))+conf,\;0\big)=0

(2)

Targeted Setting. In the targeted attack mode, we aim to obtain $F(\Theta(x^{\prime}))=l_{t}$, that is, $\max(\Theta(x^{\prime}))=\Theta_{l_{t}}(x^{\prime})$. We can therefore use $\Theta_{l_{t}}(x^{\prime})-\max_{l\neq l_{t}}(\Theta(x^{\prime}))>conf$ as the classification constraint. Its equality form is:

\max(\Theta(x^{\prime}))-\Theta_{l_{t}}(x^{\prime})=0

(3)
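The two margin conditions can be read as hinge-style losses that vanish once the margin $conf$ is reached. The sketch below is one such reading, not the paper's implementation: the function names are hypothetical, the maximum is taken over the other classes, and the targeted loss reuses the hinge and $conf$ of the untargeted case as an assumption.

```python
import numpy as np

def untargeted_loss(logits, true_label, conf=0.0):
    """Zero once the best wrong class beats the true class by at least conf."""
    others = np.delete(logits, true_label)
    return max(logits[true_label] - others.max() + conf, 0.0)

def targeted_loss(logits, target_label, conf=0.0):
    """Zero once the target class beats every other class by at least conf."""
    others = np.delete(logits, target_label)
    return max(others.max() - logits[target_label] + conf, 0.0)

logits = np.array([2.0, 1.0, 0.5])  # pre-softmax scores; class 0 currently wins
print(untargeted_loss(logits, true_label=0, conf=0.5))   # still correctly classified
print(targeted_loss(logits, target_label=0, conf=0.5))   # target already wins with margin
```

A loss of zero means the corresponding equality constraint is satisfied; a positive value is the amount by which the margin is still violated.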

2.3 Emotion loss

In addition to dynamics, skeletal motions may also carry emotional features. We therefore introduce emotions as non-dynamic features for measuring the distance. Specifically, we use the emotion classifier from [26] to extract the emotional features of skeletal motions. The model embeds the skeletal motion sequence into images for training and obtains the predicted emotional features through four group convolutions. Group convolution allows the network to learn independently from different parts of the input, so as to determine the joint interval that has the greatest impact on the final category. We feed the gait-based skeletal motions and the generated adversarial samples into the pre-trained model to obtain the emotion distance loss over non-dynamic features. Denoting the emotional features by $E$, the loss is $e(x,x^{\prime})=\lvert\lvert E(x)-E(x^{\prime})\rvert\rvert$.
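The emotion term only needs a feature extractor $E$ that can be applied to both samples. In the sketch below, `toy_extract` is a hypothetical stand-in (a fixed random linear map) and is not the classifier of [26]; the Euclidean norm is also an assumption, since the formula does not name the norm.

```python
import numpy as np

def emotion_distance(x, x_adv, extract):
    """e(x, x') = ||E(x) - E(x')||, with the extractor E supplied by the caller."""
    return float(np.linalg.norm(extract(x) - extract(x_adv)))

# Stand-in for E: a fixed random linear map to 4 scores (NOT the model of [26]).
rng = np.random.default_rng(0)
W_emotion = rng.standard_normal((4, 2 * 3 * 3))  # 2 frames, 3 joints, 3 coordinates

def toy_extract(motion):
    return W_emotion @ motion.reshape(-1)

x = rng.standard_normal((2, 3, 3))
x_adv = x + 0.01  # small uniform perturbation
print(emotion_distance(x, x_adv, toy_extract))
```

In practice `extract` would wrap a pre-trained emotion classifier and return its feature vector.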

2.4 Optimal dual method

The constrained optimization problem is formulated in Equation 1. The distance function is $D(x,x^{\prime})=b(x,x^{\prime})+a(x,x^{\prime})+s(x,x^{\prime})+e(x,x^{\prime})$. The hard constraint is the classification loss given in Equation 2 and Equation 3. In optimization theory, a constrained optimization problem can be transformed into a corresponding dual problem. Owing to strong duality, the solution of the original problem can be obtained by solving the dual problem. We use the Alternating Direction Method of Multipliers to solve the dual problem. We write the Lagrangian as $L(x,\lambda)=b(x,x^{\prime})+a(x,x^{\prime})+s(x,x^{\prime})+e(x,x^{\prime})+\lambda C(l,l^{\prime})+\frac{\gamma}{2}\lvert\lvert C(l,l^{\prime})\rvert\rvert_{2}^{2}$, where $\lambda$ is the Lagrange multiplier, $\gamma$ is the penalty parameter and $C$ is the classification loss. To find a local optimum efficiently, we use the Adam optimizer, which typically converges faster than vanilla SGD. The process of generating adversarial samples is described in Algorithm 1.
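The alternation between a primal step and a multiplier update can be sketched on a toy problem. Everything below is a simplification under stated assumptions: a linear classifier `W` replaces the action recognizer, the squared $l_2$ distance replaces $D$, a single plain gradient step replaces Adam, and keeping the closest misclassified iterate is an addition that Algorithm 1 does not contain.

```python
import numpy as np

def margin(W, x_adv, label, conf):
    """Hinge constraint c(x') and its gradient for a linear classifier W @ x'."""
    logits = W @ x_adv
    masked = np.where(np.arange(len(logits)) == label, -np.inf, logits)
    rival = int(np.argmax(masked))  # strongest class other than `label`
    c = max(logits[label] - logits[rival] + conf, 0.0)
    grad = (W[label] - W[rival]) if c > 0 else np.zeros_like(x_adv)
    return c, grad

def attack(W, x, label, conf=0.1, gamma=1.0, lr=0.01, iters=300):
    x_adv, lam, best = x.copy(), 0.0, None
    for _ in range(iters):
        c, grad_c = margin(W, x_adv, label, conf)
        # Gradient of L = ||x' - x||^2 + lam * c + (gamma / 2) * c^2 w.r.t. x'.
        grad = 2.0 * (x_adv - x) + (lam + gamma * c) * grad_c
        x_adv = x_adv - lr * grad          # one descent step (stand-in for Adam)
        c, _ = margin(W, x_adv, label, conf)
        lam += gamma * c                   # multiplier update
        fooled = int(np.argmax(W @ x_adv)) != label
        if fooled and (best is None
                       or np.linalg.norm(x_adv - x) < np.linalg.norm(best - x)):
            best = x_adv.copy()            # keep the closest misclassified iterate
    return best

W = np.eye(2)                 # two classes, score = coordinate value
x = np.array([1.0, 0.0])      # classified as class 0
x_adv = attack(W, x, label=0)
print(np.argmax(W @ x_adv), np.linalg.norm(x_adv - x))
```

Because the hinge constraint is never negative, the multiplier in this sketch only grows; it keeps increasing the pressure on the classification term until the sample crosses the decision boundary, after which the distance term pulls it back towards $x$.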

Table 1: The results of our method with untargeted attack mode on NTU RGB+D.

Models	$\gamma$	NTU RGB+D CV					NTU RGB+D CS
		$\triangle$ B/B	$\triangle$ A/A	$\triangle$ S/S	SR	l2	$\triangle$ B/B	$\triangle$ A/A	$\triangle$ S/S	SR	l2
HCN	0.1	0.9%	4.2%	3.2%	100%	0.26	0.8%	4.1%	3.3%	100%	0.24
	1.0	1.3%	6.7%	4.2%	100%	0.21	1.3%	6.7%	4.2%	100%	0.21
	10.0	2.4%	15.1%	7.5%	100%	0.22	2.3%	13.7%	7.1%	100%	0.20
2s AGCN	0.1	0.4%	2.5%	1.8%	100%	0.07	0.6%	2.8%	1.6%	100%	0.06
	1.0	0.6%	2.7%	2.2%	100%	0.08	0.5%	2.1%	1.9%	100%	0.07
	10.0	2.3%	12.0%	6.9%	100%	0.15	1.4%	7.8%	4.1%	100%	0.12
SGN	0.1	0.6%	3.1%	2.1%	100%	0.15	0.8%	3.0%	1.8%	100%	0.14
	1.0	0.9%	4.8%	2.4%	100%	0.12	0.9%	4.9%	1.7%	100%	0.12
	10.0	2.6%	12.6%	7.5%	100%	0.18	1.8%	10.6%	4.5%	100%	0.13

3 Experiments

3.1 Models and datasets

Each skeleton in NTU RGB+D consists of 25 joints. The original paper recommends two benchmarks: (1) Cross-subject: the subjects in the training set and the validation set are different. (2) Cross-view: the training set is captured by cameras 2 and 3 and the validation set by camera 1. Each skeleton in Kinetics-400[27] contains 18 joints. The adversarial samples in the following experiments are generated on the validation sets of the two datasets.

We select HCN[28], 2s AGCN[29] and SGN[30] and investigate their vulnerability under different scenarios. HCN[28] achieved state-of-the-art performance before GCN-related work, and 2s AGCN[29] is an effective GCN-based model. SGN[30] introduces semantics for the first time and achieves strong performance.

3.2 Evaluation metrics

For an adversarial attack method, effectiveness refers to the extent to which the method can provide "successful" adversarial samples. On this premise, we evaluate the quality of adversarial samples from two aspects: misclassification and imperceptibility.

(1) Misclassification refers to the degree of deception of adversarial samples, reflected in the following indicators.

Attack Success Rate (SR) is the proportion of adversarial samples that are wrongly classified (in untargeted mode) or classified as the specified class (in targeted mode). $SR_{UA}=\frac{1}{N}\sum_{j=1}^{N}sum(F(x^{\prime})\neq l)$ is used in untargeted mode and $SR_{TA}=\frac{1}{N}\sum_{j=1}^{N}sum(F(x^{\prime})=l_{t})$ in targeted mode, where $l$ is the true label and $l_{t}$ is the target label.
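Both success rates reduce to a mean over label comparisons. The helper names and the toy predictions below are hypothetical and serve only to illustrate the two definitions.

```python
import numpy as np

def success_rate_untargeted(pred_adv, true_labels):
    """Fraction of adversarial samples whose prediction differs from the true label."""
    return float(np.mean(np.asarray(pred_adv) != np.asarray(true_labels)))

def success_rate_targeted(pred_adv, target_label):
    """Fraction of adversarial samples predicted as the target label."""
    return float(np.mean(np.asarray(pred_adv) == target_label))

pred_adv = [1, 2, 0, 3]      # classifier outputs on four adversarial samples
true_labels = [0, 2, 0, 1]   # ground-truth labels of the original samples

sr_ua = success_rate_untargeted(pred_adv, true_labels)
sr_ta = success_rate_targeted(pred_adv, target_label=3)
print(sr_ua, sr_ta)
```

Two of the four predictions differ from the true label, and one equals the target class.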

(2) Imperceptibility. We define four evaluation metrics for the original sample $x$ and the adversarial sample $x^{\prime}$: the average deviation percentage of bone length $\triangle\textbf{B/B}=\frac{\sum_{i=1}^{N}\sum_{j=1}^{M}\lvert B_{j}^{i}-B_{j}^{\prime i}\rvert/B_{j}^{i}}{N\times M}$, the average deviation percentage of bone angle $\triangle\textbf{A/A}$, the deviation percentage of joint speed $\triangle\textbf{S/S}=\frac{\sum_{j=1}^{N}\lvert\lvert x_{s}-x_{s}^{\prime}\rvert\rvert_{2}}{F\times N\times O}$, and the $l_2$ distance between the original sample and the adversarial sample $\textbf{l2}=\frac{\sum_{j=1}^{N}\lvert\lvert x-x^{\prime}\rvert\rvert_{2}}{F\times N}$, where $N$ is the total number of adversarial samples, $F$ is the total number of frames in a motion, and $O$ and $M$ are the numbers of joints and bones in a skeleton, respectively.
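As an illustration of how the averaged $l_2$ metric is normalized, the sketch below assumes that the samples are stacked in arrays of shape (N, F, joints, 3) and that all motions have the same number of frames; the function name is hypothetical.

```python
import numpy as np

def mean_l2(originals, adversarials):
    """Sum of per-sample l2 norms of x - x', divided by F * N."""
    n, f = originals.shape[:2]
    diffs = (originals - adversarials).reshape(n, -1)
    return float(np.sum(np.linalg.norm(diffs, axis=1)) / (f * n))

# Two samples, two frames, one joint; only the first sample is perturbed.
originals = np.zeros((2, 2, 1, 3))
adversarials = originals.copy()
adversarials[0, 0, 0] = [3.0, 4.0, 0.0]  # perturbation of norm 5 in one frame

l2_value = mean_l2(originals, adversarials)
print(l2_value)
```

The single perturbation of norm 5 is spread over two samples and two frames, which gives 5 / 4.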

3.3 Attack results

Misclassification. The quantitative results for the untargeted attack mode are shown in Table 1. Our method achieves high success rates across different datasets and target models. Results for the targeted attack mode are shown in Table 2. Turning 'reading' into 'writing' would not be surprising, so we choose 'drinking water' as the target class because it differs clearly from the original motions. For the classifiers 2s AGCN and SGN, the adversarial samples successfully deceive both models, but more perturbation must be added to deceive SGN than 2s AGCN. This suggests that, even though the SGN network is relatively simple, its defense capability exceeds that of 2s AGCN. It also suggests that semantics can greatly enhance a network's ability to learn the logical and dynamic features of skeletal motions.

Table 2: The results of our method on targeted attack mode on NTU RGB+D with cross view setting.

	$\triangle$ B/B	$\triangle$ A/A	$\triangle$ S/S	SR	l2
HCN	3.1%	15.4%	7.5%	100%	0.62
2s AGCN	1.1%	4.8%	3.1%	100%	0.21
SGN	2.9%	15%	4.2%	100%	0.44
AeS-GCN[5]	10.5%	41.2%	21.9%	73.9%	5.92

Imperceptibility. Our method obtains adversarial samples with much lower perturbation than C&W, whose results are shown in Table 4. This indicates that the proposed dynamic distance is more effective than the $l_2$ distance for skeletal motions. Unlike existing methods, our method uses strict perceptual control as the optimization target to improve imperceptibility. Figure 1 takes an adversarial sample generated against SGN as an example, with 10 frames sampled from the sequence for visual display; we choose SGN because the previous results show that it requires greater perturbations to be attacked successfully. The label of the original sample is 'throwing', and after the attack the predicted label is 'brushing teeth' under the targeted attack mode. When the two samples are compared carefully, differences can be found in some joints. However, when the two samples are played as video sequences, the differences are hard to find. We also find that the perturbations added to adversarial samples are concentrated on the arms and hands, which contain obvious dynamics. This suggests that classifiers should be designed to be less sensitive to such specific parts of the data.

Table 3: The effectiveness of emotional features on HCN on NTU RGB+D with cross view setting.

	$\triangle$ B/B	$\triangle$ A/A	$\triangle$ S/S	SR	l2
Ours w emotion	1.3%	6.7%	4.2%	100%	0.21
Ours w/o emotion	1.2%	7.1%	4.8%	100%	0.24

We also verify the role of emotional features, as shown in Table 3. Emotional features can control the perception of adversarial samples from an overall perspective. However, the improvement is not obvious, which may be due to the accuracy of the emotion recognizer used. It can therefore be expected that, as research on emotion recognition deepens, emotional features will be better integrated.

Table 4: The results of C&W on HCN on NTU RGB+D.

		$\triangle$ B/B	$\triangle$ A/A	SR	l2
Untargeted	NTU CV	4.67%	24.1%	100%	0.28
Untargeted	NTU CS	4.09%	21.1%	100%	0.24
Targeted	NTU CV	8.8%	46.8%	100%	0.51
Targeted	NTU CS	9.5%	50.7%	100%	0.52

4 Conclusion

To summarize, in order to explore the vulnerability of skeleton-based action recognizers, we propose a novel attack method based on multi-dimensional features. We fuse the dynamics and emotional features of skeletal motions and generate successful adversarial samples based on ADMM. Extensive experiments show that our method produces smaller perturbations and better imperceptibility than other methods. Our method is effective on multiple datasets and state-of-the-art models. In the future, we will systematically study how to improve the defense ability of models against such attacks.

Acknowledgements

This work was supported by the Beijing Key Laboratory of Behavior and Mental Health in the School of Psychological and Cognitive Sciences, Peking University. We would also like to thank the colleagues in the CiL lab at East China Normal University for their efforts on this project, and the reviewers for their time and hard work.

References

[1] Rosalind W Picard. Affective computing. MIT press, 2000.
[2] Jiahao Zhang, Feng Liu, and Aimin Zhou. Off-tanet: a lightweight neural micro-expression recognizer with optical flow features and integrated attention mechanism. In Pacific Rim International Conference on Artificial Intelligence, pages 266–279. Springer, 2021.
[3] Feng Liu, Si-Yuan Shen, Zi-Wang Fu, Han-Yang Wang, Ai-Min Zhou, and Jia-Yin Qi. Lgcct: A light gated and crossed complementation transformer for multimodal speech emotion recognition. Entropy, 24(7):1010, 2022.
[4] Siyuan Shen, Feng Liu, and Aimin Zhou. Mingling or misalignment? temporal shift for speech emotion recognition with pre-trained representations. In ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5, 2023.
[5] Qing Xu, Feng Liu, Ziwang Fu, Aimin Zhou, and Jiayin Qi. Aes-gcn: Attention-enhanced semantic-guided graph convolutional networks for skeleton-based action recognition. Computer Animation and Virtual Worlds, 33(3-4):e2070, 2022.
[6] Ziwang Fu, Feng Liu, Qing Xu, Xiangling Fu, and Jiayin Qi. Lmr-cbt: Learning modality-fused representations with cb-transformer for multimodal emotion recognition from unaligned multimodal sequences. Frontiers of Computer Science, 18(4):184314, 2024.
[7] Feng Liu, Ziwang Fu, Yunlong Wang, and Qijian Zheng. Tacfn: Transformer-based adaptive cross-modal fusion network for multimodal emotion recognition. CAAI Artificial Intelligence Research, 2, 2023.
[8] Feng Liu, Han-Yang Wang, Si-Yuan Shen, Xun Jia, Jing-Yi Hu, Jia-Hao Zhang, Xi-Yi Wang, Ying Lei, Ai-Min Zhou, Jia-Yin Qi, et al. Opo-fcm: A computational affection based occ-pad-ocean federation cognitive modeling approach. IEEE Transactions on Computational Social Systems, 2022.
[9] Hanyang Wang, Bo Li, Shuang Wu, Siyuan Shen, Feng Liu, Shouhong Ding, and Aimin Zhou. Rethinking the learning paradigm for dynamic facial expression recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 17958–17968, June 2023.
[10] Feng Liu, Hanyang Wang, Jiahao Zhang, Ziwang Fu, Aimin Zhou, Jiayin Qi, and Zhibin Li. Evogan: An evolutionary computation assisted gan. Neurocomputing, 469:81–90, 2022.
[11] Jian Liu, Naveed Akhtar, and Ajmal Mian. Adversarial attack on skeleton-based human action recognition. IEEE Transactions on Neural Networks and Learning Systems, 2020.
[12] Tianhang Zheng, Sheng Liu, Changyou Chen, Junsong Yuan, Baochun Li, and Kui Ren. Towards understanding the adversarial vulnerability of skeleton-based action recognition. arXiv preprint arXiv:2005.07151, 2020.
[13] Xiuxiu Bai, Ming Yang, and Zhe Liu. On the robustness of skeleton detection against adversarial attacks. Neural Networks, 132:416–427, 2020.
[14] Chaowei Xiao, Dawei Yang, Bo Li, Jia Deng, and Mingyan Liu. Meshadv: Adversarial meshes for visual recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6898–6907, 2019.
[15] Fazle Karim, Somshubra Majumdar, and Houshang Darabi. Adversarial attacks on time series. IEEE transactions on pattern analysis and machine intelligence, 2020.
[16] Hassan Ismail Fawaz, Germain Forestier, Jonathan Weber, Lhassane Idoumghar, and Pierre-Alain Muller. Adversarial attacks on deep neural networks for time series classification. In 2019 International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE, 2019.
[17] Shuai Jia, Chao Ma, Yibing Song, and Xiaokang Yang. Robust tracking against adversarial attacks. In European Conference on Computer Vision, pages 69–84. Springer, 2020.
[18] Chaowei Xiao, Bo Li, Jun-Yan Zhu, Warren He, Mingyan Liu, and Dawn Song. Generating adversarial examples with adversarial networks. arXiv preprint arXiv:1801.02610, 2018.
[19] Yunfeng Diao, Tianjia Shao, Yong-Liang Yang, Kun Zhou, and He Wang. Basar: Black-box attack on skeletal action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7597–7607, 2021.
[20] He Wang, Feixiang He, Zhexi Peng, Tianjia Shao, Yong-Liang Yang, Kun Zhou, and David Hogg. Understanding the robustness of skeleton-based action recognition under adversarial attack. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14656–14665, 2021.
[21] Edmilson da Silva Morais, Ron Hoory, Weizhong Zhu, Itai Gat, Matheus da Silva Damasceno, and Hagai Aronowitz. Speech emotion recognition using self-supervised features. 2021.
[22] Yuanchao Li, Peter Bell, and Catherine Lai. Fusing asr outputs in joint training for speech emotion recognition. arXiv preprint arXiv:2110.15684, 2021.
[23] Pengfei Liu, Kun Li, and Helen Meng. Group gated fusion on attention-based bidirectional alignment for multimodal emotion recognition. arXiv preprint arXiv:2201.06309, 2022.
[24] Ziwang Fu, Feng Liu, Hanyang Wang, Siyuan Shen, Jiahao Zhang, Jiayin Qi, Xiangling Fu, and Aimin Zhou. Lmr-cbt: Learning modality-fused representations with cb-transformer for multimodal emotion recognition from unaligned multimodal sequences. arXiv preprint arXiv:2112.01697, 2021.
[25] Chuanfei Hu, Weijie Sheng, Bo Dong, and Xinde Li. Tntc: two-stream network with transformer-based complementarity for gait-based emotion recognition. arXiv preprint arXiv:2110.13708, 2021.
[26] Venkatraman Narayanan, Bala Murali Manoghar, Vishnu Sashank Dorbala, Dinesh Manocha, and Aniket Bera. Proxemo: Gait-based emotion learning and multi-view proxemic fusion for socially-aware robot navigation. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 8200–8207. IEEE, 2020.
[27] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.
[28] Chao Li, Qiaoyong Zhong, Di Xie, and Shiliang Pu. Co-occurrence feature learning from skeleton data for action recognition and detection with hierarchical aggregation. arXiv preprint arXiv:1804.06055, 2018.
[29] Lei Shi, Yifan Zhang, Jian Cheng, and Hanqing Lu. Two-stream adaptive graph convolutional networks for skeleton-based action recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12026–12035, 2019.
[30] Pengfei Zhang, Cuiling Lan, Wenjun Zeng, Junliang Xing, Jianru Xue, and Nanning Zheng. Semantics-guided neural networks for efficient skeleton-based human action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1112–1121, 2020.