Emotion Loss Attacking: Adversarial Attack Perception for Skeleton based on Multi-dimensional Features

Feng Liu
East China Normal University
Qing Xu
Beijing University of Posts and Telecommunications
Qijian Zheng
East China Normal University
Abstract

Adversarial attacks on skeletal motion are an active research topic. However, existing work considers only part of the dynamic features when measuring the distance between skeleton graph sequences, which results in poor imperceptibility. To this end, we propose a novel adversarial attack method against action recognizers for skeletal motions. First, our method systematically defines a dynamic distance function to measure the difference between skeletal motions. In addition, we introduce emotional features as complementary information. We then use the Alternating Direction Method of Multipliers (ADMM) to solve the constrained optimization problem, which generates adversarial samples with better imperceptibility to deceive the classifiers. Experiments show that our method is effective on multiple action classifiers and datasets. When the perturbation magnitude measured by l norms is the same, the dynamic perturbations generated by our method are much lower than those of other methods. Moreover, we are the first to demonstrate the effectiveness of emotional features, and we provide a new idea for measuring the distance between skeletal motions.

keywords:
adversarial attack, skeleton, optimization, emotion loss attacking, computational perception

1 Introduction

Affective computing[1] is one of the hotspots in today's AI research. It includes, but is not limited to, facial emotion recognition[2], speech emotion recognition[3][4], gesture emotion recognition[5], multimodal emotion recognition[6][7], personality recognition[8] based on dynamic expression recognition[9], and other related technologies[10]. As skeleton-based action recognition becomes more and more popular, its robustness in practical application scenarios has attracted extensive attention and exploration[11, 12, 13]. Research on adversarial attacks has revealed that deep learning methods are vulnerable to carefully devised data perturbations. Adversarial attacks on static data such as images and text have been widely studied[14], but research on time-series data, especially skeleton data, is relatively scarce and immature[15, 16, 17]. Adversarial attacks on skeleton sequences mainly face two challenges, low redundancy and perceptual sensitivity, which distinguish them from attacks on static data and other time-series data. A skeletal motion is composed of joints and the physical relationships among joints. That means the action domain of skeletal motions is limited, and the imperceptibility requirements for adversarial samples become more stringent. A skeletal motion usually has fewer than 100 degrees of freedom (DoFs), far fewer than images or meshes. Moreover, any sparsity-based perturbation (on a single joint or a single frame) greatly affects the dynamics (leading to jitter or bone-length violation) and is very obvious to an observer.

There are many typical attack methods for classification networks, which can be divided into gradient-based and optimization-based methods. These methods mostly target images[18], which have a Euclidean structure. Skeletal motion, however, has a non-Euclidean structure. That means the distance between skeletal motions cannot simply be measured by l0, l2 and similar norms. Nevertheless, existing research[19, 20] does not consider this and uses only the perturbation magnitude as the metric for judging imperceptibility[20], which is unreasonable for skeletal motions. Therefore, we comprehensively take the unique dynamics into consideration when generating adversarial samples.

Research on emotion recognition is becoming more and more popular, including speech-based emotion recognition[21], text-based emotion recognition[22], multimodal emotion recognition[23, 24] and action-based emotion recognition[25]. Previous studies have revealed that emotion can be reflected in action. Moreover, while the dynamic features mentioned earlier measure the visual difference between two samples, emotional features can reveal the logical relationship between skeleton joints. Therefore, we introduce emotional features as one of the indicators for measuring the difference between samples, and explore their impact.

To this end, in this paper we propose an adversarial attack method for skeletal motions. First, we define a novel distance function based on multi-dimensional features that considers dynamics and, for the first time, introduces emotional features. The distance considers spatial dynamics such as bone length and bone angle, and temporal dynamics such as speed. In addition, we use the Alternating Direction Method of Multipliers (ADMM) to ensure imperceptibility. Our proposed attack method is evaluated on three kinds of state-of-the-art models. Extensive evaluations show that our attack method can achieve a 100% success rate with almost no violation of the constraints mentioned above. To summarize, the contributions of this paper are as follows:

(1) We define a novel distance measure for skeletal motions based on multi-dimensional features, which includes a dynamic distance and incorporates emotional features as complementary information.

(2) We propose an effective optimization algorithm based on the Alternating Direction Method of Multipliers (ADMM) to solve the primal constrained problem, which generates adversarial skeletal motion samples with as little perturbation as possible.

(3) We evaluate multiple state-of-the-art models on multiple datasets and verify that the adversarial samples generated by our method successfully deceive the models with smaller perturbations and better imperceptibility.

Figure 1: Visual comparison. The green joints represent the original sample and the red ones represent the adversarial sample. (a) shows the original sample, (b) shows attack results of C&W, (c) shows attack results of SMART, (d) shows attack results of our method.

2 Methodology

Given a skeleton sample $x$, let $l$ be the label predicted for $x$ by a trained classifier. We denote by $\Theta(x)$ the per-class scores before the softmax layer and by $F$ the softmax function. We aim to find the minimum perturbation that, added to the original sample, yields an adversarial sample $x'$ with $F(\Theta(x)) \neq F(\Theta(x'))$. The problem can be formulated as:

$$\begin{split} \min \quad & D(x, x') \\ \text{subject to} \quad & F(\Theta(x')) = l', \; x' \in [0,1]^n \end{split} \tag{1}$$

where $D$ is the distance function between the original sample and the adversarial sample, $l'$ is the predicted label of the adversarial sample $x'$, and $l' \neq l$ is a hard constraint. The hard classification constraint is defined according to the attack mode. In addition, we use emotional features as a supplementary representation of skeletal motion.

2.1 Skeleton-based dynamic distance

For the motion $h\{m\} = \{m_0, m_1, \cdots, m_t\}$, $m_t$ at time $t$ consists of not only the 3D coordinates of the joints but also the connected topological structure. A skeleton has its own dynamic information, and we can represent the dynamics of a skeleton from spatial and temporal aspects.

From the spatial perspective, we need both static and dynamic information. To ensure imperceptibility, bone lengths should remain the same in adversarial samples, so we use bone length as the static spatial constraint. A bone connects two joints, so it can be represented as a vector $B = (x_s - x_t, y_s - y_t, z_s - z_t)$, where $(x_s, y_s, z_s)$ is the coordinate of the source joint and $(x_t, y_t, z_t)$ is the coordinate of the target joint.
The length of the $i$-th bone at the $t$-th frame is then defined as $B_i^t = \sqrt{(x_s - x_t)^2 + (y_s - y_t)^2 + (z_s - z_t)^2}$. Normally, the length of a bone should be the same in the original sample and the adversarial sample.
The deviation can be represented as $b(x, x') = \lvert B_i^t - B_i^{\prime t} \rvert / B_i^t$, where $B_i^t$ and $B_i^{\prime t}$ are the bone lengths in the original sample and the adversarial sample, respectively.
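The bone-length deviation can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the array layout `(frames, joints, 3)`, the `bones` list of `(source, target)` joint indices, the small `eps` guard, and the averaging over all bones and frames are assumptions made here for the example.

```python
import numpy as np

def bone_lengths(motion, bones):
    """motion: (T, J, 3) joint coordinates; bones: list of (source, target) indices.
    Returns a (T, len(bones)) array of bone lengths B_i^t."""
    src = motion[:, [s for s, _ in bones], :]
    tgt = motion[:, [t for _, t in bones], :]
    return np.linalg.norm(src - tgt, axis=-1)

def bone_length_deviation(x, x_adv, bones, eps=1e-8):
    """Mean of |B - B'| / B over all bones and frames (the averaging is an assumption)."""
    B = bone_lengths(x, bones)
    B_adv = bone_lengths(x_adv, bones)
    return float(np.mean(np.abs(B - B_adv) / (B + eps)))

# Toy skeleton: 3 joints in a chain, 2 identical frames, every bone has length 1.
bones = [(0, 1), (1, 2)]
frame = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
x = np.array([frame, frame])
x_adv = x.copy()
x_adv[0, 2] = [1.0, 1.1, 0.0]  # stretch the second bone by 10% in frame 0 only
dev = bone_length_deviation(x, x_adv, bones)
```

In this toy case one of the four bone instances changes by 10%, so the averaged deviation is about 2.5%, even though only a single joint in a single frame was moved.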

We use the angle between bones to measure the change of a skeleton's pose. Every two connected bones form an angle, and the change of an angle reflects the extent of bone rotation, so we introduce an angle constraint into the skeleton dynamic distance. To avoid gradient explosion, we compute the change of the $i$-th angle at frame $t$, $A_i^t$, following [12]. Similar to bone length, we use the function $a(x, x')$ to measure changes of angles.
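The text does not spell out how the angle change is computed, so the following is only a hedged sketch of one possibility. It compares the cosines of the angles rather than the angles themselves, which avoids the unbounded derivative of arccos near 0 and 180 degrees. Whether [12] uses this formulation is not stated here. The names `angle_triples`, `joint_angle_cosines` and `angle_deviation` are hypothetical.

```python
import numpy as np

def joint_angle_cosines(motion, angle_triples, eps=1e-8):
    """motion: (T, J, 3). Each triple (a, b, c) defines the angle at joint b
    between the bones b->a and b->c. Returns (T, len(angle_triples)) cosines."""
    a = motion[:, [i for i, _, _ in angle_triples], :]
    b = motion[:, [j for _, j, _ in angle_triples], :]
    c = motion[:, [k for _, _, k in angle_triples], :]
    u, v = a - b, c - b
    dot = np.sum(u * v, axis=-1)
    norms = np.linalg.norm(u, axis=-1) * np.linalg.norm(v, axis=-1)
    return dot / (norms + eps)

def angle_deviation(x, x_adv, angle_triples):
    """Mean absolute change of the angle cosines (an assumed form of a(x, x'))."""
    diff = joint_angle_cosines(x, angle_triples) - joint_angle_cosines(x_adv, angle_triples)
    return float(np.mean(np.abs(diff)))

# One frame, three joints forming a right angle at joint 1.
x = np.array([[[1.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 1.0, 0.0]]])
x_adv = x.copy()
x_adv[0, 2] = [1.0, 1.0, 0.0]  # the angle at joint 1 shrinks from 90 to 45 degrees
a_dev = angle_deviation(x, x_adv, [(0, 1, 2)])
```

The cosine moves from 0 to roughly 0.707 in this example, so the measure reacts to the rotation of the bone and not to its change in length.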

From the temporal perspective, it is necessary to ensure the temporal smoothness of adversarial samples, so we introduce joint speed as the temporal measurement. We estimate the speed of the $i$-th joint at frame $t$ by the Euclidean distance between its positions in two consecutive frames: $S_i^t = \sqrt{(x_i^{t+1} - x_i^t)^2 + (y_i^{t+1} - y_i^t)^2 + (z_i^{t+1} - z_i^t)^2}$.
The measure is represented as $s(x, x') = \lvert S_i^t - S_i^{\prime t} \rvert / S_i^t$, where $S_i^t$ and $S_i^{\prime t}$ are the joint speeds in the original sample and the adversarial sample, respectively. Similar to the spatial constraints, $\varepsilon_s$ is the maximum allowed change.
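A small NumPy sketch of the speed measure follows, under the same assumed `(frames, joints, 3)` layout. The `eps` guard and the averaging over joints and frames are assumptions of this example. Note that the relative measure is only meaningful for joints that actually move in the original sample, since a static joint has speed zero.

```python
import numpy as np

def joint_speeds(motion):
    """motion: (T, J, 3) -> (T-1, J) speeds S_i^t = ||p_i^{t+1} - p_i^t||."""
    return np.linalg.norm(motion[1:] - motion[:-1], axis=-1)

def speed_deviation(x, x_adv, eps=1e-8):
    """Mean of |S - S'| / S over joints and frame pairs (averaging is an assumption)."""
    S, S_adv = joint_speeds(x), joint_speeds(x_adv)
    return float(np.mean(np.abs(S - S_adv) / (S + eps)))

# One joint moving at constant speed 1 along the x axis for three frames.
x = np.array([[[0.0, 0.0, 0.0]], [[1.0, 0.0, 0.0]], [[2.0, 0.0, 0.0]]])
x_adv = x.copy()
x_adv[1, 0, 0] = 1.2  # perturb the middle frame only
s_dev = speed_deviation(x, x_adv)
```

Perturbing one frame changes two consecutive speeds (from 1 to 1.2 and from 1 to 0.8), which is the kind of jitter a single-frame perturbation introduces.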

Algorithm 1 Generating adversarial samples
1:  Input: original sample $x$, maximum number of iterations $I$, classification loss function $C$, dynamic distance function $D_d$, emotion distance function $D_e$
2:  Initialization: $x'_0 = x$, Lagrangian variable $\lambda$
3:  while $i \leq I - 1$ do
4:     $x'(i+1) = \arg\min_{x} L(x'(i), \lambda(i))$;
5:     $l_d, l_e, l_c = D_d(x'(i+1)), D_e(x'(i+1)), C(x'(i+1))$;
6:     $l_c = \lambda \times l_c + \frac{\gamma}{2}(\lVert l_c \rVert_2^2)$;
7:     $loss = l_d + l_e + l_c$;
8:     $Backward(loss)$;
9:     $\lambda(i+1) = \lambda(i) + \gamma C(x'(i+1))$;
10:  end while

2.2 Classification loss

Untargeted Setting. In the untargeted attack mode, $F(\Theta(x')) \neq l$ means that the label predicted by the classifier can be any label other than the ground truth $l$. That is, the class with the maximum score must not be $l$, which can be represented as $\max(\Theta(x')) > \Theta_l(x')$. Based on this, we can express the classification constraint as $\max(\Theta(x')) - \Theta_l(x') > conf$, where $conf$ is the expected confidence of the classifier's wrong prediction. Note that inequality constraints in the original problem impose inequality constraints on the corresponding Lagrangian variables in the dual problem. We therefore convert the inequality constraint into an equality constraint:

$$\max(\Theta_l(x') - \max(\Theta(x')) + conf) = 0 \tag{2}$$

Targeted Setting. In the targeted attack mode, we aim to obtain $F(\Theta(x')) = l_t$, that is, $\max(\Theta(x')) = \Theta_{l_t}(x')$. We can therefore use $\Theta_{l_t}(x') - \max_{l \neq l_t}(\Theta(x')) > conf$ as the classification constraint. The equality form is:

$$\max(\Theta(x')) - \Theta_{l_t}(x') = 0 \tag{3}$$
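As a hedged illustration of the two classification constraints, the sketch below writes them as hinge-style margin losses on the pre-softmax scores. The explicit clamp at zero and the maximum taken over the other classes follow the common margin-loss convention and are an interpretation made for this example, not a statement of the authors' code. The function names are hypothetical.

```python
import numpy as np

def untargeted_loss(logits, label, conf=0.0):
    """Zero once some other class beats the true class `label` by at least `conf`."""
    others = np.delete(logits, label)
    return float(max(logits[label] - others.max() + conf, 0.0))

def targeted_loss(logits, target, conf=0.0):
    """Zero once the target class beats every other class by at least `conf`."""
    others = np.delete(logits, target)
    return float(max(others.max() - logits[target] + conf, 0.0))

clean_logits = np.array([2.0, 0.5, 1.0])  # predicted class 0
adv_logits = np.array([0.5, 2.0, 1.0])    # predicted class 1

u_clean = untargeted_loss(clean_logits, label=0, conf=0.1)  # constraint violated
u_adv = untargeted_loss(adv_logits, label=0, conf=0.1)      # constraint satisfied
t_clean = targeted_loss(clean_logits, target=1, conf=0.1)
t_adv = targeted_loss(adv_logits, target=1, conf=0.1)
```

Both losses are positive while the constraint is violated and exactly zero once it holds with the requested margin, which is what allows them to be treated as equality constraints.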

2.3 Emotion loss

In addition to dynamics, skeletal motions may also contain emotional features. Therefore, we introduce emotions as non-dynamic features for measuring the distance. Specifically, we use the emotion classifier of [26] to extract the emotional features of skeletal motions. The model embeds the skeletal motion sequence into images for training and obtains the predicted emotional features through four group convolutions. Group convolution allows the network to learn independently from different parts of the input, so as to determine the joint interval that has the greatest impact on the final category. We feed gait-based skeletal motions and the generated adversarial samples into the pre-trained model to obtain the emotion distance loss over non-dynamic features. Denoting the emotional features by $E$, the loss is $e(x, x') = \lVert E(x) - E(x') \rVert$.
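The emotion distance only needs a feature extractor $E$ and a norm. The sketch below uses a fixed random linear map as a purely hypothetical stand-in for the pre-trained emotion model, which is not reproduced here. The choice of the l2 norm is also an assumption, since the text leaves the norm unspecified.

```python
import numpy as np

def make_toy_emotion_extractor(input_dim, feat_dim=4, seed=0):
    """Hypothetical stand-in for a pre-trained emotion model E:
    a fixed random linear projection of the flattened motion."""
    rng = np.random.default_rng(seed)
    P = rng.standard_normal((feat_dim, input_dim))
    return lambda motion: P @ motion.reshape(-1)

def emotion_distance(E, x, x_adv):
    """e(x, x') = ||E(x) - E(x')||, here with the l2 norm (an assumption)."""
    return float(np.linalg.norm(E(x) - E(x_adv)))

x = np.zeros((5, 3, 3))   # 5 frames, 3 joints, 3D coordinates
x_adv = x.copy()
x_adv[2, 1, 0] = 0.05     # small perturbation of one joint in one frame
E = make_toy_emotion_extractor(x.size)
e_same = emotion_distance(E, x, x)
e_adv = emotion_distance(E, x, x_adv)
```

With a real pre-trained network in place of the toy projection, the same function would return the emotion term used in the distance.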

2.4 Optimal dual method

The constrained optimization problem is formulated in Equation 1. $D(x, x')$ is the distance function, with $D(x, x') = b(x, x') + a(x, x') + s(x, x') + e(x, x')$. The hard constraint is the classification loss given in Equation 2 and Equation 3. In optimization theory, the problem of optimizing an objective function under constraints can be transformed into a corresponding dual problem. Owing to strong duality, the solution of the original problem can be obtained by solving the dual problem. We use the Alternating Direction Method of Multipliers to solve the dual problem.
We write the Lagrangian as $L(x, \lambda) = b(x, x') + a(x, x') + s(x, x') + e(x, x') + \lambda C(l, l') + \frac{\gamma}{2} \lVert C(l, l') \rVert_2^2$, where $\lambda$ is the Lagrange multiplier and $C$ is the classification loss. To find a local optimum efficiently, we use the Adam optimizer, because it typically converges faster than vanilla SGD. The process of generating adversarial samples is described in Algorithm 1.
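To make the multiplier update concrete, here is a toy augmented-Lagrangian loop in the spirit of Algorithm 1. Several simplifications are assumed and should not be read as the authors' setup: the classifier is a linear map `W`, the distance $D$ is replaced by the squared l2 distance, plain gradient descent with a hand-written gradient is used instead of Adam and backpropagation, and the inner minimisation is approximated by a fixed number of steps.

```python
import numpy as np

def hinge_and_grad(W, x_adv, label, conf):
    """Untargeted hinge C(x') and its (sub)gradient for a linear classifier W @ x'."""
    logits = W @ x_adv
    others = logits.copy()
    others[label] = -np.inf
    k = int(np.argmax(others))              # strongest competing class
    margin = logits[label] - logits[k] + conf
    if margin > 0:
        return margin, W[label] - W[k]
    return 0.0, np.zeros_like(x_adv)

def augmented_lagrangian_attack(W, x, label, conf=0.1, gamma=1.0,
                                outer_iters=20, inner_iters=50, lr=0.1):
    x_adv, lam = x.copy(), 0.0
    for _ in range(outer_iters):
        # Approximately minimise L = ||x' - x||^2 + lam * C + (gamma / 2) * C^2 over x'.
        for _ in range(inner_iters):
            c, grad_c = hinge_and_grad(W, x_adv, label, conf)
            grad = 2.0 * (x_adv - x) + (lam + gamma * c) * grad_c
            x_adv = x_adv - lr * grad
        # Multiplier update: lam <- lam + gamma * C(x').
        c, _ = hinge_and_grad(W, x_adv, label, conf)
        lam = lam + gamma * c
    return x_adv, lam

W = np.eye(2)                 # toy 2-class linear classifier
x = np.array([1.0, 0.2])      # classified as class 0
x_adv, lam = augmented_lagrangian_attack(W, x, label=0)
```

In this toy problem the iterate moves to the closest point that satisfies the margin constraint, approximately (0.55, 0.65), while the multiplier settles near 0.9. The constraint residual halves at every outer iteration, so the penalty weight $\gamma$ does not need to grow for the constraint to be met.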

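Table 1: The results of our method with untargeted attack mode on NTU RGB+D (CV: cross-view, CS: cross-subject).

| Model | $\gamma$ | CV $\triangle$B/B | CV $\triangle$A/A | CV $\triangle$S/S | CV SR | CV l2 | CS $\triangle$B/B | CS $\triangle$A/A | CS $\triangle$S/S | CS SR | CS l2 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| HCN | 0.1 | 0.9% | 4.2% | 3.2% | 100% | 0.26 | 0.8% | 4.1% | 3.3% | 100% | 0.24 |
| HCN | 1.0 | 1.3% | 6.7% | 4.2% | 100% | 0.21 | 1.3% | 6.7% | 4.2% | 100% | 0.21 |
| HCN | 10.0 | 2.4% | 15.1% | 7.5% | 100% | 0.22 | 2.3% | 13.7% | 7.1% | 100% | 0.20 |
| 2s AGCN | 0.1 | 0.4% | 2.5% | 1.8% | 100% | 0.07 | 0.6% | 2.8% | 1.6% | 100% | 0.06 |
| 2s AGCN | 1.0 | 0.6% | 2.7% | 2.2% | 100% | 0.08 | 0.5% | 2.1% | 1.9% | 100% | 0.07 |
| 2s AGCN | 10.0 | 2.3% | 12.0% | 6.9% | 100% | 0.15 | 1.4% | 7.8% | 4.1% | 100% | 0.12 |
| SGN | 0.1 | 0.6% | 3.1% | 2.1% | 100% | 0.15 | 0.8% | 3.0% | 1.8% | 100% | 0.14 |
| SGN | 1.0 | 0.9% | 4.8% | 2.4% | 100% | 0.12 | 0.9% | 4.9% | 1.7% | 100% | 0.12 |
| SGN | 10.0 | 2.6% | 12.6% | 7.5% | 100% | 0.18 | 1.8% | 10.6% | 4.5% | 100% | 0.13 |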

3 Experiments

3.1 Models and datasets

In NTU RGB+D, each skeleton consists of 25 joints. The original paper recommends two benchmarks: (1) Cross-subject: the subjects in the training set and the validation set are different. (2) Cross-view: the training set is captured by cameras 2 and 3, and the validation set is captured by camera 1. In Kinetics-400[27], each skeleton contains 18 joints. The adversarial samples in the following experiments are generated on the validation sets of the two datasets.

We select HCN[28], 2s AGCN[29] and SGN[30] and investigate their vulnerability under different scenarios. HCN[28] achieved state-of-the-art performance before GCN-based work appeared, and 2s AGCN[29] is an effective GCN-based model. SGN[30] introduces semantics for the first time and achieves strong performance.

3.2 Evaluation metrics

For an adversarial attack method, effectiveness refers to the extent to which the method can provide "successful" adversarial samples. On this premise, we evaluate the quality of adversarial samples from two aspects: misclassification and imperceptibility.

(1) Misclassification refers to the degree of deception of adversarial samples, reflected in the following indicators.

Attack Success Rate (SR), that is, the proportion of adversarial samples that are wrongly classified (in untargeted mode) or classified as the specified class (in targeted mode). $SR_{UA} = \frac{1}{N} \sum_{j=0}^{N} sum(F(x) \neq l_t)$ is used in untargeted mode and $SR_{TA} = \frac{1}{N} \sum_{j=0}^{N} sum(F(x) = l_t)$ is used in targeted mode.
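The two success rates reduce to a mean over a boolean mask. The sketch below assumes the predictions on the adversarial samples are already available as integer class indices. The function name and the `targeted` flag are hypothetical.

```python
import numpy as np

def success_rate(pred_adv, reference, targeted=False):
    """pred_adv: predicted labels on adversarial samples.
    reference: ground-truth labels (untargeted) or target labels (targeted)."""
    pred_adv = np.asarray(pred_adv)
    reference = np.asarray(reference)
    hits = (pred_adv == reference) if targeted else (pred_adv != reference)
    return float(np.mean(hits))

pred_adv = [1, 2, 0, 3]
sr_untargeted = success_rate(pred_adv, reference=[0, 2, 1, 3])               # 2 of 4 fooled
sr_targeted = success_rate(pred_adv, reference=[1, 2, 0, 0], targeted=True)  # 3 of 4 hit target
```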

(2) Imperceptibility. We define four evaluation metrics for the original sample $x$ and the adversarial sample $x'$: the average deviation percentage of bone length, $\triangle\textbf{B/B} = \frac{\sum_{i=0}^{N}\left(\sum_{j=0}^{M}(B_{j}^{i} - B_{j}^{\prime i})/B_{j}^{i}\right)}{N \times M}$; the average deviation percentage of bone angle, $\triangle\textbf{A/A}$; the deviation percentage of joint speed, $\triangle\textbf{S/S} = \frac{\sum_{j=0}^{N}\lVert x_{s} - x_{s}' \rVert_{2}}{F \times N \times O}$; and the l2 distance between the original and adversarial samples, $\textbf{l2} = \frac{\sum_{j=0}^{N}\lVert x - x' \rVert_{2}}{F \times N}$. Here $N$ is the total number of adversarial samples, $F$ is the total number of frames in a motion, and $O$ and $M$ are the total numbers of joints and bones in a skeleton.
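The metrics above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' evaluation code. It assumes skeleton batches stored as arrays of shape `(N, F, O, 3)`, a hypothetical `bones` list of `(parent, child)` joint-index pairs, joint speed approximated by frame-to-frame differences, bone lengths averaged over frames as well as over bones and samples, and an absolute value on the bone-length deviation so that lengthening and shortening do not cancel. The bone-angle metric is omitted because its formula is not spelled out in the text.

```python
import numpy as np

def bone_lengths(x, bones):
    """Per-frame bone lengths; x has shape (N, F, O, 3), result is (N, F, M)."""
    parents = [p for p, _ in bones]
    children = [c for _, c in bones]
    return np.linalg.norm(x[:, :, children] - x[:, :, parents], axis=-1)

def bone_length_deviation(x, x_adv, bones):
    """Average relative bone-length deviation (assumption: absolute value,
    averaged over samples, frames and bones)."""
    b = bone_lengths(x, bones)
    b_adv = bone_lengths(x_adv, bones)
    return float(np.mean(np.abs(b - b_adv) / b))

def l2_distance(x, x_adv):
    """Sum of per-sample l2 norms of x - x', divided by F * N."""
    n, f = x.shape[:2]
    per_sample = np.linalg.norm((x - x_adv).reshape(n, -1), axis=1)
    return float(per_sample.sum() / (f * n))

def speed_deviation(x, x_adv):
    """Sum of per-sample l2 norms of the speed difference, divided by F * N * O.
    Speed is approximated by first-order differences along the frame axis."""
    n, f, o = x.shape[:3]
    v = np.diff(x, axis=1)
    v_adv = np.diff(x_adv, axis=1)
    per_sample = np.linalg.norm((v - v_adv).reshape(n, -1), axis=1)
    return float(per_sample.sum() / (f * n * o))
```

For example, scaling every joint coordinate by 1.1 stretches every bone by 10%, so `bone_length_deviation(x, 1.1 * x, bones)` returns approximately 0.1, while a constant offset added to every joint leaves `speed_deviation` at approximately zero.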

3.3 Attack results

Misclassification. The quantitative results for the untargeted attack mode are shown in Table 1. Our method achieves high success rates across different datasets and target models. Results for the targeted attack mode are shown in Table 2. Turning 'reading' into 'writing' would not be surprising because the two actions are similar, so we choose 'drinking water' as the target class, which differs clearly from the source actions. For the classifiers 2s AGCN and SGN, the adversarial samples successfully deceive both models, but deceiving SGN requires larger perturbations than deceiving 2s AGCN. This shows that even though the SGN network is relatively simple, its defense capability exceeds that of 2s AGCN. It also suggests that semantics can greatly enhance a network's ability to learn the logical and dynamic features of skeletal motions.
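As a rough illustration of how the success rate (SR) differs between the two attack modes, the sketch below uses the conventional definitions: an untargeted attack succeeds when the prediction on the adversarial sample differs from the true label, and a targeted attack succeeds when the prediction equals the chosen target class. The function name `success_rate` and its arguments are hypothetical, and the definitions are an assumption rather than a restatement of the authors' evaluation code.

```python
import numpy as np

def success_rate(pred, true_labels, target=None):
    """Fraction of adversarial samples that fool the classifier.

    pred:        predicted class ids on the adversarial samples
    true_labels: ground-truth class ids of the original samples
    target:      target class id for a targeted attack, or None for untargeted
    """
    pred = np.asarray(pred)
    if target is None:
        # Untargeted: any prediction other than the true label counts as success.
        return float(np.mean(pred != np.asarray(true_labels)))
    # Targeted: only a prediction equal to the target class counts as success.
    return float(np.mean(pred == target))
```

With `pred = [5, 5, 1, 5]` and target class 5, three of the four samples hit the target, so the targeted success rate is 0.75.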

Table 2: The results of our method on targeted attack mode on NTU RGB+D with cross view setting.
| Model | △B/B | △A/A | △S/S | SR | l2 |
| --- | --- | --- | --- | --- | --- |
| HCN | 3.1% | 15.4% | 7.5% | 100% | 0.62 |
| 2s AGCN | 1.1% | 4.8% | 3.1% | 100% | 0.21 |
| SGN | 2.9% | 15% | 4.2% | 100% | 0.44 |
| AeS-GCN[5] | 10.5% | 41.2% | 21.9% | 73.9% | 5.92 |

Imperceptibility. Our method obtains adversarial samples with much lower perturbation than the C&W baseline reported in Table 4. This indicates that the proposed dynamic distance is more effective than the l2 distance for skeletal motions. Unlike existing methods, we use strict perceptual control as the optimization objective to improve imperceptibility. Figure 1 shows an adversarial sample generated against SGN, with 10 frames sampled from the sequence for visual display; we choose SGN because the previous results show that it requires larger perturbations to be attacked successfully. The label of the original sample is 'throwing', and after the attack the predicted label is 'brushing teeth' under the targeted attack mode. Comparing the two samples carefully reveals differences in some joints. However, when the two samples are played as video sequences, the differences are hard to find. We also find that the perturbations added to adversarial samples are concentrated on the arms and hands, which exhibit obvious dynamics. This suggests that classifier design should reduce sensitivity to such specific parts of the data.
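The comparison with C&W can be made concrete with a little arithmetic on the reported numbers. The snippet below is only an illustrative calculation. It assumes that the HCN row of Table 2 (targeted, cross view) and the "Targeted, NTU CV" row of Table 4 are directly comparable, and the dictionary names are hypothetical.

```python
# Values copied from Table 2 (ours, HCN, targeted, cross view)
# and Table 4 (C&W, HCN, targeted, NTU CV). Percentages are kept as plain numbers.
ours_hcn = {"dB/B": 3.1, "dA/A": 15.4, "l2": 0.62}
cw_hcn = {"dB/B": 8.8, "dA/A": 46.8, "l2": 0.51}

# Ratio > 1 means C&W perturbs that quantity more than our method does.
ratios = {name: cw_hcn[name] / ours_hcn[name] for name in ours_hcn}

for name, ratio in ratios.items():
    print(f"{name}: C&W / ours = {ratio:.2f}")
```

Under these assumptions, C&W changes bone lengths and bone angles roughly three times as much as our method, even though its l2 distance is somewhat smaller. This is consistent with the point that l2 alone does not capture the dynamic perturbation.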

Table 3: The effectiveness of emotional features on HCN on NTU RGB+D with cross view setting.
| Method | △B/B | △A/A | △S/S | SR | l2 |
| --- | --- | --- | --- | --- | --- |
| Ours w emotion | 1.3% | 6.7% | 4.2% | 100% | 0.21 |
| Ours w/o emotion | 1.2% | 7.1% | 4.8% | 100% | 0.24 |

We also verify the role of emotional features; the results are shown in Table 3. Emotional features help control the perception of adversarial samples from an overall perspective. However, the improvement is small, which may be due to the limited accuracy of the emotion recognizer used. We therefore expect that, as emotion recognition research advances, emotional features can be integrated more effectively.
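To make the size of the effect concrete, the relative change from "Ours w/o emotion" to "Ours w emotion" can be worked out from the Table 3 values. This is only arithmetic on the reported numbers, not an additional result; a positive value means the emotional features reduce the metric.

```latex
\begin{align*}
\triangle\text{A/A}: &\quad \frac{7.1 - 6.7}{7.1} \approx 5.6\% \\
\triangle\text{S/S}: &\quad \frac{4.8 - 4.2}{4.8} = 12.5\% \\
\text{l2}: &\quad \frac{0.24 - 0.21}{0.24} = 12.5\% \\
\triangle\text{B/B}: &\quad \frac{1.2 - 1.3}{1.2} \approx -8.3\%
\end{align*}
```

The bone-angle, joint-speed and l2 metrics improve by roughly 6% to 13%, while the bone-length deviation is slightly worse, which matches the observation that the overall gain is modest.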

Table 4: The results of C&W on HCN on NTU RGB+D.
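| Attack mode | Setting | △B/B | △A/A | SR | l2 |
| --- | --- | --- | --- | --- | --- |
| Untargeted | NTU CV | 4.67% | 24.1% | 100% | 0.28 |
| Untargeted | NTU CS | 4.09% | 21.1% | 100% | 0.24 |
| Targeted | NTU CV | 8.8% | 46.8% | 100% | 0.51 |
| Targeted | NTU CS | 9.5% | 50.7% | 100% | 0.52 |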

4 Conclusion

To summarize, in order to explore the vulnerability of skeleton-based action recognizers, we propose a novel attack method based on multi-dimensional features. We fuse the dynamic and emotional features of skeletal motions and generate successful adversarial samples using ADMM. Extensive experiments show that our method introduces smaller perturbations and achieves better imperceptibility than other methods. Our method is effective on multiple datasets and state-of-the-art models. In the future, we will systematically study how to improve the ability of models to defend against such attacks.

Acknowledgements

This work was supported by the Beijing Key Laboratory of Behavior and Mental Health in the School of Psychological and Cognitive Sciences, Peking University. We would also like to thank our colleagues in the CiL lab at East China Normal University for their efforts on this project, and the reviewers for their time and hard work.

References

  • [1] Rosalind W Picard. Affective computing. MIT press, 2000.
  • [2] Jiahao Zhang, Feng Liu, and Aimin Zhou. Off-tanet: a lightweight neural micro-expression recognizer with optical flow features and integrated attention mechanism. In Pacific Rim International Conference on Artificial Intelligence, pages 266–279. Springer, 2021.
  • [3] Feng Liu, Si-Yuan Shen, Zi-Wang Fu, Han-Yang Wang, Ai-Min Zhou, and Jia-Yin Qi. Lgcct: A light gated and crossed complementation transformer for multimodal speech emotion recognition. Entropy, 24(7):1010, 2022.
  • [4] Siyuan Shen, Feng Liu, and Aimin Zhou. Mingling or misalignment? temporal shift for speech emotion recognition with pre-trained representations. In ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5, 2023.
  • [5] Qing Xu, Feng Liu, Ziwang Fu, Aimin Zhou, and Jiayin Qi. Aes-gcn: Attention-enhanced semantic-guided graph convolutional networks for skeleton-based action recognition. Computer Animation and Virtual Worlds, 33(3-4):e2070, 2022.
  • [6] Ziwang Fu, Feng Liu, Qing Xu, Xiangling Fu, and Jiayin Qi. Lmr-cbt: Learning modality-fused representations with cb-transformer for multimodal emotion recognition from unaligned multimodal sequences. Frontiers of Computer Science, 18(4):184314, 2024.
  • [7] Feng Liu, Ziwang Fu, Yunlong Wang, and Qijian Zheng. Tacfn: Transformer-based adaptive cross-modal fusion network for multimodal emotion recognition. CAAI Artificial Intelligence Research, 2, 2023.
  • [8] Feng Liu, Han-Yang Wang, Si-Yuan Shen, Xun Jia, Jing-Yi Hu, Jia-Hao Zhang, Xi-Yi Wang, Ying Lei, Ai-Min Zhou, Jia-Yin Qi, et al. Opo-fcm: A computational affection based occ-pad-ocean federation cognitive modeling approach. IEEE Transactions on Computational Social Systems, 2022.
  • [9] Hanyang Wang, Bo Li, Shuang Wu, Siyuan Shen, Feng Liu, Shouhong Ding, and Aimin Zhou. Rethinking the learning paradigm for dynamic facial expression recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 17958–17968, June 2023.
  • [10] Feng Liu, Hanyang Wang, Jiahao Zhang, Ziwang Fu, Aimin Zhou, Jiayin Qi, and Zhibin Li. Evogan: An evolutionary computation assisted gan. Neurocomputing, 469:81–90, 2022.
  • [11] Jian Liu, Naveed Akhtar, and Ajmal Mian. Adversarial attack on skeleton-based human action recognition. IEEE Transactions on Neural Networks and Learning Systems, 2020.
  • [12] Tianhang Zheng, Sheng Liu, Changyou Chen, Junsong Yuan, Baochun Li, and Kui Ren. Towards understanding the adversarial vulnerability of skeleton-based action recognition. arXiv preprint arXiv:2005.07151, 2020.
  • [13] Xiuxiu Bai, Ming Yang, and Zhe Liu. On the robustness of skeleton detection against adversarial attacks. Neural Networks, 132:416–427, 2020.
  • [14] Chaowei Xiao, Dawei Yang, Bo Li, Jia Deng, and Mingyan Liu. Meshadv: Adversarial meshes for visual recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6898–6907, 2019.
  • [15] Fazle Karim, Somshubra Majumdar, and Houshang Darabi. Adversarial attacks on time series. IEEE transactions on pattern analysis and machine intelligence, 2020.
  • [16] Hassan Ismail Fawaz, Germain Forestier, Jonathan Weber, Lhassane Idoumghar, and Pierre-Alain Muller. Adversarial attacks on deep neural networks for time series classification. In 2019 International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE, 2019.
  • [17] Shuai Jia, Chao Ma, Yibing Song, and Xiaokang Yang. Robust tracking against adversarial attacks. In European Conference on Computer Vision, pages 69–84. Springer, 2020.
  • [18] Chaowei Xiao, Bo Li, Jun-Yan Zhu, Warren He, Mingyan Liu, and Dawn Song. Generating adversarial examples with adversarial networks. arXiv preprint arXiv:1801.02610, 2018.
  • [19] Yunfeng Diao, Tianjia Shao, Yong-Liang Yang, Kun Zhou, and He Wang. Basar: Black-box attack on skeletal action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7597–7607, 2021.
  • [20] He Wang, Feixiang He, Zhexi Peng, Tianjia Shao, Yong-Liang Yang, Kun Zhou, and David Hogg. Understanding the robustness of skeleton-based action recognition under adversarial attack. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14656–14665, 2021.
  • [21] Edmilson da Silva Morais, Ron Hoory, Weizhong Zhu, Itai Gat, Matheus da Silva Damasceno, and Hagai Aronowitz. Speech emotion recognition using self-supervised features. 2021.
  • [22] Yuanchao Li, Peter Bell, and Catherine Lai. Fusing asr outputs in joint training for speech emotion recognition. arXiv preprint arXiv:2110.15684, 2021.
  • [23] Pengfei Liu, Kun Li, and Helen Meng. Group gated fusion on attention-based bidirectional alignment for multimodal emotion recognition. arXiv preprint arXiv:2201.06309, 2022.
  • [24] Ziwang Fu, Feng Liu, Hanyang Wang, Siyuan Shen, Jiahao Zhang, Jiayin Qi, Xiangling Fu, and Aimin Zhou. Lmr-cbt: Learning modality-fused representations with cb-transformer for multimodal emotion recognition from unaligned multimodal sequences. arXiv preprint arXiv:2112.01697, 2021.
  • [25] Chuanfei Hu, Weijie Sheng, Bo Dong, and Xinde Li. Tntc: two-stream network with transformer-based complementarity for gait-based emotion recognition. arXiv preprint arXiv:2110.13708, 2021.
  • [26] Venkatraman Narayanan, Bala Murali Manoghar, Vishnu Sashank Dorbala, Dinesh Manocha, and Aniket Bera. Proxemo: Gait-based emotion learning and multi-view proxemic fusion for socially-aware robot navigation. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 8200–8207. IEEE, 2020.
  • [27] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.
  • [28] Chao Li, Qiaoyong Zhong, Di Xie, and Shiliang Pu. Co-occurrence feature learning from skeleton data for action recognition and detection with hierarchical aggregation. arXiv preprint arXiv:1804.06055, 2018.
  • [29] Lei Shi, Yifan Zhang, Jian Cheng, and Hanqing Lu. Two-stream adaptive graph convolutional networks for skeleton-based action recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12026–12035, 2019.
  • [30] Pengfei Zhang, Cuiling Lan, Wenjun Zeng, Junliang Xing, Jianru Xue, and Nanning Zheng. Semantics-guided neural networks for efficient skeleton-based human action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1112–1121, 2020.