License: arXiv.org perpetual non-exclusive license
arXiv:2203.14045v2 [cs.CV] 28 Feb 2024

Adaptively Enhancing Facial Expression Crucial Regions via Local Non-Local Joint Network

Guanghui Shi, Shasha Mao, Shui** Gou, Dandan Yan, Licheng Jiao, and Lin Xiong
Manuscript received by Machine Intelligence Research. The paper can be accessed via https://link.springer.com/article/10.1007/s11633-023-1417-9. DOI: 10.1007/s11633-023-1417-9

Abstract

Facial expression recognition (FER) remains a challenging research problem because of the small inter-class discrepancy in facial expression data. Given the significance of facial crucial regions for FER, many existing studies use prior information from annotated crucial points to improve FER performance. However, manually annotating facial crucial points is complicated and time-consuming, especially for large wild expression datasets. To address this, this paper proposes a local non-local joint network that adaptively lights up the facial crucial regions during feature learning for FER. The proposed method consists of two parts built on facial local and non-local information, respectively: an ensemble of multiple local networks extracts local features corresponding to multiple facial local regions, and a non-local attention network explores the significance of each local region. In particular, the attention weights obtained by the non-local network are fed into the local part to achieve interactive feedback between the facial global and local information. Interestingly, the non-local weights corresponding to local regions are gradually updated, and higher weights are given to more crucial regions. Moreover, U-Net is employed to extract features that integrate the deep semantic information and the low-level detail information of expression images. Finally, experimental results show that the proposed method achieves competitive performance compared with several state-of-the-art methods on five benchmark datasets. Notably, analyses of the non-local weights corresponding to local regions demonstrate that the proposed method can automatically enhance some crucial regions during feature learning without any facial landmark information.

Index Terms:
Facial Expression Recognition, Deep Neural Network, Multiple Networks Ensemble, Attention Network.

I Introduction

Emotion is a complex state that integrates people’s feelings, thoughts and behaviors [1], and facial expression is one of the most direct signals for communicating innermost thoughts. Therefore, facial expression recognition (FER) [2, 3, 4, 5, 6] has attracted the attention of many researchers because of its important role in many practical applications, such as human-computer interaction, recommendation systems and patient monitoring. In general, a facial expression is encoded into facial action units through the facial action coding system [7, 8, 9], and any expression can be described by a set of facial action units. Some facial action units are crucial for FER [10], such as those located in the regions around the eyes and the mouth, since their movements are more obvious than those of other facial regions (such as the cheek and forehead). In the following, we refer to these crucial facial action units as facial crucial regions (FCRs). Fig.1 illustrates the facial crucial regions of two facial images (ID1 and ID2) for each of six expressions. Fig.1 suggests that the FCRs are more discriminative for determining the expression category of a facial image [11].

Figure 1: An illustration of facial crucial regions from six expressions, where two facial images (ID1 and ID2) are shown for each expression. The regions around eyes and mouths are cropped as examples of FCRs in the purple box and the green box, respectively.

In view of the significance of FCRs, many studies [12, 13, 14, 15] exploit the information of facial local regions, using facial landmarks as prior information about the facial crucial regions; these landmarks are obtained by manually annotating facial expression images. Early FER research [16, 17, 18] mostly focused on lab-collected expression datasets, such as CK+ [19], MMI [20], JAFFE [21] and Oulu-CASIA [22]. In lab-collected datasets, facial expression images are collected from several or dozens of individuals under similar conditions (such as illumination, angle and posture), generally with few uncontrollable factors. Thus, manually annotating the landmarks of FCRs is relatively easy for lab-collected datasets.

However, compared with lab-controlled datasets, wild expression datasets [23], such as RAF-DB [24], AffectNet [25] and EmotionNet [26], are collected under more complex and uncontrollable conditions. For wild expression datasets, especially those containing a vast number of images, manually annotating FCRs is very complicated and time-consuming. Moreover, the postures of different faces vary greatly in wild datasets, and even a simple change of facial posture can cause deviations of multiple pixels at the image level. Fig.2 gives an example of landmarks moving with a change of posture, where the two expression images and their landmarks are from the RAF-DB dataset [24]. From Fig.2, it is obvious that the 68 landmark points of image (a) differ from those of image (b) and that the landmarks shift considerably from (a) to (b), as shown in (c). This implies that the position of FCRs varies with facial posture. Inevitably, this increases the complexity of manually annotating landmarks for FER, especially for wild datasets with a vast number of images. In view of this, it is worth considering whether the significance of FCRs, or of their features, can be spontaneously enhanced during the training of deep FER without any prior information such as landmarks of FCRs.

Figure 2: Schematic diagram of the pixel deviations at the image level when the posture changes. To demonstrate this change, we measured the movement of 68 landmark points on faces with the same identity but different postures. In (a) and (b), the 68 landmark points are marked with green crosses, and (c) shows the movement of the 68 landmark points.

On the other hand, some FCRs from different expression categories are similar, whereas some FCRs from the same category are very different. In Fig.1, the mouth regions of ID1 are similar across the six expressions, with the mouth open, and are clearly different from those of ID2, where the mouth is closed. Similarly, for the eye regions, ID1 and ID2 differ within the category Fear, whereas ID1 from the category Surprise and ID2 from the category Anger are similar. This illustrates that FCRs of expression images belonging to the same category may be very different, while FCRs from different categories may be similar. Clearly, local information alone is insufficient for constructing an effective FER model, especially for wild datasets. Hence, it is still important to use the global information of the facial expression while the FCRs are enhanced in deep facial expression recognition.

Figure 3: A simple view of the proposed model (LNLAttenNet). The part in the green dotted box shows the global weights corresponding to 16 local regions (from Patch 1 to Patch 16) obtained by LNLAttenNet, and the part under the green dotted box is a simple framework of LNLAttenNet.

Based on the above analyses, we propose a new facial expression recognition method in this paper, which constructs a local non-local joint network, abbreviated as LNLAttenNet, to adaptively enhance the facial crucial regions during deep feature learning. In LNLAttenNet, the local and non-local information of facial expressions are considered simultaneously to construct the two parts of the network, a local multi-network ensemble and a non-local attention network, and the generated local and non-local feature vectors are then integrated and jointly optimized in feature learning. Specifically, the attention weights obtained by the non-local part are regarded as the significance of the facial local regions and are fed into the local multi-network ensemble to combine the multiple local networks. Interestingly, we find that some facial crucial regions are automatically enhanced during deep feature learning by the proposed method. Moreover, U-Net is employed to generate feature maps in which each pixel has a large receptive field, so that each local region also contains global information. Fig.3 shows a simple view of LNLAttenNet. In Fig.3, some crucial regions are given higher weights by LNLAttenNet, such as the 5th patch around the left eye (0.1123) and the 10th, 11th and 14th patches around the mouth (0.0887, 0.1073 and 0.1298), which illustrates that some crucial regions are effectively enhanced by LNLAttenNet. Note that $w_{i}$ is the non-local attention weight corresponding to the $i^{th}$ local region, and the initial weights are equal. More detailed descriptions are given in the following sections.

Compared with state-of-the-art methods, our main contributions are threefold:

  • We propose LNLAttenNet to automatically light up facial crucial regions in deep feature learning by using the local and non-local information of facial expressions simultaneously. To the best of our knowledge, it is the first work to study whether FCRs can be directly explored and enhanced in the feature learning of deep FER, where FCRs are automatically enhanced without any prior information about facial crucial regions or points. It effectively alleviates the difficulty of annotating wild datasets with a vast number of facial images.

  • In LNLAttenNet, an attention mechanism is introduced to construct the non-local attention network which explores the significance of local regions for FER from a global perspective of facial expression. The obtained attention weights corresponding to local regions are fed into the local multi-network ensemble system to integrate multiple local features, and then the integration of features obtained by multiple local networks is jointly optimized with the facial global feature.

  • Experimental results demonstrate that FCRs can be enhanced in deep feature learning by LNLAttenNet, which validates that FCRs are indeed the more discriminative local regions for FER. Moreover, the results imply that a deep FER model can spontaneously focus on some crucial regions during training, which may provide new inspiration for designing deep FER methods.

Figure 4: The framework of the proposed model (LNLAttenNet). LNLAttenNet uses U-Net to generate a feature map with the same resolution as the input image. Then, this feature map (Conv9-2) is cropped into $M$ local patches to construct the local multi-network ensemble model, where each patch is used to generate an individual network based on the structure of Simple Net. The feature map (Conv5-2) is used to construct the global attention network. Finally, the global and local features are integrated based on the global weights, followed by three fully connected layers.

The rest of this manuscript is organized as follows. Section II introduces related work on deep facial expression recognition. Section III describes the proposed method in detail. Section IV presents experimental results and analyses that validate the performance of the proposed method. Finally, Section V concludes the paper and discusses future work.

II Related Works

Owing to the excellent performance of deep learning, various deep networks have been applied to FER [23], such as VggNet [27], InceptionNet [28] and ResNet [29]. On this basis, many deep FER methods have been proposed to address different problems. In [30], Hu et al. first extended the idea of deep supervision to FER in the wild. The training of deep CNNs was made softer and easier by supervising not only deep layers but also intermediate and shallow layers, and a fusion structure was constructed in which earlier features were used for the second-level supervision. In [31], Acharya et al. argued that second-order statistics (such as covariance) are better suited to capturing the features of distorted facial expressions. In their framework, a manifold structure was constructed for covariance pooling to obtain competitive FER performance. In [32], Li et al. proposed a new deep manifold strategy for multi-label expressions; their network focused on ambiguous expressions and could learn discriminative features suitable for cross-database FER.

Considering that facial expression is determined by key regions, Fan et al. [12] used the information of facial landmark points to select three sub-images around the eyes, mouth and nose. The three sub-images were then encoded by three sub-networks, and the last pooling layers of the sub-networks were concatenated, which obtained better recognition performance than competing methods. In [33, 34], facial landmark information is used to extract features and generate masks from specific locations to remove pose variation.

In [35], it was noted that labeling errors and deviations between different databases are inevitable because of the subjectivity of labeling facial expressions. Therefore, when existing methods use multiple databases to expand the training set, their performance cannot be continuously improved. To resolve this inconsistency between databases, the Inconsistent Pseudo Annotations to Latent Truth (IPA2LT) framework was proposed to train a model from multiple inconsistent databases and large-scale unlabeled images. IPA2LT essentially constructs an ensemble at the label level: each image has as many labels as there are data sources, of which only one is original and the others are pseudo labels. Existing FER methods are almost satisfactory on frontal faces but fail to attain good performance on partially occluded faces collected in the wild. In addition, some facial expressions are ambiguous and have multiple labels. In [36], Gan et al. proposed a CNN-based framework supervised by soft labels, where hard labels are used to construct soft labels with a novel label-level perturbation. In this framework, soft labels were obtained to eliminate the similarity between faces of different emotions, and multiple base classifiers were trained and then combined. Moreover, some GAN-based methods have been proposed to generate expression images for FER [37, 38, 39], while others focus only on generating new facial expression images [40, 41, 42, 43]. In [37], facial expressions are learned by extracting the expressive component through a de-expression procedure, in which a trained generative model generates the corresponding neutral face for a given facial image with an arbitrary expression. In [40], a user-controllable approach is proposed to generate video clips of various lengths from a single face image, where the lengths and types of the expressions are controlled by users.

In [13], Li et al. proposed a CNN with an attention mechanism (ACNN) that detects the occlusion of facial regions and pays attention to the most discriminative regions, where ACNN uses the information of 24 facial landmark points to select the key regions at the feature level. In [44], Barros et al. investigated emotion-driven attention mechanisms for videos. In [45], Wang et al. proposed a two-level attention mechanism to extract emotion-related features, which is based on global information and does not involve local regions. As in [44, 45, 13], an attention mechanism is also involved in this work, but the essence of the algorithm is very different. Our purpose is to adaptively enhance the significance of facial crucial regions during feature learning, using the attention weights that the non-local attention network assigns to each of multiple local regions.

III Local Non-Local Joint Network for Facial Expression Recognition

In this paper, we propose a Local Non-Local Attention Joint Network for FER, named LNLAttenNet, to adaptively light up the more crucial local regions of facial expressions. The overall framework of LNLAttenNet is shown in Fig.4. In Fig.4, one facial expression image is used as the input instance of the proposed network, and its size is 144×144, the same as in our experiments.

Figure 5: Overview of the Non-Local attention model.

In LNLAttenNet, U-Net is first employed to extract feature maps that integrate the deep semantic information and the low-level detail information of facial expression images. For facial expression datasets, when regional integration is carried out [12], the inter-class discrepancy is smaller and the intra-class discrepancy is larger, as shown in Fig.1. U-Net [46, 47, 48] is a top-down architecture with lateral connections that introduce details into high-level semantic feature maps. It has been shown that local regions in its last few layers have a large receptive field and carry global information, which is important and useful for recognizing ambiguous objects [49, 50]. Therefore, U-Net helps alleviate the negative impact of regional integration, but the proposed method is not restricted to U-Net. In fact, any model with a structure similar to U-Net, such as FPN [49], can be employed in our proposed method.

As shown in Fig.4, facial expression images are input to the proposed model. U-Net generates two different feature maps for the input image, located in the last layer (Conv9-2) and the intermediate layer (Conv5-2) of U-Net, respectively. In the following, we use $\mathcal{F}_{5}$ and $\mathcal{F}_{9}$ to denote the feature maps from Conv5-2 and Conv9-2 of U-Net, respectively. The generated feature maps $\mathcal{F}_{5}$ and $\mathcal{F}_{9}$ are then used to construct the two parts of LNLAttenNet: the map $\mathcal{F}_{5}$ is the input of the non-local part (the Non-Local Attention Network), and the map $\mathcal{F}_{9}$ is the input of the local part (the Local Multi-Networks Ensemble System). In the local part, an ensemble of multiple networks generates and integrates multiple individual networks corresponding to different facial local regions. The non-local attention network produces an attention weight $w_{i}$ ($i=1,...,M$) for the $i^{th}$ local region of the facial expression, and the vector $\mathbf{w}$ ($[w_{1},...,w_{M}]^{T}$) is used as the weights of the multiple local networks to combine the $M$ local vectors and, at the same time, boost the significance of local regions during deep feature learning. Finally, the non-local attention network and the local ensemble network are jointly optimized by integrating local and non-local features in the three fully connected layers of LNLAttenNet. More detailed descriptions of the proposed method are given below.

III-A Non-Local Attention Network

In facial expression recognition, expression images exhibit small inter-class discrepancy and large intra-class discrepancy, as shown in Fig.1. Therefore, facial crucial regions, such as the regions around the mouth and eyes rather than the cheek, are regarded as the more discriminative regions that determine the category of a facial expression. However, it is hard to estimate which regions are more crucial without the assistance of manually annotated crucial points. For this reason, we construct the Non-Local Attention Network to automatically mine the more discriminative regions from the whole facial expression, shown in the box with orange dotted lines in Fig.4.

In Fig.4, the feature map $\mathcal{F}_{5}$ (Conv5-2) is generated by U-Net and serves as the global information of the facial image for constructing the non-local attention network. Conv5-2 has the minimum resolution and the maximum receptive field, which means that $\mathcal{F}_{5}$ is not affected by any single local patch but implicitly contains the relationships between local patches. This makes it useful for mining more crucial regions based on the global information of the whole face.

Figure 6: Overview of the local attention.

III-A1 Global Attention

Inspired by [51, 52], we construct a non-local attention model based on three branches, as shown in Fig.5. The input is the map $\mathcal{F}_{5}$, which contains the global information of the facial expression. From $\mathcal{F}_{5}$, three feature maps $\mathcal{Q}$, $\mathcal{K}$ and $\mathcal{V}$ are generated, each by one convolution layer and one pooling layer. Note that the three maps have a special resolution of $n \times n$ in this model, where $M=n^{2}$ and $M$ is the number of cropped local regions. (Footnote 1: This special resolution is set so that the correlation between patches can be calculated conveniently. For example, when the number of cropped local regions is set to 16 ($M=16$) in our experiments, the special resolution is $4 \times 4$ ($n=4$), as shown in Fig.5.) Then, the maps $\mathcal{Q}$ and $\mathcal{K}$ are reshaped as $\mathbf{Q}^{*}$ and $\mathbf{K}^{*}$, as shown in Fig.5, and multiplied to obtain a matrix $\mathbf{R}$ that reflects the correlation among local regions. Compared with [51, 52], the relevance between regions (patches) in LNLAttenNet is not as strong as that between frames in a video or words in a sentence, and thus $L_{1}$ normalization, instead of the softmax function, is adopted to limit the sum of each row of $\mathbf{R}$ to 1. Finally, a vector is calculated by averaging each column of the correlation matrix $\mathbf{R}$, and it is regarded as the non-local attention weights $\mathbf{w}^{g}$ assigned to the $M$ local regions.
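
The weight computation described above can be sketched in NumPy as follows. This is a minimal sketch and not the authors' implementation: the function name `non_local_weights`, the random stand-ins for the pooled maps, and the small `eps` guard are assumptions, while the shapes follow Table I with $M=16$.

```python
import numpy as np

def non_local_weights(q_map, k_map, eps=1e-12):
    """Return (R, w_g) from pooled Q and K maps of shape (n, n, channels).

    Assumes non-negative inputs (e.g. after ReLU and max pooling), so that
    L1 normalization makes every row of R sum to 1.
    """
    n, _, channels = q_map.shape
    m = n * n                                  # number of local regions, M = n^2
    q_star = q_map.reshape(m, channels)        # Q*: (M, channels)
    k_star = k_map.reshape(m, channels).T      # K*: (channels, M)
    r = q_star @ k_star                        # correlation among regions: (M, M)
    r = r / (np.abs(r).sum(axis=1, keepdims=True) + eps)  # L1-normalize each row
    w_g = r.mean(axis=0)                       # average each column -> (M,)
    return r, w_g

rng = np.random.default_rng(0)
q = rng.random((4, 4, 512))   # hypothetical stand-in for the pooled Q map
k = rng.random((4, 4, 512))   # hypothetical stand-in for the pooled K map
R, w_g = non_local_weights(q, k)
print(R.shape, w_g.shape, float(w_g.sum()))
```

Because every row of $\mathbf{R}$ sums to 1, the $M$ column averages also sum to 1, so a uniform assignment would correspond to $1/M$ per region.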

Furthermore, the map $\mathcal{V}$ is reshaped as $\mathbf{V}^{*}$, and the feature vector $\mathbf{s}$ is obtained by multiplying $\mathbf{V}^{*}$ by the correlation matrix $\mathbf{R}$, which is the self-attention form in [51, 52]. In order to make the matrix $\mathbf{R}$ reflect the correlation among local regions, $\mathbf{s}$ is flattened and added to the non-local vector $\mathbf{g}$ (shown in Fig.4). A function is used to trade off the two vectors $\mathbf{g}$ and $\mathbf{s}$:

$$\mathbf{g}^{*}=(1-\alpha)\cdot\mathbf{g}+\alpha\cdot \mathrm{flat}(\mathbf{s}), \tag{1}$$

where $\mathbf{g}^{*}$ denotes the new non-local vector and $\alpha$ is the hyper-parameter that adjusts the ratio of $\mathbf{s}$. An analysis of the parameter $\alpha$ is given in the experiments.
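
Equation (1) can be illustrated with a short NumPy sketch. It assumes, based on Table I and the 8192-dimensional global vector mentioned in Section III-C, that $\mathbf{s}$ has shape 16×512 so that its flattened form matches the length of $\mathbf{g}$. The function name, the random inputs and the value of `alpha` are placeholders, not the authors' settings.

```python
import numpy as np

def blend_non_local(g, r, v_star, alpha):
    """Compute g* = (1 - alpha) * g + alpha * flat(s), with s = R @ V*."""
    s = r @ v_star                    # (M, M) @ (M, channels) -> (M, channels)
    flat_s = s.reshape(-1)            # flatten to length M * channels
    if flat_s.shape != g.shape:
        raise ValueError("flat(s) and g must have the same length")
    return (1.0 - alpha) * g + alpha * flat_s

rng = np.random.default_rng(0)
M, channels = 16, 512
r = rng.random((M, M))
r /= r.sum(axis=1, keepdims=True)     # rows sum to 1, as after L1 normalization
v_star = rng.random((M, channels))    # hypothetical reshaped V map
g = rng.random(M * channels)          # hypothetical 8192-dimensional global vector
g_star = blend_non_local(g, r, v_star, alpha=0.5)   # alpha=0.5 is a placeholder
print(g_star.shape)
```

Setting `alpha=0` returns $\mathbf{g}$ unchanged, and `alpha=1` returns only the flattened self-attention feature, which is the trade-off the hyper-parameter controls.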

III-B Local Multi-Networks Ensemble

The feature map $\mathcal{F}_{9}$ is the input of the Local Multi-Networks Ensemble part, as shown in Fig.4. The map $\mathcal{F}_{9}$ is used because each pixel in Conv9-2 has a large receptive field and rich semantic information, and $\mathcal{F}_{9}$ has the same resolution as the input image. In the Local Multi-Networks Ensemble, the feature map $\mathcal{F}_{9}$ is first divided into $M$ patches (covering different local regions) of the same dimension (set to 48×48×64 in our experiments). Then, the $M$ patches are used to train Simple Network (Footnote 2: The basic structure of Simple Network is shown in Fig.7; it is composed of six convolution layers and three pooling layers.) to generate $M$ individual networks $\{\mathcal{IN}_{1},...,\mathcal{IN}_{M}\}$, respectively. In particular, a local attention mechanism is added to each individual network to enhance the feature vector of each local region. Finally, the $M$ local feature vectors are combined using the non-local attention weights obtained by the Non-Local Attention Network.
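
As an illustration of the cropping step, the sketch below extracts $M=16$ patches of size 48×48 from a 144×144×64 feature map. The text does not state how the patches are positioned. The evenly spaced 4×4 grid of overlapping windows (stride 32) used here is an assumption, chosen only because it lets 16 patches of that size fit, and the function name `crop_patches` is hypothetical.

```python
import numpy as np

def crop_patches(feature_map, patch=48, grid=4):
    """Crop grid*grid equally sized, evenly spaced patches from an (H, W, C) map."""
    h, w, _ = feature_map.shape
    # Evenly spaced top-left corners; windows overlap when grid * patch > H.
    ys = np.linspace(0, h - patch, grid).astype(int)
    xs = np.linspace(0, w - patch, grid).astype(int)
    return np.stack([feature_map[y:y + patch, x:x + patch, :]
                     for y in ys for x in xs])

f9 = np.zeros((144, 144, 64), dtype=np.float32)   # stand-in for the Conv9-2 map
patches = crop_patches(f9)
print(patches.shape)
```

Each of the resulting patches would then be passed to its own individual network; a different layout would only change how the corner coordinates `ys` and `xs` are chosen.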

Figure 7: The structure of Simple Network

III-B1 Local Attention

In practice, the useful information in a patch decreases when part of the patch is missing or occluded, which means that such patches should receive less attention. In view of this, a local attention mechanism is adopted in each individual network to weaken the significance of uninformative regions. The local attention model consists of four convolution layers and two fully connected layers, and its structure is shown in Fig.6. Note that two of the convolution layers are not padded in order to reduce the computational complexity. The input of the local attention model is the output of the last pooling layer in Simple Network, and its output is a value between 0 and 1 obtained via the sigmoid function. This value is regarded as the local attention weight $w_{i}^{l}$ of the individual network and represents how much of the information in the local patch can flow to the next level. If the facial local region is occluded or missing, it contains less information for expression recognition, and the local attention weight is reduced accordingly to alleviate the effect of patches that include the occluded region. Furthermore, each weight is multiplied by the corresponding local vector to form the output feature of the local network. More visual illustrations can be found in the experiments.

| Operation | Activation | Output shape ($\mathcal{Q}$) | Output shape ($\mathcal{K}$) | Output shape ($\mathcal{V}$) |
| --- | --- | --- | --- | --- |
| Conv 1×1, s:1 | ReLU | 9×9×512 | 9×9×512 | 9×9×512 |
| MaxPooling 2×2, s:2 | - | 4×4×512 | 4×4×512 | 4×4×512 |
| Reshape | - | 16×512 | 512×16 | 16×512 |

TABLE I: The structure of Non-Local attention.

III-B2 Combination of Multiple Local Networks

According to the non-local attention weights $\mathbf{w}^{g}$ and the local attention weights $\mathbf{w}^{l}$, the local feature vectors given by the $M$ individual networks $\{\mathcal{IN}_{1},...,\mathcal{IN}_{M}\}$ are aggregated by the formula

$$\mathbf{f}_{en}=\sum_{i=1}^{M}w_{i}^{g}\cdot w_{i}^{l}\,\mathbf{f}_{i}, \tag{2}$$

where $\mathbf{f}_{en}$ denotes the ensemble feature vector, $\mathbf{f}_{i}$ denotes the feature vector given by $\mathcal{IN}_{i}$ corresponding to the $i^{th}$ local region, $w_{i}^{g}$ is the non-local attention weight of the $i^{th}$ local region, and $w_{i}^{l}$ is the local attention weight of the $i^{th}$ local region. An analysis of the number $M$ of local patches is given in the experiments.
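
A minimal NumPy sketch of the aggregation in Eq. (2) is given below. The function name and the random inputs are illustrative assumptions. The local feature length of 9216 is also an assumption, inferred from the dimensions reported in Section III-C (17408 for the integrated vector minus 8192 for the global vector).

```python
import numpy as np

def ensemble_feature(local_feats, w_g, w_l):
    """f_en = sum_i w_g[i] * w_l[i] * f_i for local_feats of shape (M, D)."""
    weights = w_g * w_l                 # combined weight per local region, shape (M,)
    return weights @ local_feats        # weighted sum over regions, shape (D,)

M, D = 16, 9216                         # D is an assumed local feature length
rng = np.random.default_rng(0)
feats = rng.random((M, D))              # hypothetical outputs of the M local networks
w_g = np.full(M, 1.0 / M)               # equal non-local weights (placeholder)
w_l = rng.random(M)                     # placeholder local attention weights in [0, 1)
f_en = ensemble_feature(feats, w_g, w_l)
print(f_en.shape)
```

A region whose local weight is close to 0, for example an occluded patch, contributes almost nothing to $\mathbf{f}_{en}$ regardless of its non-local weight.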

III-C Joint Optimization of LNLAttenNet

In Fig.4, the non-local feature vector 𝐠*superscript𝐠{\bf{g}}^{*}bold_g start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT is produced by the non-local attention network, and the local vector 𝐟ensubscript𝐟𝑒𝑛{\bf{f}}_{en}bold_f start_POSTSUBSCRIPT italic_e italic_n end_POSTSUBSCRIPT is obtained by the local multi-network ensemble. Inspired by [53], we think that the global information of an input image is essential, and each local patch can get large receptive field and the global information by embedding U-Net, which makes it easier to classify the similar patch of facial expression of different categories. Moreover, Conv5-2 is encoded to a global vector with 8192 dimension by two convolution layers and one pooling layer. Then, the non-local vector 𝐠*superscript𝐠{\bf{g}}^{*}bold_g start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT is concatenated with the local vector 𝐟ensubscript𝐟𝑒𝑛{\bf{f}}_{en}bold_f start_POSTSUBSCRIPT italic_e italic_n end_POSTSUBSCRIPT to obtain the total vector as the feature of the first fully connected layer and is jointly optimized, and the dimension of the integrated feature vector is 17408 shown as in Fig.4. In LNLAttenNet, three full connect layers are implemented, and the loss function is formulated as

$$L = loss_{entropy} + \gamma \, loss_{l2}, \tag{3}$$

where $loss_{entropy}$ expresses the cross entropy loss, $loss_{l2}$ is the l2 regularization loss, and $\gamma$ is the hyper-parameter controlling the balance between two losses. The cross entropy is calculated as:

$$loss_{entropy} = \frac{1}{N}\sum_{n=0}^{N-1}\sum_{c=0}^{C-1}\mathbb{L}(l_{n}=c)\cdot \log(p_{n}^{i}), \tag{4}$$

where $C$ is the number of categories, $N$ is the number of input images, and $\mathbb{L}$ is the indicator function that determines whether its argument holds. $p_{n}^{i}$ is the $i^{th}$ component of the output of the last softmax layer for the $n^{th}$ image, and $l_{n}$ is the label of the $n^{th}$ input image. The l2 regularization loss is computed by $loss_{l2}=\lambda\cdot{||W||}^{2}$, where $W$ denotes the parameters of our model and $\lambda$ is set as 0.0001 in the following experiments.
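As a rough illustration of Eqs. (3) and (4), the following NumPy sketch evaluates the total loss on a toy batch. The function names, the toy inputs and the value of $\gamma$ are illustrative assumptions (the text does not state $\gamma$), and the sketch applies the conventional negative sign to the log-likelihood so that the loss is non-negative, which is also an assumption about the intended sign convention.

```python
import numpy as np

def cross_entropy(probs, labels):
    """Mean cross entropy over N samples; probs has shape (N, C), labels has shape (N,)."""
    n = probs.shape[0]
    # The indicator in Eq. (4) keeps only the softmax output of the true class.
    picked = probs[np.arange(n), labels]
    # Conventional negative log-likelihood sign (assumed here).
    return -np.mean(np.log(picked))

def l2_loss(weights, lam=1e-4):
    """lambda * ||W||^2 summed over all parameter arrays (lambda = 0.0001 as in the text)."""
    return lam * sum(float(np.sum(w ** 2)) for w in weights)

def total_loss(probs, labels, weights, gamma=1.0, lam=1e-4):
    # gamma = 1.0 is a placeholder; the text does not give its value.
    return cross_entropy(probs, labels) + gamma * l2_loss(weights, lam)

# Toy batch: two images, three classes, and two small parameter arrays.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
labels = np.array([0, 1])
weights = [np.ones((2, 2)), np.ones(3)]
loss = total_loss(probs, labels, weights)
```

Keeping the two terms in separate functions makes it easy to check how strongly the regularizer contributes for a given $\gamma$.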

IV Experiments and Analyses

In this section, we validate the performance of the proposed method from several aspects: 1) the performance comparison with state-of-the-art methods on benchmark datasets, 2) the analyses of Non-Local Attention, 3) the visualization of Local Attention, 4) the change of the parameter $\alpha$, 5) the performance of LNLAttenNet with different $M$, and 6) the analyses of overlapped pixels between local regions.

TABLE II: The structure of SimpleNet.

| Operation | Activation | Output shape |
|---|---|---|
| Conv 3×3, s:1 | ReLU | 48×48×64 |
| Conv 3×3, s:1 | ReLU | 48×48×64 |
| MaxPooling 2×2, s:2 | - | 24×24×64 |
| Conv 3×3, s:1 | ReLU | 24×24×128 |
| Conv 3×3, s:1 | ReLU | 24×24×128 |
| MaxPooling 2×2, s:2 | - | 12×12×128 |
| Conv 3×3, s:1 | ReLU | 12×12×256 |
| Conv 3×3, s:1 | ReLU | 12×12×256 |
| MaxPooling 2×2, s:2 | - | 6×6×256 |

TABLE III: The structure of local attention.

| Operation | Activation | Output shape |
|---|---|---|
| Conv 3×3, s:1, no padding | ReLU | 4×4×256 |
| Conv 1×1, s:1 | ReLU | 4×4×128 |
| Conv 3×3, s:1, no padding | ReLU | 2×2×128 |
| Conv 1×1, s:1 | ReLU | 2×2×64 |
| Reshape | - | 256 |
| Full connect | - | 64 |
| Full connect | - | 1 |
| Sigmoid | - | 1 |
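
The output shapes in Tables II and III can be traced with a few lines of arithmetic. The sketch below is only a consistency check under stated assumptions: the 48×48 patch input and the "same" padding of the SimpleNet convolutions are inferred from the listed output shapes, and feeding the 6×6×256 SimpleNet output into the local attention is likewise inferred from the first row of Table III. The helper name `conv_out` is hypothetical.

```python
def conv_out(size, kernel, stride=1, padding="same"):
    """Spatial output size of a square convolution or pooling layer."""
    if padding == "same":
        return -(-size // stride)          # ceil(size / stride)
    return (size - kernel) // stride + 1   # "valid", i.e. no padding

# SimpleNet (Table II): three blocks of two 3x3 convolutions and one 2x2 max-pooling.
size = 48                                  # assumed patch size
simple_net = []
for channels in (64, 128, 256):
    size = conv_out(size, 3)               # first 3x3 conv, stride 1
    size = conv_out(size, 3)               # second 3x3 conv, stride 1
    size = conv_out(size, 2, stride=2)     # 2x2 max-pooling, stride 2
    simple_net.append((size, size, channels))

# Local attention (Table III), assumed to start from the 6x6x256 SimpleNet output.
s = conv_out(6, 3, padding="valid")        # 3x3 conv, no padding
s = conv_out(s, 1)                         # 1x1 conv
s = conv_out(s, 3, padding="valid")        # 3x3 conv, no padding
s = conv_out(s, 1)                         # 1x1 conv, 64 channels
flattened = s * s * 64                     # vector length after Reshape
```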

IV-A Databases and Setups

In experiments, we employ five FER datasets to evaluate the performance of LNLAttenNet: RAF-DB [24], SFEW [54], AffectNet [25], CK+ [19] and MMI [20].

  • RAF-DB contains 29,672 facial images downloaded from the Internet. For the RAF-DB dataset, the facial landmarks are manually annotated via the crowdsourcing method with basic or compound expressions. In experiments, we use the basic database including 12,271 training and 3,068 testing images.

  • SFEW contains static images selected from movie clips with spontaneous expressions, where the labels of the training set and the validation set are given. Therefore, 958 training images are used as the training set and 436 validation images are used as the testing set in experiments.

  • AffectNet contains 450,000 images with 10 categories, where each image is annotated by one volunteer. In experiments, we use 287,401 images with neutral and six basic emotions, where 283,901 images are selected as the training set and 3,500 images are selected from the validation set as the testing set.

  • CK+ contains 593 sequences from 123 volunteers, where 309 sequences have been annotated with six basic emotions. The emotion in each sequence goes from neutral to peak and then to neutral again. In view of this, we select the first frame of each sequence with the label of neutral and the peak frame of each sequence with the target label to generate 618 experimental images.

  • MMI is recorded from 30 subjects with rich details of annotations, and 398 images are generated by selecting the first frame of each sequence with the label of neutral and one peak frame of each sequence.

For the RAF-DB and SFEW datasets, their training sets are directly used to train the model and their testing sets are used to evaluate the performance. For the AffectNet dataset, its training set is used to train the model, and its validation set is used as the testing set, since the annotated labels of the AffectNet testing set are not given [25]. For the CK+ and MMI datasets, we adopt the five-fold cross-validation scheme to evaluate the recognition performance, in order to make a fair comparison with other methods. Additionally, in order to compare fairly with the state-of-the-art FER methods, we initialize the parameters of U-Net with the Xavier initializer [55] rather than pre-training. In experiments, the original images are resized to 144×144, and the training images are augmented by standard approaches, such as image flips and random cropping. The number $M$ of local regions is set as 16, each patch (local region) overlaps about 16 pixels with its adjacent patches, and the parameter $\alpha$ is set as 0.7 in Eq.(1). The size of the epoch is set to 24, the initial learning rate is 0.0003, and the weight decay is set as 0.95 each epoch.
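
The patch layout described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a regular 4×4 grid, a patch size of 48 pixels (inferred from the first output shape in Table II), an overlap of exactly 16 pixels, and row-major patch order. The function name `extract_patches` is hypothetical.

```python
import numpy as np

def extract_patches(image, grid=4, patch=48, overlap=16):
    """Split a square image into grid x grid overlapping patches (row-major order)."""
    stride = patch - overlap               # 32 pixels between neighboring patch origins
    patches = []
    for r in range(grid):
        for c in range(grid):
            top, left = r * stride, c * stride
            patches.append(image[top:top + patch, left:left + patch])
    return np.stack(patches)

# A blank 144x144 RGB image stands in for a resized face image.
image = np.zeros((144, 144, 3), dtype=np.float32)
patches = extract_patches(image)           # one 48x48x3 array per local region
```

With these values the last patch starts at pixel 96 and ends at pixel 144, so the 16 patches cover the resized image exactly.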

In Tables I, III and II, we give the structures of the non-local attention network, the local attention and the simple net, respectively. For the non-local attention network, we only show the convolution layers and the pooling layers; operations such as reshaping and matrix multiplication are not shown. All experiments are implemented with the TensorFlow framework on a GTX 2080Ti with 11G memory.

Figure 8: Confusion matrix of the proposed model (LNLAttenNet) on RAF-DB database.
Figure 9: Confusion matrix of the proposed model (LNLAttenNet) on AffectNet database.

IV-B Comparisons with State-of-the-Art Methods

In order to validate the performance of the proposed method, we first compare it with eight state-of-the-art methods on five datasets. The eight compared methods are VGG16 [27], DLP-CNN [24], NAL [56], Soft-CNN [36], CenterLoss [57], gACNN [13], LDL-ALSG [58] and IPA2LT [35], where VGG16 is applied as the baseline method in experiments.

  • DLP-CNN [24] decomposes the image structurally rather than spatially into regions (parts) which are discriminative for matching. According to the representations over the regions, it aggregates discriminative features for classification.

  • NAL [56] utilizes a noise adaptation layer to address the problem of noise labels.

  • Soft-CNN [36] fuses the latent label probability distribution predicted by the trained model to obtain soft labels with a novel label-level perturbation strategy.

  • CenterLoss [57] minimizes the center loss calculated by the distance between each data and its corresponding class center to reduce the intra-class discrepancy.

  • gACNN [13] uses 24 facial landmarks as the attention mechanism to conduct multi-region ensemble at the feature level.

  • LDL-ALSG [58] considers the subjectivity of human annotators and the ambiguous expression labels and then leverages the topological information of the labels from related but more distinct tasks, such as AU recognition and facial landmark detection, to explore the label distribution of facial expressions.

  • IPA2LT [35] employs an inconsistent pseudo annotations framework to solve the inconsistent annotations between different facial expression databases.

Notably, IPA2LT [35] uses both RAF and AffectNet as the training set, unlike our method (LNLAttenNet) and the other compared methods, where only the training set of one dataset is employed to train a model. In LNLAttenNet, both non-local attention and local attention mechanisms are utilized. Thus, we also make a comparison with three special cases of our model: the model without both local and non-local attention (Model-S), the model only with local attention (Model-Local), and the model only with non-local attention (Model-NonLocal). Table IV shows the experimental results of 12 models, where the highest accuracy is shown in bold for each dataset. All results are the average of the last 10 epochs.

TABLE IV: Accuracy (%) of the proposed method (LNLAttenNet) compared with state-of-the-art methods.

| Methods | AffectNet | RAF-DB | SFEW | CK+ | MMI | Average |
|---|---|---|---|---|---|---|
| VGG16 [27] | 51.11 | 80.96 | 54.45 | 90.37 | 63.21 | 68.02 |
| DLP-CNN [24] | 54.47 | 80.89 | - | - | - | - |
| NAL [56] | 55.97 | 84.22 | 58.13 | 91.20 | 64.71 | 70.85 |
| Soft-CNN [36] | 56.77 | 85.20 | 55.73 | - | - | - |
| CenterLoss [57] | 57.37 | 84.42 | 56.19 | 95.48 | - | - |
| gACNN [13] | 58.78 | 85.07 | - | 97.03 | - | - |
| LDL-ALSG [58] | 59.35 | 85.53 | 56.50 | 93.08 | 70.49 | 72.99 |
| Model-S | 56.26 | 83.80 | 54.82 | 94.14 | 63.52 | 70.51 |
| Model-Local | 57.63 | 84.55 | 56.42 | 96.44 | 65.42 | 72.09 |
| Model-NonLocal | 58.09 | 85.04 | 55.73 | 96.63 | 66.56 | 72.41 |
| LNLAttenNet | 59.28 | 86.15 | 57.80 | 98.18 | 68.75 | 74.03 |
| IPA2LT [35] | 55.11 | 86.77 | 58.29 | 91.67 | 65.61 | 71.49 |

Figure 10: Non-Local weights of 16 local regions of one face in RAF-DB obtained by the proposed model. The first and third rows show the facial images, and the second and fourth rows show the Non-Local weights of the 16 local regions corresponding to the images.

From Table IV, it is seen that the proposed method (LNLAttenNet) outperforms the compared methods in most cases; the exceptions are LDL-ALSG on AffectNet and MMI, IPA2LT on RAF-DB and SFEW, and NAL on SFEW. Unlike LNLAttenNet, IPA2LT [35] utilizes two large datasets (RAF and AffectNet) as the training set, which contributes to its better performance. Nevertheless, LNLAttenNet still achieves a competitive performance on two datasets (RAF-DB and SFEW) and outperforms IPA2LT on three datasets (AffectNet, CK+ and MMI). Compared with LDL-ALSG [58], LNLAttenNet performs better on RAF-DB, SFEW and CK+, is nearly tied on AffectNet (59.28% vs. 59.35%) and is lower on MMI. The last column of Table IV shows the average accuracy over the five datasets for each method that reports results on all of them. LNLAttenNet obtains the highest average accuracy, 74.03%, which illustrates that LNLAttenNet achieves a more competitive overall FER performance on the five datasets than the eight compared methods.
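
The averages in the last column of Table IV can be reproduced directly from the per-dataset accuracies. The short sketch below copies the rows that report all five datasets and recomputes the mean; the dictionary name `table_iv` is just an illustrative choice.

```python
# Accuracy (%) copied from Table IV for the methods with results on all five datasets.
# Column order: AffectNet, RAF-DB, SFEW, CK+, MMI.
table_iv = {
    "VGG16":          [51.11, 80.96, 54.45, 90.37, 63.21],
    "NAL":            [55.97, 84.22, 58.13, 91.20, 64.71],
    "LDL-ALSG":       [59.35, 85.53, 56.50, 93.08, 70.49],
    "Model-S":        [56.26, 83.80, 54.82, 94.14, 63.52],
    "Model-Local":    [57.63, 84.55, 56.42, 96.44, 65.42],
    "Model-NonLocal": [58.09, 85.04, 55.73, 96.63, 66.56],
    "LNLAttenNet":    [59.28, 86.15, 57.80, 98.18, 68.75],
    "IPA2LT":         [55.11, 86.77, 58.29, 91.67, 65.61],
}

# Mean accuracy per method, rounded to two decimals as in the table.
averages = {method: round(sum(acc) / len(acc), 2) for method, acc in table_iv.items()}
best = max(averages, key=averages.get)
```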

Furthermore, it is found that Model-S is inferior to all of Model-Local, Model-NonLocal and LNLAttenNet, which demonstrates that the attention mechanism is meaningful for improving the performance of FER in our model. Meanwhile, Model-NonLocal is slightly better than Model-Local but obviously inferior to LNLAttenNet, which also demonstrates our model jointly utilizing local and non-local information of facial expression is more effective. In short, the experimental results illustrate that adaptively enhancing the facial crucial regions in feature learning by LNLAttenNet is effective for improving the performance of FER.

Considering that the RAF and AffectNet datasets contain a large number of images, we also show their confusion matrices in Fig.8 and Fig.9, respectively. According to the confusion matrices, it is observed that the categories (fear and surprise) are easily distinguishable for RAF-DB (shown in Fig.8) and the categories (disgust and anger) are easily distinguishable for AffectNet (shown in Fig.9).

Figure 11: 16 non-local weights of two input images. In the first row, the input image and the non-local weights corresponding to each patch are shown. In the second and third rows, the six figures show the non-local weights of the input images at different training stages, respectively. The last row shows the final non-local weights obtained by our model.
Figure 12: The change of weights $\mathbf{w}^{g}$ corresponding to 16 local regions in the training process of LNLAttenNet. The abscissa represents the number of iterations in the training process and the ordinate represents the magnitude of the weight corresponding to each iteration.

IV-C Analyses of Non-Local Attention

LNLAttenNet adaptively enhances the feature learning of facial crucial regions by jointly optimizing the local and non-local parts, where the non-local attention network is constructed to obtain the global weights $\mathbf{w}^{g}$ of multiple local regions. One purpose of our work is to explore how to automatically enhance the significance of local crucial regions in deep FER when no landmarks are given as prior information about facial crucial regions. To validate this, we analyze the weights of 16 local regions obtained by our non-local attention on the RAF-DB dataset.

First, the visualization results from 16 persons are shown in Fig.10. In Fig.10, the first and third rows show the original facial expression images, and the second and fourth rows exhibit the matrix (4×4) of the final global weights $\mathbf{w}^{g}$ (16×1) corresponding to 16 local regions. For each matrix, the darker the color is, the higher the weight is. From Fig.10, it is obvious that some crucial regions obtain higher weights and non-crucial regions get smaller weights for each facial expression. For example, the areas including or around the eyes are given higher weights for the first person in the first row, where the maximum is given to the local region located at the coordinate (2,2), which includes the eyes. For the sixth person in the first row, four local regions (located at (3,2), (3,3), (4,2), (4,3)) including his mouth are boosted and given higher weights. In the third and fourth rows, the local regions located around the eyes and the mouth are boosted for the second person, and the whole regions including the eyes are given higher weights for the last person. Visually, these enhanced local regions are more discriminative and significant for FER.
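
To make the coordinate convention concrete, the following sketch reshapes a 16×1 weight vector into the 4×4 matrix and lists the regions with the largest weights. The weights here are toy values, not taken from the paper, and the row-major layout with 1-indexed (row, column) coordinates is an assumption chosen to match the coordinates quoted above. The function name `top_regions` is hypothetical.

```python
import numpy as np

def top_regions(w_g, k=4, grid=4):
    """Return the weight matrix and the 1-indexed (row, column) of the k largest weights."""
    matrix = np.asarray(w_g, dtype=float).reshape(grid, grid)  # row-major layout assumed
    order = np.argsort(matrix, axis=None)[::-1][:k]            # flat indices, largest first
    rows, cols = np.unravel_index(order, matrix.shape)
    return matrix, [(int(r) + 1, int(c) + 1) for r, c in zip(rows, cols)]

# Toy weights: four regions in rows 3-4, columns 2-3 (a mouth-like area) are emphasized.
w_g = np.full(16, 0.05)
w_g[[9, 10, 13, 14]] = [0.09, 0.08, 0.07, 0.06]
matrix, coords = top_regions(w_g)
```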

From Fig.10, it is also observed that the location of crucial regions differs across facial images. Nevertheless, our network still automatically finds the more discriminative regions for each face, without the supervision of any annotated crucial points. Based on this, we secondly conduct an experiment to track the change of the weights corresponding to each local region during the training of our model. Fig.11 shows the change of non-local weights in the training process. In Fig.11, the first row shows the original image and its final global weights obtained by our model, the second and third rows show the global weights of 16 local regions at the initial, 250$^{th}$, 500$^{th}$, 750$^{th}$, 1000$^{th}$ and 1250$^{th}$ iterations, respectively, and the last row shows the final weights. From Fig.11, it is seen that the non-local weights of all local patches are the same at the beginning of training, which implies that each local region is initially regarded as equally important. As training proceeds, the local regions are given different weights, and higher weights are given to the more discriminative regions, such as the patches (located at (4,2) and (4,3)) including the mouth shown in Fig.10(a), the patches (located at (3,2), (3,3), (4,2) and (4,4)) in Fig.10(b), etc. It illustrates that the more crucial local regions can be adaptively enhanced in the training of our network without any landmarks.

Figure 13: Local weights of 16 patches of each face on RAF-DB obtained by the proposed model. The first column shows the results corresponding to original images, and the second to seventh columns show the results corresponding to obscured images.

In order to better observe the change of weights, we also show the weights corresponding to 16 local regions over all iterations in Fig.12. From Fig.12, it is seen that the weight values fluctuate at the beginning of network training and gradually stabilize toward the end of training. Some patches that are visually more discriminative are highlighted with higher weights, and some patches located in non-crucial regions are suppressed with smaller weights. In summary, the analyses of non-local weights demonstrate that the proposed method can automatically and effectively enhance the significance of facial crucial regions in deep feature learning, without any given prior information about facial crucial regions.

Figure 14: The change in the non-local weight at different $\alpha$.

IV-D Visualization of Local Attentions

In the proposed method, the local attention is designed to deal with the problem that local regions are missing or obscured. In this part, the visualization of local attentions is shown to validate the robustness of the proposed method for faces with missing regions, based on experiments on the RAF-DB database. Note that the sigmoid function is employed to select the information flowing into the next layer in our local attention model. Fig.13 shows visual results of local attentions obtained by our method.

In Fig.13, the 1st and 3rd rows show one original facial image and six obscured images (from the 2nd to the 7th columns), and the 2nd and 4th rows show the weights of 16 patches of each facial image obtained by our method. Compared with the result of the original images (shown in the first column of Fig.13), it is found that the weight of a patch is weakened when that patch is obscured, while the weights of the other patches are unchanged. Note that the weights of some adjacent patches also decrease together with the central patch, due to the overlapping pixels between two adjacent patches. In practice, the local vector encoded from an obscured patch is given a small weight, which effectively diminishes the influence of that obscured patch on facial expression recognition. In short, the experimental results illustrate that the proposed method equipped with the local attention is more robust for complex facial expression databases in practice.
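
The remark about adjacent patches can be illustrated with simple geometry. The sketch below assumes the same hypothetical layout as the setup (a 4×4 grid of 48×48 patches with a 16-pixel overlap, row-major indexing from 0) and assumes that the occlusion covers exactly one patch; the actual occlusion masks used for Fig.13 are not specified in the text.

```python
def patch_box(index, grid=4, patch=48, overlap=16):
    """Pixel box (top, left, bottom, right) of a patch in row-major order."""
    stride = patch - overlap
    r, c = divmod(index, grid)
    return r * stride, c * stride, r * stride + patch, c * stride + patch

def overlap_area(a, b):
    """Number of pixels shared by two boxes."""
    h = min(a[2], b[2]) - max(a[0], b[0])
    w = min(a[3], b[3]) - max(a[1], b[1])
    return max(h, 0) * max(w, 0)

# Occlude the patch at 1-indexed (row 2, column 2), i.e. flat index 5.
obscured = patch_box(5)
shared = {i: overlap_area(obscured, patch_box(i)) for i in range(16)}
affected = sorted(i for i, area in shared.items() if area > 0 and i != 5)
```

Under these assumptions, the eight neighbors of the occluded patch each contain some occluded pixels (a 16×48 strip for edge neighbors and a 16×16 corner for diagonal ones), which is consistent with their weights decreasing slightly.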

IV-E Analyses for the parameter $\alpha$

In the non-local attention network, we formulate Eq.(1) to obtain the non-local feature vector $\mathbf{g}^{*}$ based on the global information of facial expression, where the parameter $\alpha$ is used to trade off the feature vectors $\mathbf{g}$ and $\mathbf{s}$. In the previous experiments, we set $\alpha=0.7$. In this part, we analyze the performance of the proposed method with different values of $\alpha$. The experimental setups are the same as in the above experiments except for $\alpha$, which is set to each value in {0, 0.1, 0.2, …, 0.9, 1}. Table V shows the accuracy under different $\alpha$ for the five datasets.

From Table V, it is seen that the accuracy first increases and then decreases as the value of $\alpha$ increases. According to Eq.(1), we get $\mathbf{g}^{*}=\mathbf{g}$ if $\alpha=0$ and $\mathbf{g}^{*}=\mathbf{s}$ if $\alpha=1$. Considering the network optimization, the back propagation in LNLAttenNet places no constraint on $\mathbf{s}$ when $\alpha=0$, which implies that the same effect (or feedback) is given to the non-local attention and each component of the non-local weights $\mathbf{w}^{g}$ should be random in theory. On the contrary, $\alpha=1$ means that the back propagation places no constraint on the global vector $\mathbf{g}$, so the back propagation in LNLAttenNet carries no global information and may lead to an extreme result. Indeed, as shown in Fig.14, the obtained weights ($\mathbf{w}^{g}$) tend to be random under a small $\alpha$ and equal under a large $\alpha$, which agrees with the above analysis of the effect of $\alpha$.

TABLE V: Accuracy rates (%) given by the proposed method with different $\alpha$.

| $\alpha$ | 0 | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 | 1.0 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| RAF | 84.09 | 85.60 | 85.69 | 86.15 | 85.59 | 85.33 | 85.17 | 85.23 | 83.74 | 83.54 | 83.02 |
| SFEW | 55.06 | 55.73 | 56.88 | 57.80 | 57.34 | 57.11 | 56.65 | 56.88 | 55.96 | 54.59 | 53.67 |
| CK+ | 96.02 | 96.75 | 97.56 | 98.18 | 98.30 | 97.74 | 97.36 | 96.60 | 96.22 | 96.04 | 95.28 |
| MMI | 67.00 | 67.45 | 68.50 | 68.75 | 68.88 | 68.25 | 67.50 | 67.38 | 66.93 | 66.50 | 66.25 |
| AffectNet | 57.94 | 58.71 | 59.43 | 59.28 | 58.03 | 57.80 | 56.83 | 56.86 | 56.71 | 56.66 | 56.63 |

TABLE VI: Accuracy (%) of the proposed method with different numbers ($M$) of patches.

| $M$ | 4 | 9 | 16 | 25 | 36 |
|---|---|---|---|---|---|
| RAF | 84.97 | 85.66 | 86.15 | 85.53 | 85.63 |
| SFEW | 55.28 | 56.88 | 57.80 | 58.03 | 57.80 |
| CK+ | 96.22 | 97.17 | 98.18 | 97.92 | 97.74 |
| MMI | 67.60 | 67.90 | 68.75 | 68.83 | 67.13 |
| AffectNet | 58.06 | 58.43 | 59.28 | 59.06 | 57.97 |

IV-F Analyses for different M

In our method, multiple individual networks are generated based on facial local regions, and the previous experiments are implemented with the number of local patches $M=16$. Therefore, we also analyze the effect of the number ($M$) of local patches on the five datasets. In this experiment, $M$ is set as 4, 9, 16, 25 and 36, respectively. Table VI shows the accuracy rates with different $M$. The size of the input image is 144×144 and the size of the overlapping pixels between adjacent patches is around a third of the size of each patch, which is computed by

$$n \cdot P_{size} - (n-1) \cdot \gamma \cdot P_{size} = 144, \tag{5}$$

where $\gamma$ is around $1/3$, $n^{2}=M$ and $P_{size}$ is the size of each patch. Note that the parameters of our network except $M$ are set the same as in the previous experiments.
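
Solving Eq. (5) for $P_{size}$ gives $P_{size} = 144 / (n - (n-1)\gamma)$. The sketch below evaluates this for the tested values of $M$, assuming $\gamma$ is exactly $1/3$; since the text only says that $\gamma$ is around $1/3$, the non-integer results should be read as approximate patch sizes, not as the values actually used. The function name `patch_size` is hypothetical.

```python
import math

def patch_size(m, image_size=144, gamma=1 / 3):
    """Solve n*P - (n-1)*gamma*P = image_size for P, with n*n = m patches."""
    n = math.isqrt(m)
    if n * n != m:
        raise ValueError("m must be a perfect square")
    return image_size / (n - (n - 1) * gamma)

# Approximate patch sizes (in pixels) for the values of M in Table VI.
sizes = {m: round(patch_size(m), 2) for m in (4, 9, 16, 25, 36)}
```

For $M=16$ this yields a patch size of 48 pixels and hence an overlap of 16 pixels, which agrees with the setup described in the experimental settings.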

From Table VI, it is observed that the performance with more local regions is generally superior to that with fewer local regions. It implies that, when $M$ is set as a small value, each local region is too large to provide multiple diverse pieces of local information. However, it is also noticed that the computational complexity increases when $M$ is set as a high value, and thus we finally set $M=16$ in most experiments.

IV-G Analyses for Overlapped Pixels between Local Regions

In the previous experiments, $1/3$ of the pixels in each patch are used as the overlapping pixels between two neighboring patches. This is an appropriate value, since the pixels of a middle patch that overlap with the patches on both sides amount to only $2/3$, and the information of the $1/3$ of the pixels at the center of the patch is still retained. If a larger proportion of overlapping pixels is employed, such as $1/2$, the middle patch will completely overlap with the patches on both sides. If a smaller proportion is used, such as $1/4$, the number of pixels in the overlapping region will be too small to solve the problem of regional connectivity. In order to analyze the influence of overlapping pixels between two patches, an experiment with the other settings unchanged is conducted on the RAF-DB dataset, and the result is shown in Table VII, which lists the accuracies obtained by the proposed method with different numbers ($N$) of overlapping pixels. From the results, it is seen that the performance on the test set increases slowly to a plateau as the number of overlapping pixels increases. It illustrates that the more the overlapping pixels are, the larger the number of network parameters is. According to our analyses, the main reason is that it is easier to introduce redundant information between adjacent patches when the number of overlapping pixels is larger.
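
The argument about the overlap proportion can be checked with exact fractions. The sketch below considers a middle patch along one axis only, which is a simplifying assumption, and computes the share of its width that is not covered by either neighbor; the function name `retained_center` is hypothetical.

```python
from fractions import Fraction

def retained_center(overlap_fraction):
    """Share of a middle patch's width that is not shared with either neighbor."""
    return max(Fraction(1) - 2 * overlap_fraction, Fraction(0))

# Overlap proportions discussed in the text: 1/4, 1/3 and 1/2.
center = {str(f): retained_center(f)
          for f in (Fraction(1, 4), Fraction(1, 3), Fraction(1, 2))}
```

With an overlap of $1/3$ a third of the patch remains exclusive to it, with $1/2$ nothing remains, and with $1/4$ half of the patch remains but the shared strips are narrower.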

TABLE VII: Accuracy (%) of the proposed method with different numbers ($N$) of overlapping pixels.

| $N$ | 4 | 8 | 12 | 16 | 20 | 24 |
|---|---|---|---|---|---|---|
| RAF | 84.63 | 84.96 | 85.29 | 86.15 | 86.16 | 86.24 |

V Conclusion

In this paper, we propose the LNLAttenNet method to effectively explore the significance of facial crucial regions in feature learning for FER, without any landmark information. In LNLAttenNet, the global information of the facial expression is utilized to construct the non-local attention network, while the local information is utilized to supervise self-information. Through the joint optimization of facial non-local and local feature vectors, LNLAttenNet can adaptively enhance more crucial regions in the process of deep feature learning. Specifically, an ensemble of multiple networks corresponding to local regions is constructed to integrate the local features with the non-local weights, which achieves the interactive guidance between the facial global and local information. Experimental results also demonstrate that some local crucial regions can be effectively enhanced in feature learning by LNLAttenNet even though no landmark information is given when training the model. Moreover, the proposed method focuses on enhancing facial crucial regions in FER without any landmark information based on multiple patches, and thus we will explore it from the view of pixels for facial expressions in future work.

References

  • [1] C. Darwin and P. Prodger, The expression of the emotions in man and animals.   Oxford University Press, USA, 1998.
  • [2] R. Buck, R. E. Miller, and W. F. Caul, “Sex, personality, and physiological variables in the communication of affect via facial expression.” Journal of personality and social psychology, vol. 30, no. 4, p. 587, 1974.
  • [3] M. C. Smith, M. K. Smith, and H. Ellgring, “Spontaneous and posed facial expression in parkinson’s disease,” Journal of the International Neuropsychological Society, vol. 2, no. 5, pp. 383–391, 1996.
  • [4] C. A. Corneanu, M. O. Simón, J. F. Cohn, and S. E. Guerrero, “Survey on rgb, 3d, thermal, and multimodal approaches for facial expression recognition: History, trends, and affect-related applications,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 8, pp. 1548–1568, 2016.
  • [5] A. Majumder, L. Behera, and V. K. Subramanian, “Automatic facial expression recognition system using deep network-based data fusion,” IEEE Transactions on Cybernetics, vol. 48, no. 1, pp. 103–114, 2018.
  • [6] W. Xie, L. Shen, and J. Duan, “Adaptive weighting of handcrafted feature losses for facial expression recognition,” IEEE Transactions on Cybernetics, 2019.
  • [7] R. Ekman, What the face reveals: Basic and applied studies of spontaneous expression using the Facial Action Coding System (FACS).   Oxford University Press, USA, 1997.
  • [8] P. Ekman, W. V. Friesen, and J. C. Hager, “Facial action coding system: The manual on cd rom,” A Human Face, Salt Lake City, pp. 77–254, 2002.
  • [9] S. Wang, G. Peng, S. Chen, and Q. Ji, “Weakly supervised facial action unit recognition with domain knowledge,” IEEE Transactions on Cybernetics, vol. 48, no. 11, pp. 3265–3276, 2018.
  • [10] H. K. Ekenel and R. Stiefelhagen, “Why is facial occlusion a challenging problem?” in International Conference on Biometrics.   Springer, 2009, pp. 299–308.
  • [11] L. Zhong, Q. Liu, P. Yang, J. Huang, and D. N. Metaxas, “Learning multiscale active facial patches for expression analysis,” IEEE Transactions on Cybernetics, vol. 45, no. 8, pp. 1499–1510, 2014.
  • [12] Y. Fan, J. C. Lam, and V. O. Li, “Multi-region ensemble convolutional neural network for facial expression recognition,” in Proceedings of International Conference on Artificial Neural Networks.   Springer, 2018, pp. 84–94.
  • [13] Y. Li, J. Zeng, S. Shan, and X. Chen, “Occlusion aware facial expression recognition using cnn with attention mechanism,” IEEE Transactions on Image Processing, vol. 28, no. 5, pp. 2439–2450, 2018.
  • [14] S. L. Happy and A. Routray, “Automatic facial expression recognition using features of salient facial patches,” IEEE Transactions on Affective Computing, vol. 6, no. 1, pp. 1–12, Jan 2015.
  • [15] K. Wang, X. Peng, J. Yang, D. Meng, and Y. Qiao, “Region attention networks for pose and occlusion robust facial expression recognition,” IEEE Transactions on Image Processing, vol. 29, pp. 4057–4069, 2020.
  • [16] M. Liu, S. Shan, R. Wang, and X. Chen, “Learning expressionlets on spatio-temporal manifold for dynamic facial expression recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2014.
  • [17] C. Shan, S. Gong, and P. W. McOwan, “Facial expression recognition based on local binary patterns: A comprehensive study,” Image and Vision Computing, vol. 27, no. 6, pp. 803–816, 2009.
  • [18] I. Kotsia and I. Pitas, “Facial expression recognition in image sequences using geometric deformation features and support vector machines,” IEEE Transactions on Image Processing, vol. 16, no. 1, pp. 172–187, 2006.
  • [19] P. Lucey, J. F. Cohn, T. Kanade, J. Saragih, Z. Ambadar, and I. Matthews, “The extended Cohn-Kanade dataset (CK+): A complete dataset for action unit and emotion-specified expression,” in 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Workshops, June 2010, pp. 94–101.
  • [20] M. Pantic, M. Valstar, R. Rademaker, and L. Maat, “Web-based database for facial expression analysis,” in 2005 IEEE International Conference on Multimedia and Expo, July 2005, 5 pp.
  • [21] M. J. Lyons, S. Akamatsu, M. Kamachi, J. Gyoba, and J. Budynek, “The Japanese female facial expression (JAFFE) database,” in Proceedings of the Third International Conference on Automatic Face and Gesture Recognition, 1998, pp. 14–16.
  • [22] G. Zhao, X. Huang, M. Taini, S. Z. Li, and M. Pietikäinen, “Facial expression recognition from near-infrared videos,” Image and Vision Computing, vol. 29, no. 9, pp. 607–619, 2011.
  • [23] S. Li and W. Deng, “Deep facial expression recognition: A survey,” IEEE Transactions on Affective Computing, 2020.
  • [24] S. Li, W. Deng, and J. Du, “Reliable crowdsourcing and deep locality-preserving learning for expression recognition in the wild,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
  • [25] A. Mollahosseini, B. Hasani, and M. H. Mahoor, “AffectNet: A database for facial expression, valence, and arousal computing in the wild,” IEEE Transactions on Affective Computing, vol. 10, no. 1, pp. 18–31, Jan 2019.
  • [26] C. Fabian Benitez-Quiroz, R. Srinivasan, and A. M. Martinez, “EmotioNet: An accurate, real-time algorithm for the automatic annotation of a million facial expressions in the wild,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 5562–5570.
  • [27] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
  • [28] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
  • [29] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
  • [30] P. Hu, D. Cai, S. Wang, A. Yao, and Y. Chen, “Learning supervised scoring ensemble for emotion recognition in the wild,” in Proceedings of the 19th ACM International Conference on Multimodal Interaction.   ACM, 2017, pp. 553–560.
  • [31] D. Acharya, Z. Huang, D. Pani Paudel, and L. Van Gool, “Covariance pooling for facial expression recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2018.
  • [32] S. Li and W. Deng, “Blended emotion in-the-wild: Multi-label facial expression recognition using crowdsourced annotations and deep locality feature learning,” International Journal of Computer Vision, vol. 127, no. 6-7, pp. 884–906, 2019.
  • [33] H. Yang and L. Yin, “CNN based 3D facial expression recognition using masking and landmark features,” 2017 Seventh International Conference on Affective Computing and Intelligent Interaction (ACII), pp. 556–560, 2017.
  • [34] W. Wu, Y. Yin, Y. Wang, X. Wang, and D. Xu, “Facial expression recognition for different pose faces based on special landmark detection,” 2018 24th International Conference on Pattern Recognition (ICPR), pp. 1524–1529, 2018.
  • [35] J. Zeng, S. Shan, and X. Chen, “Facial expression recognition with inconsistently annotated datasets,” in Proceedings of the European Conference on Computer Vision (ECCV), September 2018.
  • [36] Y. Gan, J. Chen, and L. Xu, “Facial expression recognition boosted by soft label with a diverse ensemble,” Pattern Recognition Letters, vol. 125, pp. 105–112, 2019.
  • [37] H. Yang, U. Ciftci, and L. Yin, “Facial expression recognition by de-expression residue learning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • [38] F. Zhang, T. Zhang, Q. Mao, and C. Xu, “Joint pose and expression modeling for facial expression recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • [39] S. Zhao, C. Lin, P. Xu, S. Zhao, Y. Guo, R. Krishna, G. Ding, and K. Keutzer, “CycleEmotionGAN: Emotional semantic consistency preserved CycleGAN for adapting image emotions,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, no. 01, 2019, pp. 2620–2627.
  • [40] L. Fan, W. Huang, C. Gan, J. Huang, and B. Gong, “Controllable image-to-video translation: A case study on facial expression generation,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, no. 01, 2019, pp. 3510–3517.
  • [41] R. Wu, G. Zhang, S. Lu, and T. Chen, “Cascade EF-GAN: Progressive facial expression editing with local focuses,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
  • [42] A. Pumarola, A. Agudo, A. M. Martinez, A. Sanfeliu, and F. Moreno-Noguer, “GANimation: Anatomically-aware facial animation from a single image,” in Proceedings of the European Conference on Computer Vision (ECCV), September 2018.
  • [43] Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, and J. Choo, “StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • [44] P. Barros, G. I. Parisi, C. Weber, and S. Wermter, “Emotion-modulated attention improves expression recognition: A deep learning model,” Neurocomputing, vol. 253, pp. 104–114, 2017.
  • [45] X. Wang, M. Peng, L. Pan, M. Hu, C. Jin, and F. Ren, “Two-level attention with two-stage multi-task learning for facial emotion recognition,” arXiv preprint arXiv:1811.12139, 2018.
  • [46] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical Image Computing and Computer-Assisted Intervention.   Springer, 2015, pp. 234–241.
  • [47] T. Falk, D. Mai, R. Bensch, Ö. Çiçek, A. Abdulkadir, Y. Marrakchi, A. Böhm, J. Deubner, Z. Jäckel, K. Seiwald et al., “U-Net: Deep learning for cell counting, detection, and morphometry,” Nature Methods, vol. 16, no. 1, p. 67, 2019.
  • [48] Z. Zhang, Q. Liu, and Y. Wang, “Road extraction by deep residual U-Net,” IEEE Geoscience and Remote Sensing Letters, vol. 15, no. 5, pp. 749–753, May 2018.
  • [49] T.-Y. Lin, P. Dollar, R. Girshick, K. He, B. Hariharan, and S. Belongie, “Feature pyramid networks for object detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
  • [50] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
  • [51] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds.   Curran Associates, Inc., 2017, pp. 5998–6008. [Online]. Available: http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf
  • [52] X. Wang, R. Girshick, A. Gupta, and K. He, “Non-local neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • [53] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid scene parsing network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
  • [54] A. Dhall, R. Goecke, S. Lucey, and T. Gedeon, “Static facial expression analysis in tough conditions: Data, evaluation protocol and benchmark,” in 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops), Nov 2011, pp. 2106–2112.
  • [55] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics.   JMLR Workshop and Conference Proceedings, 2010, pp. 249–256.
  • [56] J. Goldberger and E. Ben-Reuven, “Training deep neural-networks using a noise adaptation layer,” in Proceedings of the International Conference on Learning Representations (ICLR), 2017.
  • [57] Y. Wen, K. Zhang, Z. Li, and Y. Qiao, “A discriminative feature learning approach for deep face recognition,” in European Conference on Computer Vision.   Springer, 2016, pp. 499–515.
  • [58] S. Chen, J. Wang, Y. Chen, Z. Shi, X. Geng, and Y. Rui, “Label distribution learning on auxiliary label space graphs for facial expression recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 4321–4330.
Shasha Mao (M’14) received the Ph.D. degree in circuit and system from the Key Lab of Intelligent Perception and Image Understanding of Ministry of Education, Xidian University, Xi’an, China, in 2014. From 2014 to 2018, she worked as a Research Fellow at Nanyang Technological University and Singapore University of Technology and Design, Singapore, respectively. She is currently an Associate Professor at the School of Artificial Intelligence, Xidian University. Her research interests include ensemble learning, deep learning, imbalanced learning, facial expression recognition, and SAR image registration.
Guanghui Shi received the B.S. degree in Electronics and Information Engineering from Wuhan University of Technology in 2018 and the M.S. degree in Electronics and Communication Engineering from Xidian University, Xi’an, China, in 2021. He is currently working at the 701 Research Institute. His research interests include machine learning, deep learning, and facial expression recognition.
Shuiping Gou (M’08) received the B.S. and M.S. degrees in computer science and technology from Xidian University, Xi’an, China, in 2000 and 2003, respectively, and the Ph.D. degree in pattern recognition and intelligent system from Xidian University, in 2008. She is currently a Professor with the Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education of China, School of Artificial Intelligence, Xidian University. Her research interests include machine learning, data mining, remote sensing image analysis and medical image analysis.
Dandan Yan received the B.S. degree in Computer Science and Technology from Xi’an University of Technology in July 2021. She is currently a student at the School of Artificial Intelligence, Xidian University. Her research interests include deep learning, facial expression recognition and label distribution learning.
Licheng Jiao (SM’89–F’17) received the B.S. degree in electronic engineering from Shanghai Jiao Tong University, Shanghai, China, in 1982, and the M.S. and Ph.D. degrees in electronic engineering from Xi’an Jiaotong University, Xi’an, China, in 1984 and 1990, respectively. From 1990 to 1991, he was a Post-Doctoral Fellow with the National Key Laboratory for Radar Signal Processing, Xidian University, Xi’an. Since 1992, he has been a Professor with the School of Electronic Engineering, Xidian University. Currently, he is a Professor with the School of Artificial Intelligence, Xidian University, and he is also the Director of the Key Laboratory of Intelligent Perception and Image Understanding, Ministry of Education of China, Xidian University. He is in charge of about 40 important scientific research projects. He has authored or co-authored more than 20 monographs and 100 papers in international journals and conferences. His research interests include image processing, natural computation, machine learning, and intelligent information processing. Prof. Jiao is a member of the IEEE Xi’an Section Execution Committee, the Chairman of the Awards and Recognition Committee, the Vice Board Chairperson of the Chinese Association of Artificial Intelligence, the Councilor of the Chinese Institute of Electronics, the Committee Member of the Chinese Committee of Neural Networks, and an Expert of the Academic Degrees Committee of the State Council.
Lin Xiong received the Ph.D. degree in pattern recognition and intelligent system from the Key Lab of Intelligent Perception and Image Understanding of Ministry of Education, Xidian University, Xi’an, China, in 2015. He currently works as a research scientist at JD Finance America Corporation. Previously, from 2015 to 2018, he was a senior research engineer in Learning & Vision, Core Technology Group, Panasonic R&D Center Singapore (PRDCSG). His research interests include distributed model parallelism, unconstrained/large-scale face recognition, deep learning architecture engineering, person re-identification, Riemannian manifold optimization, and sparse and low-rank matrix factorization.