Adaptively Enhancing Facial Expression Crucial Regions via Local Non-Local Joint Network

Guanghui Shi, Shasha Mao, Shui** Gou, Dandan Yan, Licheng Jiao, and Lin Xiong

Manuscript received by Machine Intelligence Research. The paper can be accessed via https://link.springer.com/article/10.1007/s11633-023-1417-9 (DOI: 10.1007/s11633-023-1417-9).

Abstract

Facial expression recognition (FER) remains a challenging research problem because of the small inter-class discrepancy in facial expression data. Given the importance of facial crucial regions for FER, many existing studies use prior information from annotated crucial points to improve performance. However, manually annotating facial crucial points is complicated and time-consuming, especially for large collections of in-the-wild expression images. For this reason, this paper proposes a local non-local joint network that adaptively lights up the facial crucial regions during feature learning for FER. The proposed method consists of two parts built on facial local and non-local information, respectively: an ensemble of multiple local networks extracts local features corresponding to multiple facial local regions, and a non-local attention network explores the significance of each local region. In particular, the attention weights obtained by the non-local network are fed into the local part to achieve interactive feedback between the facial global and local information. Interestingly, the non-local weights corresponding to local regions are gradually updated, and higher weights are given to more crucial regions. Moreover, U-Net is employed to extract features that integrate the deep semantic information and the low-level detail information of expression images. Experimental results on five benchmark datasets show that the proposed method achieves competitive performance compared with several state-of-the-art methods. Notably, analyses of the non-local weights corresponding to local regions demonstrate that the proposed method can automatically enhance some crucial regions during feature learning without any facial landmark information.

Index Terms:

Facial Expression Recognition, Deep Neural Network, Multiple Networks Ensemble, Attention Network.

I Introduction

Emotion is a complex state that integrates people’s feelings, thoughts and behaviors [1], and facial expression is one of the most direct signals for communicating a person’s innermost thoughts. Therefore, facial expression recognition (FER) [2, 3, 4, 5, 6] has attracted the attention of many researchers because of its important role in many practical application fields, such as human-computer interaction, recommendation systems and patient monitoring. In general, a facial expression is encoded into facial action units through the facial action coding system [7, 8, 9], and any expression can be described by a set of facial action units. Some facial action units are known to be crucial for FER [10], such as those located in the regions around the eyes and the mouth, since these regions show more obvious movements than other facial regions (such as the cheek and forehead). In the following, we refer to these crucial facial action units as facial crucial regions (FCRs). Fig.1 illustrates the facial crucial regions of two facial images (ID1 and ID2) for each of six expressions. Fig.1 shows that the FCRs are more discriminative for determining the expression category of a facial image [11].

Figure 1: An illustration of facial crucial regions from six expressions, where two facial images (ID1 and ID2) are shown for each expression. The regions around eyes and mouths are cropped as examples of FCRs in the purple box and the green box, respectively.

Given the significance of FCRs, many studies [12, 13, 14, 15] exploit information from facial local regions, using facial landmarks as prior information about the facial crucial regions; these landmarks are obtained by manually annotating the facial expression images. Early FER research [16, 17, 18] mostly focused on lab-collected expression datasets, such as CK+ [19], MMI [20], JAFFE [21] and Oulu-CASIA [22]. In lab-collected datasets, facial expression images are collected from several or dozens of individuals under similar conditions (such as illumination, angle and posture), generally with few uncontrollable factors. Thus, manually annotating the landmarks of FCRs is relatively easy for lab-collected datasets.

However, compared with lab-controlled datasets, wild expression datasets [23], such as RAF-DB [24], AffectNet [25] and EmotionNet [26], are collected under more complex and uncontrollable conditions. For wild expression datasets, especially those containing a vast number of images, manually annotating FCRs is very complicated and time-consuming. Moreover, the postures of different faces vary greatly in wild datasets, and even a simple change of facial posture can cause deviations of multiple pixels at the image level. Fig.2 gives an example of landmarks moving with a change of posture, where two expression images and their landmarks are taken from the RAF-DB dataset [24]. In Fig.2, the 68 landmark points of image (a) clearly differ from those of image (b), and the landmarks shift greatly from (a) to (b), as shown in (c). This implies that the positions of FCRs vary with facial posture, which inevitably increases the complexity of manually annotating landmarks for FER, especially for wild datasets with a vast number of images. It is therefore worth considering whether the significance of FCRs, or of their features, can be spontaneously enhanced during the training of deep FER without any prior information such as landmarks of FCRs.

On the other hand, some FCRs from different expression categories are similar, whereas some FCRs from the same category are very different. In Fig.1, the mouth regions of ID1 are similar across the six expressions, with the mouth open, and are entirely different from those of ID2, where the mouth is closed. Similarly, for the eye regions, ID1 and ID2 in the category Fear are different, whereas ID1 in the category Surprise and ID2 in the category Anger are similar. This illustrates that FCRs of expression images belonging to the same category may be very different, while FCRs from different categories may be similar. Clearly, local information alone is insufficient to construct an effective FER model, especially for wild datasets. Hence, it is still important to use the global information of the facial expression while FCRs are enhanced in deep facial expression recognition.

Based on the above analyses, we propose a new facial expression recognition method in this paper, which constructs a local non-local joint network, abbreviated as LNLAttenNet, to adaptively enhance the facial crucial regions during deep feature learning. In LNLAttenNet, the local and non-local information of facial expressions is considered simultaneously to construct two parts of the network: a local multi-network ensemble and a non-local attention network. The generated local and non-local feature vectors are then integrated and jointly optimized in feature learning. In particular, the attention weights obtained by the non-local part are regarded as the significance of the facial local regions and are fed into the local multi-network ensemble to combine the multiple local networks. Interestingly, we find that some facial crucial regions are automatically enhanced during deep feature learning by the proposed method. Moreover, U-Net is employed to generate feature maps in which each pixel has a large receptive field, so that each local region also contains global information. Fig.3 shows a simple view of LNLAttenNet. In Fig.3, some crucial regions are given higher weights by LNLAttenNet, such as the 5th patch around the left eye (0.1123) and the 10th, 11th and 14th patches around the mouth (0.0887, 0.1073 and 0.1298), which illustrates that some crucial regions are effectively enhanced by LNLAttenNet. Note that $w_{i}$ is the non-local attention weight corresponding to the $i^{th}$ local region and that the initial weights are equal. More detailed descriptions are given in the following sections.

Compared with state-of-the-art methods, our contributions are mainly threefold:

• We propose LNLAttenNet to automatically light up facial crucial regions in deep feature learning by using the local and non-local information of facial expressions simultaneously. To the best of our knowledge, this is the first work to study whether FCRs can be directly explored and enhanced in the feature learning of deep FER, where FCRs are automatically enhanced without any prior information about facial crucial regions or points. It effectively alleviates the difficulty of annotating wild datasets that contain a vast number of facial images.

• In LNLAttenNet, an attention mechanism is introduced to construct the non-local attention network, which explores the significance of local regions for FER from a global perspective of the facial expression. The obtained attention weights corresponding to local regions are fed into the local multi-network ensemble to integrate multiple local features, and the integrated features obtained by the multiple local networks are then jointly optimized with the facial global feature.

• Experimental results demonstrate that FCRs can be enhanced in deep feature learning by LNLAttenNet, which confirms that FCRs are indeed the more discriminative local regions for FER. Moreover, the results imply that a deep FER model can spontaneously focus on some crucial regions during training, which may offer new inspiration for designing deep FER methods.

The rest of this manuscript is organized as follows. Section II reviews related work on deep facial expression recognition. Section III describes the proposed method in detail. Section IV presents experimental results and analyses that validate the performance of the proposed method. Finally, Section V concludes the paper and discusses prospects for future work.

II Related Works

Due to the excellent performance of deep learning, various deep networks have been applied to FER [23], such as VggNet [27], InceptionNet [28] and ResNet [29]. Building on these networks, many deep FER methods have been proposed to address different problems. In [30], Hu et al. first extended the idea of deep supervision to FER in the wild. The training of deep CNNs was made smoother and easier by supervising not only the deep layers but also the intermediate and shallow layers, and a fusion structure was constructed in which the preceding features were used for the second-level supervision. In [31], Acharya et al. argued that second-order statistics (such as covariance) are more suitable for capturing the features of distorted facial expressions. In their framework, a manifold structure was constructed for covariance pooling, which obtained competitive FER performance. In [32], Li et al. proposed a new deep manifold strategy for multi-label expressions; their network focused on ambiguous expressions and could learn discriminative features suitable for cross-database FER.

Considering that facial expression is determined by key regions, Fan et al. [12] used facial landmark points to select three sub-images around the eyes, mouth and nose. The three sub-images were then encoded by three sub-networks, and the last pooling layers of the sub-networks were concatenated, which yielded better recognition performance than competing methods. In [33, 34], facial landmark information was used to extract features and generate masks at specific locations in order to remove pose variation.

In [35], it was noted that labeling errors and deviations between different databases are inevitable because of the subjectivity of labeling facial expressions. Therefore, when existing methods use multiple databases to expand the training set, their performance cannot be continuously improved. To resolve this inconsistency between databases, the Inconsistent Pseudo Annotations to Latent Truth (IPA2LT) framework was proposed to train a model from multiple inconsistently labeled databases and large-scale unlabeled images. IPA2LT essentially constructs an ensemble at the label level: each image has as many labels as there are data sources, of which only one is original and the others are pseudo labels. Existing FER methods are almost satisfactory on frontal faces but fail to perform well on partially occluded faces collected in the wild. In addition, some facial expressions are ambiguous and carry multiple labels. In [36], Gan et al. proposed a CNN-based framework supervised by soft labels, where hard labels are used to construct soft labels through a novel label-level perturbation. In this framework, soft labels were obtained to eliminate the similarity between faces of different emotions, and multiple base classifiers were trained and then combined. Moreover, some GAN-based methods generate expression images for FER [37, 38, 39], while others focus only on generating new facial expression images [40, 41, 42, 43]. In [37], facial expressions are learned by extracting the expressive component through a de-expression procedure, in which a trained generative model produces the corresponding neutral face for a given facial image with an arbitrary expression. In [40], a user-controllable approach is proposed to generate video clips of various lengths from a single face image, where the lengths and types of the expressions are controlled by users.

In [13], Li et al. proposed a CNN with an attention mechanism (ACNN) that detects occluded facial regions and attends to the most discriminative regions, where ACNN used 24 facial landmark points to select the key regions at the feature level. In [44], Barros et al. investigated emotion-driven attention mechanisms for videos. In [45], Wang et al. proposed a two-level attention mechanism to extract emotion-related features, which was based on global information and did not involve local regions. As in [44, 45, 13], an attention mechanism is used in this work, but the essence of the algorithm is very different. Our purpose is to adaptively enhance the significance of facial crucial regions during feature learning, using attention weights that the non-local attention network assigns to each of multiple local regions.

III Local Non-Local Joint Network for Facial Expression Recognition

In this paper, we propose a Local Non-Local Attention Joint Network for FER, named LNLAttenNet, to adaptively light up the more crucial local regions of a facial expression. The overall framework of LNLAttenNet is shown in Fig.4, where one facial expression image is used as the input instance of the network; its size is 144 $\times$ 144, the same as in our experiments.

In LNLAttenNet, U-Net is first employed to extract feature maps that integrate the deep semantic information and the low-level detail information of facial expression images. For facial expression datasets, when regional integration is carried out [12], the inter-class discrepancy is smaller and the intra-class discrepancy is larger, as shown in Fig.1. U-Net [46, 47, 48] is a top-down architecture with lateral connections that introduce details into high-level semantic feature maps. With this structure, local regions in the last few layers have been shown to possess a large receptive field and global information, which is important and useful for recognizing ambiguous objects [49, 50]. Therefore, U-Net helps alleviate the negative impact of regional integration. This does not mean that the proposed method is restricted to U-Net: any model with a similar structure, such as FPN [49], can be employed in the proposed method.

As shown in Fig.4, facial expression images are input to the proposed model. U-Net generates two different feature maps for the input image, taken from the last layer (Conv9-2) and the intermediate layer (Conv5-2) of U-Net, respectively. In the following, we use $\mathcal{F}_{5}$ and $\mathcal{F}_{9}$ to denote the feature maps from Conv5-2 and Conv9-2 of U-Net, respectively. The generated feature maps $\mathcal{F}_{5}$ and $\mathcal{F}_{9}$ are then used to construct the two parts of LNLAttenNet: the map $\mathcal{F}_{5}$ is the input of the non-local part (the Non-Local Attention Network), and the map $\mathcal{F}_{9}$ is the input of the local part (the Local Multi-Networks Ensemble System). In the local part, an ensemble of multiple networks is applied to generate and integrate multiple individual networks corresponding to different facial local regions. The non-local attention network produces an attention weight $w_{i}$ ( $i=1,...,M$ ) for the $i^{th}$ local region of the facial expression, and the vector $\bf{w}=[w_{1},...,w_{M}]^{T}$ is used as the weights of the multiple local networks to combine the $M$ local vectors and, at the same time, boost the significance of local regions during deep feature learning. Finally, the non-local attention network and the local ensemble network are jointly optimized by integrating local and non-local features in the three fully connected layers of LNLAttenNet. More detailed descriptions of the proposed method are given below.

III-A Non-Local Attention Network

For facial expression recognition, expression images exhibit a small inter-class discrepancy and a large intra-class discrepancy, as shown in Fig.1. Facial crucial regions, such as the regions around the mouth and eyes rather than the cheek, are therefore regarded as the more discriminative regions that determine the category of a facial expression. However, it is difficult to estimate which regions are more crucial without the help of manually annotated crucial points. For this reason, we construct the Non-Local Attention Network to automatically mine the more discriminative regions from the whole facial expression, as shown in the box with orange dotted lines in Fig.4.

In Fig.4, the feature map $\mathcal{F}_{5}$ (Conv5-2) generated by U-Net serves as the global information of the facial image and is used to construct the non-local attention network. Conv5-2 has the minimum resolution and the maximum receptive field, which means that $\mathcal{F}_{5}$ is not tied to any individual local patch but implicitly contains the relationships between local patches. This is useful for mining the more crucial regions based on the global information from the whole face.

III-A1 Global Attention

Inspired by [51, 52], we construct a non-local attention model based on three branches, as shown in Fig.5. The input is the map $\mathcal{F}_{5}$ , which contains the global information of the facial expression. From $\mathcal{F}_{5}$ , three feature maps $\mathcal{Q}$ , $\mathcal{K}$ and $\mathcal{V}$ are generated, each by one convolution layer and one pooling layer. Note that the three maps have a special resolution of $n*n$ in this model, where $M=n^{2}$ and $M$ is the number of cropped local regions. (This special resolution is chosen so that the correlation between patches can be calculated conveniently. For example, when the number of cropped local regions is set to 16 ( $M=16$ ) in our experiments, the special resolution is $4*4$ ( $n=4$ ), as shown in Fig.5.) Then, the maps $\mathcal{Q}$ and $\mathcal{K}$ are reshaped into $\bf{Q}^{*}$ and $\bf{K}^{*}$ , as shown in Fig.5, and multiplied to obtain a matrix $\bf R$ that reflects the correlation among local regions. Compared with [51, 52], the relevance between regions (patches) in LNLAttenNet is not as strong as that between frames in a video or words in a sentence, and thus $L_{1}$ normalization, rather than the softmax function, is adopted to limit the sum of each row of $\bf R$ to 1. Finally, a vector is calculated by averaging each column of the correlation matrix $\bf R$ , and it is regarded as the non-local attention weights $\bf{w}^{g}$ assigned to the $M$ local regions.

Furthermore, the map $\mathcal{V}$ is reshaped into ${\bf{V}}^{*}$ , and the feature vector ${\bf{s}}$ is obtained by multiplying ${\bf{V}}^{*}$ by the correlation matrix ${\bf{R}}$ , which is the self-attention form in [51, 52]. To make the matrix ${\bf{R}}$ reflect the correlation among local regions, ${\bf{s}}$ is flattened and added to the non-local vector ${\bf{g}}$ (shown in Fig.4). A weighting function trades off the two vectors ${\bf{g}}$ and ${\bf{s}}$ :

$${\bf{g}}^{*}=(1-\alpha)\cdot{\bf{g}}+\alpha\cdot \mathrm{flat}({\bf{s}}), \tag{1}$$

where ${\bf{g}}^{*}$ denotes the new non-local vector and $\alpha$ is a hyper-parameter that adjusts the proportion of ${\bf{s}}$ . The parameter $\alpha$ is analyzed in the experiments.
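The computation of the non-local weights and of Eq.(1) can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: random non-negative arrays stand in for the reshaped branch outputs $\bf{Q}^{*}$ , $\bf{K}^{*}$ and ${\bf{V}}^{*}$ ; the function name `non_local_attention`, the small `eps` guard and the use of absolute values in the $L_{1}$ normalization are illustrative choices; the shapes (16, 512) and (512, 16) follow Table I, and $\alpha=0.7$ follows the experimental setting.

```python
import numpy as np


def non_local_attention(q, k, v, g, alpha=0.7, eps=1e-12):
    """Hypothetical helper: non-local weights and the fused vector of Eq.(1).

    q, v: (M, C) reshaped branch outputs; k: (C, M); g: (M * C,) non-local vector.
    """
    r = q @ k                                              # (M, M) correlation among regions
    r = r / (np.abs(r).sum(axis=1, keepdims=True) + eps)   # L1-normalise every row
    w_g = r.mean(axis=0)                                   # average each column -> (M,)
    s = r @ v                                              # self-attention feature, (M, C)
    g_star = (1.0 - alpha) * g + alpha * s.reshape(-1)     # Eq.(1)
    return w_g, g_star


rng = np.random.default_rng(0)
M, C = 16, 512                       # 4*4 regions with 512 channels
q = rng.random((M, C))               # stand-ins for the conv + pooling outputs
k = rng.random((C, M))
v = rng.random((M, C))
g = rng.random(M * C)
w_g, g_star = non_local_attention(q, k, v, g)
```

With non-negative inputs, each row of `r` sums to 1, so the column averages in `w_g` also sum to 1 and can be read as a distribution of importance over the $M$ regions.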

III-B Local Multi-Networks Ensemble

The feature map $\mathcal{F}_{9}$ is employed as the input to construct the Local Multi-Networks Ensemble, as shown in Fig.4. The map $\mathcal{F}_{9}$ is used because each pixel in Conv9-2 has a large receptive field and rich semantic information, and $\mathcal{F}_{9}$ has the same resolution as the input image. In the Local Multi-Networks Ensemble, the feature map $\mathcal{F}_{9}$ is first divided into $M$ patches (covering different local regions) of the same dimension (set to 48*48*64 in our experiments). Then, the $M$ patches are used to train Simple Network, yielding $M$ individual networks $\{{\mathcal{IN}}_{1},...,{\mathcal{IN}}_{M}\}$ , respectively. (The basic structure of Simple Network is shown in Fig.7; it is composed of six convolution layers and three pooling layers.) In particular, a local attention mechanism is added to each individual network to enhance the feature vector of each local region. Finally, the $M$ local feature vectors are combined using the non-local attention weights obtained by the Non-Local Attention Network.
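The patch division can be sketched as a sliding window. This is a minimal sketch, assuming a 144*144*64 feature map, 48*48 windows and the overlap of about 16 pixels reported in the experimental setup, which gives a stride of 32 and a 4*4 grid; the function name `split_into_patches` and the row-major patch ordering are illustrative assumptions.

```python
import numpy as np


def split_into_patches(feature_map, patch=48, overlap=16):
    """Cut an (H, W, C) map into overlapping square patches, row by row."""
    stride = patch - overlap
    height, width, _ = feature_map.shape
    patches = []
    for top in range(0, height - patch + 1, stride):
        for left in range(0, width - patch + 1, stride):
            patches.append(feature_map[top:top + patch, left:left + patch, :])
    return np.stack(patches)


# A stand-in for F9 with distinct values so that overlaps can be inspected.
f9 = np.arange(144 * 144 * 64, dtype=np.float32).reshape(144, 144, 64)
patches = split_into_patches(f9)
print(patches.shape)  # (16, 48, 48, 64)
```

With these values, the last 16 columns of one patch coincide with the first 16 columns of its right-hand neighbour.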

III-B1 Local Attention

In practice, we find that the useful information in a patch decreases when parts of the patch are missing or occluded, which means that less attention should be given to such patches. For this reason, a local attention mechanism is adopted in each individual network to weaken the significance of uninformative regions. The local attention model consists of four convolution layers and two fully connected layers, and its structure is shown in Fig.6. Note that two of the convolution layers are not padded in order to reduce the computational complexity. The input of the local attention model is the output of the last pooling layer in Simple-Net, and its output is a single value between 0 and 1 obtained via the sigmoid function. This value is regarded as the local attention weight $w_{i}^{l}$ of each individual network and represents how much of the information in each local patch can flow to the next level. If a facial local region is occluded or missing, the information it contains for expression recognition is reduced, and the local attention weight is reduced accordingly to alleviate the effect of patches that include the occluded region. Furthermore, each weight is multiplied by the corresponding local vector to form the output feature of each local network. More visual illustrations can be found in the experiments.

$Q$			$K$			$V$
Operation	Activation	Output shape	Operation	Activation	Output shape	Operation	Activation	Output shape
Conv 1 $\times$ 1 s:1	ReLU	9*9*512	Conv 1 $\times$ 1 s:1	ReLU	9*9*512	Conv 1 $\times$ 1 s:1	ReLU	9*9*512
MaxPooling 2 $\times$ 2 s:2	-	4*4*512	MaxPooling 2 $\times$ 2 s:2	-	4*4*512	MaxPooling 2 $\times$ 2 s:2	-	4*4*512
Reshape	-	16*512	Reshape	-	512*16	Reshape	-	16*512

TABLE I: The structure of Non-Local attention.

III-B2 Combination of Multiple Local Networks

According to the non-local attention weights ${\bf{w}}^{g}$ and the local attention weights ${\bf{w}}^{l}$ , the local feature vectors given by $M$ individual networks $\{{\mathcal{IN}}_{1},...,{\mathcal{IN}}_{M}\}$ are aggregated by the formula

$${\bf{f}}_{en}=\sum_{i=1}^{M}w_{i}^{g}\cdot w_{i}^{l}\cdot{\bf{f}}_{i}, \tag{2}$$

where ${\bf f}_{en}$ denotes the ensemble feature vector, ${\bf{f}}_{i}$ denotes the feature vector given by ${\mathcal{IN}}_{i}$ for the $i^{th}$ local region, $w_{i}^{g}$ is the non-local attention weight of the $i^{th}$ local region, and $w_{i}^{l}$ is the local attention weight of the $i^{th}$ local region. The number $M$ of local patches is analyzed in the experiments.
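Eq.(2) is a weighted sum of the local vectors, which can be sketched as below. This is a minimal sketch, not the authors' code: the function name `combine_local_features` is hypothetical, the vector length 9216 assumes each ${\bf{f}}_{i}$ is the flattened 6*6*256 output of Simple Network, and the toy weights (uniform non-local weights, one local weight set to zero to mimic a fully occluded patch) are illustrative only.

```python
import numpy as np


def combine_local_features(local_feats, w_global, w_local):
    """Weighted sum of M local vectors, following Eq.(2).

    local_feats: (M, D); w_global and w_local: (M,).
    """
    local_feats = np.asarray(local_feats, dtype=float)
    weights = np.asarray(w_global, dtype=float) * np.asarray(w_local, dtype=float)
    return weights @ local_feats          # (D,) ensemble vector f_en


M, D = 16, 9216                            # D = 6 * 6 * 256 is an assumption
feats = np.ones((M, D))                    # toy local vectors
w_g = np.full(M, 1.0 / M)                  # equal non-local weights, as at initialisation
w_l = np.ones(M)
w_l[3] = 0.0                               # toy case: patch 3 contributes nothing
f_en = combine_local_features(feats, w_g, w_l)
```

In this toy case every entry of `f_en` equals 15/16, since one of the sixteen equally weighted patches is suppressed by its local attention weight.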

III-C Joint Optimization of LNLAttenNet

In Fig.4, the non-local feature vector ${\bf{g}}^{*}$ is produced by the non-local attention network, and the local vector ${\bf{f}}_{en}$ is obtained by the local multi-network ensemble. Inspired by [53], we consider the global information of an input image essential, and embedding U-Net gives each local patch a large receptive field and global information, which makes it easier to classify similar patches from facial expressions of different categories. Moreover, Conv5-2 is encoded into a global vector of 8192 dimensions by two convolution layers and one pooling layer. Then, the non-local vector ${\bf{g}}^{*}$ is concatenated with the local vector ${\bf{f}}_{en}$ to form the input feature of the first fully connected layer, and the two are jointly optimized; the dimension of the integrated feature vector is 17408, as shown in Fig.4. In LNLAttenNet, three fully connected layers are implemented, and the loss function is formulated as

$$L=loss_{entropy}+\gamma\cdot loss_{l2}, \tag{3}$$

where $loss_{entropy}$ expresses the cross entropy loss, $loss_{l2}$ is the l2 regularization loss, and $\gamma$ is the hyper-parameter controlling the balance between two losses. The cross entropy is calculated as:

$$loss_{entropy}=-\frac{1}{N}\sum_{n=0}^{N-1}\sum_{c=0}^{C-1}\mathbb{L}(l_{n}=c)\cdot \log(p_{n}^{c}), \tag{4}$$

where $C$ is the number of categories, $N$ is the number of input images, and $\mathbb{L}(\cdot)$ is the indicator function, which equals 1 when its condition holds and 0 otherwise. $p_{n}^{c}$ is the $c^{th}$ component of the output of the last softmax layer for the $n^{th}$ image, and $l_{n}$ is the label of the $n^{th}$ input image. The l2 regularization loss is computed by $loss_{l2}=\lambda\cdot{||W||}^{2}$ , where $W$ denotes the parameters of our model and $\lambda$ is set to 0.0001 in the following experiments.
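The total loss of Eq.(3) can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions: the cross entropy uses the conventional negative-log form, the value `gamma=1.0` is a placeholder because the text does not report $\gamma$ , `lam=1e-4` follows the stated $\lambda$ , and the function name `total_loss`, the `eps` guard and the toy inputs are illustrative.

```python
import numpy as np


def total_loss(probs, labels, params, gamma=1.0, lam=1e-4, eps=1e-12):
    """Cross entropy plus a weighted l2 penalty.

    probs: (N, C) softmax outputs; labels: (N,) integer labels;
    params: iterable of weight arrays. gamma is a placeholder value.
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=int)
    n = probs.shape[0]
    # The indicator keeps only the probability of the labelled class.
    picked = probs[np.arange(n), labels]
    loss_entropy = -np.mean(np.log(picked + eps))
    loss_l2 = lam * sum(float(np.sum(np.square(w))) for w in params)
    return loss_entropy + gamma * loss_l2


probs = np.array([[0.5, 0.5], [0.25, 0.75]])      # toy softmax outputs
labels = np.array([0, 1])                          # toy labels
params = [np.array([[1.0, -2.0], [0.0, 3.0]])]     # toy weight matrix
loss = total_loss(probs, labels, params)
```

Only the probability assigned to the labelled class contributes to the first term, which is what the indicator in Eq.(4) expresses.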

IV Experiments and Analyses

In this section, we validate the performance of the proposed method from several aspects: 1) the performance comparison with state-of-the-art methods on benchmark datasets, 2) the analyses of Non-Local Attention, 3) the visualization of Local Attention, 4) the effect of the parameter $\alpha$ , 5) the performance of LNLAttenNet with different $M$ , and 6) the analyses of overlapped pixels between local regions.

Operation	Activation	Output shape
Conv 3 $\times$ 3 s:1	ReLU	48*48*64
Conv 3 $\times$ 3 s:1	ReLU	48*48*64
MaxPooling 2 $\times$ 2 s:2	-	24*24*64
Conv 3 $\times$ 3 s:1	ReLU	24*24*128
Conv 3 $\times$ 3 s:1	ReLU	24*24*128
MaxPooling 2 $\times$ 2 s:2	-	12*12*128
Conv 3 $\times$ 3 s:1	ReLU	12*12*256
Conv 3 $\times$ 3 s:1	ReLU	12*12*256
MaxPooling 2 $\times$ 2 s:2	-	6*6*256

TABLE II: The structure of SimpleNet.

Operation	Activation	Output shape
Conv 3 $\times$ 3 s:1 No padding	ReLU	4*4*256
Conv 1 $\times$ 1 s:1	ReLU	4*4*128
Conv 3 $\times$ 3 s:1 No padding	ReLU	2*2*128
Conv 1 $\times$ 1 s:1	ReLU	2*2*64
Reshape	-	256
Full connect	-	64
Full connect	-	1
Sigmoid	-	1

TABLE III: The structure of local attention.
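The output shapes in Tables II and III, and the 17408-dimensional fused feature mentioned in Section III-C, can be cross-checked with a short shape walk. This is a minimal sketch, assuming stride-1 convolutions keep the spatial size when padded and shrink it by kernel minus one otherwise, that pooling uses no padding, and that the local vector is the flattened output of the last pooling layer of SimpleNet; the helper names are hypothetical.

```python
def conv_out(size, kernel, padded=True):
    """Spatial size after a stride-1 square convolution.

    Assumes 'same' padding when padded=True and no padding otherwise.
    """
    return size if padded else size - kernel + 1


def pool_out(size, kernel=2, stride=2):
    """Spatial size after max pooling without padding."""
    return (size - kernel) // stride + 1


# SimpleNet (Table II): three stages of [conv 3x3, conv 3x3, max-pool 2x2].
size = 48
for _ in range(3):
    size = conv_out(conv_out(size, 3), 3)
    size = pool_out(size)
simple_net_out = size                     # spatial side of the last pooling layer
local_dim = simple_net_out ** 2 * 256     # flattened local feature (assumed)

# Local attention (Table III): conv 3x3 (no padding), conv 1x1,
# conv 3x3 (no padding), conv 1x1, then reshape.
size = conv_out(simple_net_out, 3, padded=False)
size = conv_out(size, 1)
size = conv_out(size, 3, padded=False)
size = conv_out(size, 1)
attention_flat = size * size * 64         # length fed to the first fully connected layer

fused_dim = 8192 + local_dim              # non-local vector plus ensemble local vector
print(simple_net_out, attention_flat, local_dim, fused_dim)  # 6 256 9216 17408
```

Under these assumptions the walk reproduces the 6*6*256 output of SimpleNet, the 256-element reshape of the local attention model, and the reported total of 17408.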

IV-A Databases and Setups

In experiments, we employ five FER datasets to evaluate the performance of LNLAttenNet: RAF-DB [24], SFEW [54], AffectNet [25], CK+ [19] and MMI [20].

• RAF-DB contains 29,672 facial images downloaded from the Internet. For the RAF-DB dataset, the facial landmarks are manually annotated via crowdsourcing, and the images are labeled with basic or compound expressions. In experiments, we use the basic-expression subset, which includes 12,271 training and 3,068 testing images.

• SFEW contains static images selected from movie clips with spontaneous expressions, where labels are given for the training set and the validation set. Therefore, the 958 training images are used as the training set and the 436 validation images are used as the testing set in experiments.

• AffectNet contains 450,000 images in 10 categories, where each image is annotated by one volunteer. In experiments, we use 287,401 images with neutral and six basic emotions, where 283,901 images are selected as the training set and 3,500 images are selected from the validation set as the testing set.

• CK+ contains 593 sequences from 123 volunteers, where 309 sequences have been annotated with six basic emotions. The emotion in each sequence goes from neutral to peak and then to neutral again. We therefore select the first frame of each sequence, labeled as neutral, and the peak frame of each sequence, labeled with the target emotion, to generate 618 experimental images.

• MMI is recorded from 30 subjects with richly detailed annotations, and 398 images are generated by selecting the first frame of each sequence, labeled as neutral, and one peak frame of each sequence.

For the RAF-DB and SFEW datasets, the training sets are directly used to train the model and the testing sets are used to evaluate the performance. For the AffectNet dataset, the training set is used to train the model, and the validation set is used as the testing set, since annotated labels are not provided for the testing set of AffectNet [25]. For the CK+ and MMI datasets, we adopt the five-fold cross-validation scheme to evaluate the recognition performance, in order to make a fair comparison with other methods. Additionally, to compare fairly with the state-of-the-art FER methods, we initialize the parameters of U-Net with the Xavier initializer [55] rather than by pre-training. In experiments, the original images are resized to 144 $\times$ 144, and the training images are augmented by standard approaches, such as image flips and random cropping. The number $M$ of local regions is set to 16, each patch (local region) overlaps its adjacent patches by about 16 pixels, and the parameter $\alpha$ is set to 0.7 in Eq.(1). The number of epochs is set to 24, the initial learning rate is 0.0003, and the weight decay is set to 0.95 per epoch.
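The five-fold protocol used for CK+ and MMI can be sketched as below. This is a minimal sketch under an explicit assumption: it performs a plain random split over image indices, because the text does not state whether the folds are subject-independent; the function name `five_fold_splits` and the seed are illustrative.

```python
import random


def five_fold_splits(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]       # k nearly equal folds
    for i in range(k):
        test = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test


# 618 is the number of experimental images generated from CK+.
splits = list(five_fold_splits(618))
fold_sizes = [len(test) for _, test in splits]
```

Each image appears in exactly one test fold, and the reported accuracy would be the average over the five folds.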

Tables I, II and III give the structures of the non-local attention network, SimpleNet and the local attention model, respectively. For the non-local attention network, only the convolution layer and the pooling layer are shown; operations such as reshaping and matrix multiplication are not shown. All experiments are implemented in TensorFlow on a GTX 2080Ti GPU with 11 GB of memory.

IV-B Comparisons with State-of-the-Art Methods

To validate the performance of the proposed method, we first compare it with eight state-of-the-art methods on five datasets. The eight compared methods are VGG16 [27], DLP-CNN [24], NAL [56], Soft-CNN [36], CenterLoss [57], gACNN [13], LDL-ALSG [58] and IPA2LT [35], where VGG16 is used as the baseline method in experiments.

• DLP-CNN [24] decomposes the image structurally rather than spatially into regions (parts) which are discriminative for matching. According to the representations over the regions, it aggregates discriminative features for classification.
• NAL [56] utilizes a noise adaptation layer to address the problem of noise labels.
• Soft-CNN [36] fuses the latent label probability distribution predicted by the trained model to obtain soft labels with a novel label-level perturbation strategy.
• CenterLoss [57] minimizes the center loss calculated by the distance between each data and its corresponding class center to reduce the intra-class discrepancy.
• gACNN [13] uses 24 facial landmarks as the attention mechanism to conduct multi-region ensemble at the feature level.
• LDL-ALSG [58] considers the subjectivity of human annotators and the ambiguous expression labels and then leverages the topological information of the labels from related but more distinct tasks, such as AU recognition and facial landmark detection, to explore the label distribution of facial expressions.
• IPA2LT [35] employs an inconsistent pseudo annotations framework to solve the inconsistent annotations between different facial expression databases.

Notably, IPA2LT [35] uses both RAF and AffectNet as the training set, unlike our method (LNLAttenNet) and the other compared methods, where only the training set of one dataset is used to train a model. In LNLAttenNet, both non-local and local attention mechanisms are utilized. Thus, we also make a comparison with three special cases of our model: the model without local or non-local attention (Model-S), the model with only local attention (Model-Local), and the model with only non-local attention (Model-NonLocal). Table IV shows the experimental results of the 12 models, where the highest accuracy for each dataset is marked in bold. All results are the average of the last 10 epochs.

TABLE IV: Accuracy (%) of the proposed method (LNLAttenNet) compared with state-of-the-art methods.

Methods	AffectNet	RAF-DB	SFEW	CK+	MMI	average
VGG16[27]	51.11	80.96	54.45	90.37	63.21	68.02
DLP-CNN[24]	54.47	80.89	-	-	-	-
NAL[56]	55.97	84.22	58.13	91.20	64.71	70.85
Soft-CNN[36]	56.77	85.20	55.73	-	-	-
CenterLoss[57]	57.37	84.42	56.19	95.48	-	-
gACNN[13]	58.78	85.07	-	97.03	-	-
LDL-ALSG[58]	59.35	85.53	56.50	93.08	70.49	72.99
Model-S	56.26	83.80	54.82	94.14	63.52	70.51
Model-Local	57.63	84.55	56.42	96.44	65.42	72.09
Model-NonLocal	58.09	85.04	55.73	96.63	66.56	72.41
LNLAttenNet	59.28	86.15	57.80	98.18	68.75	74.03
IPA2LT[35]	55.11	86.77	58.29	91.67	65.61	71.49

Figure 10: Non-local weights of 16 local regions of each face in RAF-DB obtained by the proposed model. The first and third rows show the facial images, and the second and fourth rows show the non-local weights of the 16 local regions corresponding to the images.

From Table IV, it is seen that the proposed method (LNLAttenNet) is superior to the compared methods in most cases on AffectNet, RAF-DB, CK+, MMI and SFEW; the exceptions are LDL-ALSG on AffectNet and MMI, IPA2LT on RAF-DB and SFEW, and NAL on SFEW. Unlike LNLAttenNet, IPA2LT [35] utilizes two large datasets (RAF and AffectNet) as the training set, which helps it obtain better performance. Nevertheless, LNLAttenNet still achieves competitive performance on two datasets (RAF-DB and SFEW) and outperforms IPA2LT on three datasets (AffectNet, CK+ and MMI). Compared with LDL-ALSG [58], LNLAttenNet performs better on RAF-DB, SFEW and CK+, is nearly tied on AffectNet (59.28% versus 59.35%) and is worse on MMI. The last column of Table IV shows the average accuracy over the five datasets for each method that reports results on all of them. LNLAttenNet obtains the highest average accuracy, 74.03%, which illustrates that LNLAttenNet achieves more competitive overall FER performance on the five datasets than the compared methods.
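The averages in the last column of Table IV can be reproduced from the per-dataset accuracies. The short sketch below copies the rows of Table IV that have results on all five datasets and recomputes the mean; the variable names are chosen for this example only.

```python
# Accuracy (%) from Table IV in the order AffectNet, RAF-DB, SFEW, CK+, MMI.
# Only methods with results on all five datasets are listed.
table_iv = {
    "VGG16": [51.11, 80.96, 54.45, 90.37, 63.21],
    "NAL": [55.97, 84.22, 58.13, 91.20, 64.71],
    "LDL-ALSG": [59.35, 85.53, 56.50, 93.08, 70.49],
    "Model-S": [56.26, 83.80, 54.82, 94.14, 63.52],
    "Model-Local": [57.63, 84.55, 56.42, 96.44, 65.42],
    "Model-NonLocal": [58.09, 85.04, 55.73, 96.63, 66.56],
    "LNLAttenNet": [59.28, 86.15, 57.80, 98.18, 68.75],
    "IPA2LT": [55.11, 86.77, 58.29, 91.67, 65.61],
}

# Mean accuracy per method, rounded to two decimals as in the table.
averages = {name: round(sum(acc) / len(acc), 2) for name, acc in table_iv.items()}
best = max(averages, key=averages.get)

for name, avg in sorted(averages.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:15s} {avg:.2f}")
print("highest average:", best)
```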

Furthermore, Model-S is inferior to Model-Local, Model-NonLocal and LNLAttenNet, which demonstrates that the attention mechanism is meaningful for improving the performance of FER in our model. Meanwhile, Model-NonLocal is slightly better than Model-Local on average but clearly inferior to LNLAttenNet, which also demonstrates that jointly utilizing the local and non-local information of facial expressions is more effective. In short, the experimental results illustrate that adaptively enhancing the facial crucial regions in feature learning by LNLAttenNet is effective for improving the performance of FER.

Considering that the RAF-DB and AffectNet datasets contain a large number of images, we also show their confusion matrices in Fig.8 and Fig.9, respectively. According to the confusion matrices, it is observed that the categories fear and surprise are easily distinguishable for RAF-DB (shown in Fig.8) and the categories disgust and anger are easily distinguishable for AffectNet (shown in Fig.9).

IV-C Analyses of Non-Local Attention

In LNLAttenNet, the feature learning of facial crucial regions is adaptively enhanced by jointly optimizing the local and non-local parts, where the non-local attention network is constructed to obtain the global weights ${\bf{w}}^{g}$ of multiple local regions. One purpose of our work is to explore how to automatically enhance the significance of local crucial regions in deep FER when no landmarks are given as prior information on facial crucial regions. To validate this, we analyze the weights of the 16 local regions obtained by our non-local attention on the RAF-DB dataset.

First, the visualization results for 16 persons are shown in Fig.10. In Fig.10, the first and third rows show the original facial expression images, and the second and fourth rows show the 4 $\times$ 4 matrix of the final global weights ${\bf{w}}^{g}$ (16 $\times$ 1) corresponding to the 16 local regions. For each matrix, the darker the color is, the higher the weight is. From Fig.10, it can be seen that crucial regions obtain higher weights and non-crucial regions obtain smaller weights for each facial expression. For example, the areas including or around the eyes are given higher weights for the first person in the first row, where the maximum is given to the local region at coordinate (2,2), which includes the eyes. For the sixth person in the first row, the four local regions (located at (3,2), (3,3), (4,2) and (4,3)) that include his mouth are boosted and given higher weights. In the third and fourth rows, the local regions around the eyes and the mouth are boosted for the second person, and the regions including the eyes are given higher weights for the last person. Visually, these enhanced local regions are more discriminative and significant for FER.
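The mapping between the 16 $\times$ 1 weight vector and the 4 $\times$ 4 grid coordinates used above can be sketched as follows. The sketch assumes a row-major patch order and 1-based (row, column) coordinates, which matches the way regions such as (3,2) and (4,3) are described around the mouth; the weight values are hypothetical and are not taken from the paper.

```python
def to_grid(weights, n=4):
    """Arrange a flat list of n*n patch weights into an n x n grid (row-major)."""
    if len(weights) != n * n:
        raise ValueError("expected n*n weights")
    return [weights[r * n:(r + 1) * n] for r in range(n)]

def top_regions(weights, n=4, k=1):
    """Return 1-based (row, column) coordinates of the k largest weights."""
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return [(i // n + 1, i % n + 1) for i in order[:k]]

# Hypothetical weights for illustration only (not values from the paper).
w_g = [0.03, 0.04, 0.04, 0.03,
       0.06, 0.14, 0.11, 0.05,
       0.05, 0.08, 0.08, 0.05,
       0.04, 0.08, 0.08, 0.04]

for row in to_grid(w_g):
    print(" ".join(f"{w:.2f}" for w in row))
print(top_regions(w_g, k=1))  # the largest weight sits at (2, 2)
```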

From Fig.10, it is also observed that the location of the crucial regions differs between facial images. Nevertheless, our network still automatically finds the more discriminative regions for each face, without the supervision of any annotated crucial points. Based on this, we next conduct an experiment to track the change of the weights corresponding to each local region during the training of our model. Fig.11 shows the change of the non-local weights in the training process. In Fig.11, the first row shows the original image and its final global weights obtained by our model, the second and third rows show the global weights of the 16 local regions at the initial, 250 ${}^{th}$ , 500 ${}^{th}$ , 750 ${}^{th}$ , 1000 ${}^{th}$ and 1250 ${}^{th}$ iterations, respectively, and the last row shows the final weights. From Fig.11, it is seen that the non-local weight of each local patch is the same at the beginning of training, which implies that all local regions are initially regarded as equally important. As training proceeds, the local regions are given different weights, and higher weights are given to the more discriminative regions, such as the patches (located at (4,2) and (4,3)) including the mouth shown in Fig.10(a), the patches (located at (3,2), (3,3), (4,2) and (4,4)) in Fig.10(b), etc. This illustrates that the more crucial local regions can be adaptively enhanced during the training of our network without any landmarks.

Figure 13: Local weights of 16 patches of each face on RAF-DB obtained by the proposed model. The first column shows the results corresponding to original images, and the second to seventh columns show the results corresponding to obscured images.

In order to better observe the change of weights, we also show the weights of the 16 local regions over all iterations in Fig.12. From Fig.12, it is seen that the weight values fluctuate at the beginning of training and gradually stabilize towards the end of training. Patches that are visually more discriminative are highlighted with higher weights, and patches located in non-crucial regions are suppressed with smaller weights. In summary, the analyses of the non-local weights demonstrate that the proposed method can automatically enhance the significance of facial crucial regions in deep feature learning, without any given prior information on facial crucial regions.

IV-D Visualization of Local Attentions

In the proposed method, the local attention is designed to deal with the problem that local regions are missing or obscured. In this part, the local attentions are visualized to validate the robustness of the proposed method for faces with missing regions, using the RAF-DB database. Note that the sigmoid function is employed in our local attention model to select the information flowing into the next layer. Fig.13 shows the visual results of the local attentions obtained by our method.

In Fig.13, the 1st and 3rd rows show one original facial image and six obscured images (from the 2nd to the 7th columns), and the 2nd and 4th rows show the weights of the 16 patches of each facial image obtained by our method. Compared with the result for the original image (shown in the first column of Fig.13), the weight of a patch is weakened when that patch is obscured, while the weights of the other patches are unchanged. Note that the weights of some adjacent patches also decrease together with the central patch, due to the overlapping pixels between two adjacent patches. In practice, the local vector encoded from an obscured patch is given a small weight, which effectively diminishes the influence of that obscured patch on facial expression recognition. In short, the experimental results illustrate that the proposed method equipped with the local attention is more robust for complex facial expression databases in practice.
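The effect of a sigmoid gate on a patch feature vector can be illustrated with a minimal sketch. This is not the local attention structure of Table III: the scalar `score`, the feature values and the function names are hypothetical, and the sketch only shows how a gate value in (0, 1) scales down the contribution of a patch that receives a low score.

```python
import math

def sigmoid(x):
    """Logistic function mapping a real score to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def gate_local_vector(features, score):
    """Scale a patch feature vector by a sigmoid gate computed from a score."""
    gate = sigmoid(score)
    return gate, [gate * f for f in features]

# Hypothetical feature vector and attention scores for one patch.
features = [0.8, -0.3, 1.2]
visible_gate, visible_out = gate_local_vector(features, score=3.0)
obscured_gate, obscured_out = gate_local_vector(features, score=-3.0)

# A low score (as might be produced for an obscured patch) yields a small gate,
# so the gated vector contributes little to the following layers.
print(round(visible_gate, 3), round(obscured_gate, 3))
```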

IV-E Analyses for the parameter $\alpha$

In the non-local attention network, we formulate Eq.(1) to obtain the non-local feature vector ${\bf{g}}^{*}$ based on the global information of the facial expression, where the parameter $\alpha$ is used to trade off the feature vectors ${\bf{g}}$ and ${\bf{s}}$ . In the previous experiments, we set $\alpha=0.7$ . In this part, we analyze the performance of the proposed method under different values of $\alpha$ . In this experiment, the experimental setup is the same as in the above experiments except for $\alpha$ , which is set to each value in {0, 0.1, 0.2, …, 0.9, 1} in turn. Table V shows the accuracy under different $\alpha$ for the five datasets.

From Table V, it is seen that the accuracy first increases and then decreases as the value of $\alpha$ increases. According to Eq.(1), we get ${\bf{g}}^{*}={\bf{g}}$ if $\alpha=0$ and ${\bf{g}}^{*}={\bf{s}}$ if $\alpha=1$ . Considering the network optimization, the back propagation in LNLAttenNet places no constraint on ${\bf s}$ when $\alpha=0$ , which implies that the same effect (or feedback) is given to the non-local attention and each component of the non-local weights $\bf{w}^{g}$ should be random in theory. On the contrary, $\alpha=1$ means that the back propagation places no constraint on the global vector ${\bf{g}}$ , so the back propagation in LNLAttenNet carries no global information and may lead to an extreme result. Indeed, as shown in Fig.14, we also find that the obtained weights ( $\bf{w}^{g}$ ) tend to be random under a small $\alpha$ and equal under a large $\alpha$ , which is consistent with the above analysis of the effect of $\alpha$ .
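The two limiting cases above can be illustrated with a small sketch. It assumes that Eq.(1) is a convex combination, ${\bf{g}}^{*}=(1-\alpha){\bf{g}}+\alpha{\bf{s}}$, which is the simplest form consistent with ${\bf{g}}^{*}={\bf{g}}$ at $\alpha=0$ and ${\bf{g}}^{*}={\bf{s}}$ at $\alpha=1$; the exact form of Eq.(1) should be taken from its definition, and the vectors below are hypothetical.

```python
def blend(g, s, alpha):
    """Return (1 - alpha) * g + alpha * s element-wise (assumed form of Eq.(1))."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if len(g) != len(s):
        raise ValueError("g and s must have the same length")
    return [(1.0 - alpha) * gi + alpha * si for gi, si in zip(g, s)]

# Hypothetical 3-dimensional vectors for illustration.
g = [1.0, 0.0, 2.0]
s = [0.0, 4.0, 2.0]

print(blend(g, s, 0.0))  # equals g: s has no influence
print(blend(g, s, 1.0))  # equals s: g has no influence
print(blend(g, s, 0.7))  # intermediate trade-off between g and s
```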

TABLE V: Accuracy rates (%) given by the proposed method with different $\alpha$.

$\alpha$	0	0.1	0.2	0.3	0.4	0.5	0.6	0.7	0.8	0.9	1.0
RAF	84.09	85.60	85.69	86.15	85.59	85.33	85.17	85.23	83.74	83.54	83.02
SFEW	55.06	55.73	56.88	57.80	57.34	57.11	56.65	56.88	55.96	54.59	53.67
CK+	96.02	96.75	97.56	98.18	98.30	97.74	97.36	96.60	96.22	96.04	95.28
MMI	67.00	67.45	68.50	68.75	68.88	68.25	67.50	67.38	66.93	66.50	66.25
AffectNet	57.94	58.71	59.43	59.28	58.03	57.80	56.83	56.86	56.71	56.66	56.63
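To read the rise-then-fall trend off Table V, the following sketch locates the value of $\alpha$ with the highest accuracy for each dataset. The numbers are copied from Table V; the helper name `peak` is introduced for this example only.

```python
alphas = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]

# Accuracy (%) per dataset, copied from Table V in the order of `alphas`.
table_v = {
    "RAF": [84.09, 85.60, 85.69, 86.15, 85.59, 85.33, 85.17, 85.23, 83.74, 83.54, 83.02],
    "SFEW": [55.06, 55.73, 56.88, 57.80, 57.34, 57.11, 56.65, 56.88, 55.96, 54.59, 53.67],
    "CK+": [96.02, 96.75, 97.56, 98.18, 98.30, 97.74, 97.36, 96.60, 96.22, 96.04, 95.28],
    "MMI": [67.00, 67.45, 68.50, 68.75, 68.88, 68.25, 67.50, 67.38, 66.93, 66.50, 66.25],
    "AffectNet": [57.94, 58.71, 59.43, 59.28, 58.03, 57.80, 56.83, 56.86, 56.71, 56.66, 56.63],
}

def peak(accuracies):
    """Return (alpha, accuracy) at the position of the highest accuracy."""
    best = max(range(len(accuracies)), key=lambda i: accuracies[i])
    return alphas[best], accuracies[best]

for name, acc in table_v.items():
    a, top = peak(acc)
    # Both end points (alpha = 0 and alpha = 1) lie below the interior peak.
    print(f"{name:10s} peak at alpha={a:.1f} ({top:.2f}); ends: {acc[0]:.2f}, {acc[-1]:.2f}")
```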

TABLE VI: Accuracy (%) of the proposed method with different numbers ($M$) of patches.

$M$	4	9	16	25	36
RAF	84.97	85.66	86.15	85.53	85.63
SFEW	55.28	56.88	57.80	58.03	57.80
CK+	96.22	97.17	98.18	97.92	97.74
MMI	67.60	67.90	68.75	68.83	67.13
AffectNet	58.06	58.43	59.28	59.06	57.97

IV-F Analyses for different M

In our method, multiple individual networks are generated based on facial local regions, and the previous experiments are implemented with the number of local patches $M=16$ . We therefore also analyze the effect of the number ( $M$ ) of local patches on the five datasets. In this experiment, $M$ is set to 4, 9, 16, 25 and 36, respectively. Table VI shows the accuracy rates with different $M$ . In this experiment, the size of the input image is 144 $\times$ 144 and the number of overlapping pixels between adjacent patches is around a third of the size of each patch, which is computed by

$$n\cdot P_{size}-(n-1)\cdot\gamma\cdot P_{size}=144, \tag{5}$$

where $\gamma$ is around $1/3$ , $n^{2}=M$ and $P_{size}$ is the size of each patch. Note that the parameters of our network except $M$ are set the same as in the previous experiments.
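Eq.(5) can be solved for the patch size as $P_{size}=144/(n-(n-1)\gamma)$. The sketch below evaluates this for the values of $M$ in Table VI with $\gamma$ fixed at exactly $1/3$; the function name and the use of exact fractions are choices made for this example. For $M=16$ it gives a patch size of 48 and an overlap of 16 pixels, which agrees with the setting stated in the experimental setup; for the other values of $M$ the result is not an integer, so some rounding would be needed in practice.

```python
from fractions import Fraction
import math

def patch_size(num_patches, image_size=144, gamma=Fraction(1, 3)):
    """Solve n*P - (n-1)*gamma*P = image_size for P, where n*n = num_patches."""
    n = math.isqrt(num_patches)
    if n * n != num_patches:
        raise ValueError("num_patches must be a perfect square")
    return Fraction(image_size) / (n - (n - 1) * gamma)

for m in (4, 9, 16, 25, 36):
    p = patch_size(m)
    overlap = p * Fraction(1, 3)  # overlapping pixels with one neighbor
    print(f"M={m:2d}  P_size={float(p):6.2f}  overlap={float(overlap):5.2f}")
```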

From Table VI, it is observed that the performance generally improves as $M$ increases from 4 to 16, while larger values of $M$ bring little or no further gain. This implies that, when $M$ is small, each local region is too large to capture diverse local information. It is also noticed that the computational complexity increases when $M$ is set to a large value, and thus we finally set $M=16$ in most experiments.

IV-G Analyses for Overlapped Pixels between Local Regions

In the previous experiments, $1/3$ of the pixels in each patch are used as the overlapping pixels between two neighboring patches. This is an appropriate value, since the pixels that the middle patch shares with the patches on both sides amount to only $2/3$ of the patch, and the information of the $1/3$ of the pixels at the center of the patch is still retained. If a larger number of overlapping pixels is employed, such as $1/2$ , the middle patch will completely overlap with the patches on both sides. If a smaller number is used, such as $1/4$ , the number of pixels in the overlapping region will be too small to solve the problem of regional connectivity. In order to analyze the influence of the overlapping pixels between two patches, an experiment with otherwise unchanged settings is carried out on the RAF-DB dataset, and the result is shown in Table VII, which lists the accuracies obtained by the proposed method with different numbers ( $N$ ) of overlapping pixels. From the results, it is seen that the performance on the test set increases and then slowly reaches a plateau as the number of overlapping pixels increases. Meanwhile, the more overlapping pixels there are, the larger the number of network parameters is. According to our analyses, the main reason for the plateau is that redundant information between adjacent patches is more easily introduced when the number of overlapping pixels is larger.
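The argument about the overlap ratio can be checked with a one-dimensional calculation: if each neighbor covers a fraction $\gamma$ of the middle patch, the part of the middle patch that belongs to no neighbor is $1-2\gamma$ of its width. The sketch below evaluates this for the three ratios discussed above; the function name is introduced for this example only.

```python
from fractions import Fraction

def exclusive_fraction(gamma):
    """Fraction of a middle patch (1D) not covered by its left or right neighbor.

    Each neighbor covers a fraction gamma of the patch, so 1 - 2*gamma remains;
    the result is clamped at 0 when the neighbors cover the whole patch.
    """
    return max(Fraction(0), 1 - 2 * Fraction(gamma))

for gamma in (Fraction(1, 4), Fraction(1, 3), Fraction(1, 2)):
    print(f"overlap {gamma}: exclusive centre {exclusive_fraction(gamma)}")
```

With an overlap of $1/3$ the central third of the patch is retained, with $1/2$ nothing is left uncovered, and with $1/4$ half of the patch is shared with no neighbor.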

TABLE VII: Accuracy (%) of the proposed method with different numbers ($N$) of overlapping pixels.

$N$	4	8	12	16	20	24
RAF	84.63	84.96	85.29	86.15	86.16	86.24

V Conclusion

In this paper, we propose the LNLAttenNet method to effectively explore the significance of facial crucial regions in feature learning for FER, without any landmark information. In LNLAttenNet, the global information of the facial expression is utilized to construct the non-local attention network, while the local information is utilized to supervise self-information. Through the joint optimization of facial non-local and local feature vectors, LNLAttenNet can adaptively enhance the more crucial regions in the process of deep feature learning. Specifically, an ensemble of multiple networks corresponding to local regions is constructed to integrate the local features with the non-local weights, which achieves interactive guidance between the facial global and local information. Experimental results also demonstrate that local crucial regions can be effectively enhanced in feature learning by LNLAttenNet even though no landmark information is given when training the model. Moreover, the proposed method enhances facial crucial regions in FER based on multiple patches, and thus we will explore it at the pixel level for facial expressions in future work.

References

[1] C. Darwin and P. Prodger, The expression of the emotions in man and animals. Oxford University Press, USA, 1998.
[2] R. Buck, R. E. Miller, and W. F. Caul, “Sex, personality, and physiological variables in the communication of affect via facial expression.” Journal of personality and social psychology, vol. 30, no. 4, p. 587, 1974.
[3] M. C. Smith, M. K. Smith, and H. Ellgring, “Spontaneous and posed facial expression in parkinson’s disease,” Journal of the International Neuropsychological Society, vol. 2, no. 5, pp. 383–391, 1996.
[4] C. A. Corneanu, M. O. Simón, J. F. Cohn, and S. E. Guerrero, “Survey on rgb, 3d, thermal, and multimodal approaches for facial expression recognition: History, trends, and affect-related applications,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 8, pp. 1548–1568, 2016.
[5] A. Majumder, L. Behera, and V. K. Subramanian, “Automatic facial expression recognition system using deep network-based data fusion,” IEEE Transactions on Cybernetics, vol. 48, no. 1, pp. 103–114, 2018.
[6] W. Xie, L. Shen, and J. Duan, “Adaptive weighting of handcrafted feature losses for facial expression recognition,” IEEE Transactions on Cybernetics, 2019.
[7] R. Ekman, What the face reveals: Basic and applied studies of spontaneous expression using the Facial Action Coding System (FACS). Oxford University Press, USA, 1997.
[8] P. Ekman, W. V. Friesen, and J. C. Hager, “Facial action coding system: The manual on cd rom,” A Human Face, Salt Lake City, pp. 77–254, 2002.
[9] S. Wang, G. Peng, S. Chen, and Q. Ji, “Weakly supervised facial action unit recognition with domain knowledge,” IEEE Transactions on Cybernetics, vol. 48, no. 11, pp. 3265–3276, 2018.
[10] H. K. Ekenel and R. Stiefelhagen, “Why is facial occlusion a challenging problem?” in International Conference on Biometrics. Springer, 2009, pp. 299–308.
[11] L. Zhong, Q. Liu, P. Yang, J. Huang, and D. N. Metaxas, “Learning multiscale active facial patches for expression analysis,” IEEE Transactions on Cybernetics, vol. 45, no. 8, pp. 1499–1510, 2014.
[12] Y. Fan, J. C. Lam, and V. O. Li, “Multi-region ensemble convolutional neural network for facial expression recognition,” in Proceedings of International Conference on Artificial Neural Networks. Springer, 2018, pp. 84–94.
[13] Y. Li, J. Zeng, S. Shan, and X. Chen, “Occlusion aware facial expression recognition using cnn with attention mechanism,” IEEE Transactions on Image Processing, vol. 28, no. 5, pp. 2439–2450, 2018.
[14] S. L. Happy and A. Routray, “Automatic facial expression recognition using features of salient facial patches,” IEEE Transactions on Affective Computing, vol. 6, no. 1, pp. 1–12, Jan 2015.
[15] K. Wang, X. Peng, J. Yang, D. Meng, and Y. Qiao, “Region attention networks for pose and occlusion robust facial expression recognition,” IEEE Transactions on Image Processing, vol. 29, pp. 4057–4069, 2020.
[16] M. Liu, S. Shan, R. Wang, and X. Chen, “Learning expressionlets on spatio-temporal manifold for dynamic facial expression recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2014.
[17] C. Shan, S. Gong, and P. W. McOwan, “Facial expression recognition based on local binary patterns: A comprehensive study,” Image and vision Computing, vol. 27, no. 6, pp. 803–816, 2009.
[18] I. Kotsia and I. Pitas, “Facial expression recognition in image sequences using geometric deformation features and support vector machines,” IEEE Transactions on Image Processing, vol. 16, no. 1, pp. 172–187, 2006.
[19] P. Lucey, J. F. Cohn, T. Kanade, J. Saragih, Z. Ambadar, and I. Matthews, “The extended cohn-kanade dataset (ck+): A complete dataset for action unit and emotion-specified expression,” in 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Workshops, June 2010, pp. 94–101.
[20] M. Pantic, M. Valstar, R. Rademaker, and L. Maat, “Web-based database for facial expression analysis,” in 2005 IEEE International Conference on Multimedia and Expo, July 2005, pp. 5 pp.–.
[21] M. J. Lyons, S. Akamatsu, M. Kamachi, J. Gyoba, and J. Budynek, “The japanese female facial expression (jaffe) database,” in Proceedings of third international conference on automatic face and gesture recognition, 1998, pp. 14–16.
[22] G. Zhao, X. Huang, M. Taini, S. Z. Li, and M. PietikäInen, “Facial expression recognition from near-infrared videos,” Image and Vision Computing, vol. 29, no. 9, pp. 607–619, 2011.
[23] S. Li and W. Deng, “Deep facial expression recognition: A survey,” IEEE Transactions on Affective Computing, 2020.
[24] S. Li, W. Deng, and J. Du, “Reliable crowdsourcing and deep locality-preserving learning for expression recognition in the wild,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
[25] A. Mollahosseini, B. Hasani, and M. H. Mahoor, “Affectnet: A database for facial expression, valence, and arousal computing in the wild,” IEEE Transactions on Affective Computing, vol. 10, no. 1, pp. 18–31, Jan 2019.
[26] C. Fabian Benitez-Quiroz, R. Srinivasan, and A. M. Martinez, “Emotionet: An accurate, real-time algorithm for the automatic annotation of a million facial expressions in the wild,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 5562–5570.
[27] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
[28] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
[29] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
[30] P. Hu, D. Cai, S. Wang, A. Yao, and Y. Chen, “Learning supervised scoring ensemble for emotion recognition in the wild,” in Proceedings of the 19th ACM international conference on multimodal interaction. ACM, 2017, pp. 553–560.
[31] D. Acharya, Z. Huang, D. Pani Paudel, and L. Van Gool, “Covariance pooling for facial expression recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2018.
[32] S. Li and W. Deng, “Blended emotion in-the-wild: Multi-label facial expression recognition using crowdsourced annotations and deep locality feature learning,” International Journal of Computer Vision, vol. 127, no. 6-7, pp. 884–906, 2019.
[33] H. Yang and L. Yin, “Cnn based 3d facial expression recognition using masking and landmark features,” 2017 Seventh International Conference on Affective Computing and Intelligent Interaction (ACII), pp. 556–560, 2017.
[34] W. Wu, Y. Yin, Y. Wang, X. Wang, and D. Xu, “Facial expression recognition for different pose faces based on special landmark detection,” 2018 24th International Conference on Pattern Recognition (ICPR), pp. 1524–1529, 2018.
[35] J. Zeng, S. Shan, and X. Chen, “Facial expression recognition with inconsistently annotated datasets,” in Proceedings of the European Conference on Computer Vision (ECCV), September 2018.
[36] Y. Gan, J. Chen, and L. Xu, “Facial expression recognition boosted by soft label with a diverse ensemble,” Pattern Recognition Letters, vol. 125, pp. 105–112, 2019.
[37] H. Yang, U. Ciftci, and L. Yin, “Facial expression recognition by de-expression residue learning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[38] F. Zhang, T. Zhang, Q. Mao, and C. Xu, “Joint pose and expression modeling for facial expression recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[39] S. Zhao, C. Lin, P. Xu, S. Zhao, Y. Guo, R. Krishna, G. Ding, and K. Keutzer, “Cycleemotiongan: Emotional semantic consistency preserved cyclegan for adapting image emotions,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, no. 01, 2019, pp. 2620–2627.
[40] L. Fan, W. Huang, C. Gan, J. Huang, and B. Gong, “Controllable image-to-video translation: A case study on facial expression generation,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, no. 01, 2019, pp. 3510–3517.
[41] R. Wu, G. Zhang, S. Lu, and T. Chen, “Cascade ef-gan: Progressive facial expression editing with local focuses,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
[42] A. Pumarola, A. Agudo, A. M. Martinez, A. Sanfeliu, and F. Moreno-Noguer, “Ganimation: Anatomically-aware facial animation from a single image,” in Proceedings of the European Conference on Computer Vision (ECCV), September 2018.
[43] Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, and J. Choo, “Stargan: Unified generative adversarial networks for multi-domain image-to-image translation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[44] P. Barros, G. I. Parisi, C. Weber, and S. Wermter, “Emotion-modulated attention improves expression recognition: A deep learning model,” Neurocomputing, vol. 253, pp. 104–114, 2017.
[45] X. Wang, M. Peng, L. Pan, M. Hu, C. **, and F. Ren, “Two-level attention with two-stage multi-task learning for facial emotion recognition,” arXiv preprint arXiv:1811.12139, 2018.
[46] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical image computing and computer-assisted intervention. Springer, 2015, pp. 234–241.
[47] T. Falk, D. Mai, R. Bensch, Ö. Çiçek, A. Abdulkadir, Y. Marrakchi, A. Böhm, J. Deubner, Z. Jäckel, K. Seiwald et al., “U-net: deep learning for cell counting, detection, and morphometry,” Nature methods, vol. 16, no. 1, p. 67, 2019.
[48] Z. Zhang, Q. Liu, and Y. Wang, “Road extraction by deep residual u-net,” IEEE Geoscience and Remote Sensing Letters, vol. 15, no. 5, pp. 749–753, May 2018.
[49] T.-Y. Lin, P. Dollar, R. Girshick, K. He, B. Hariharan, and S. Belongie, “Feature pyramid networks for object detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
[50] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
[51] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. u. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds. Curran Associates, Inc., 2017, pp. 5998–6008. [Online]. Available: http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf
[52] X. Wang, R. Girshick, A. Gupta, and K. He, “Non-local neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[53] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid scene parsing network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
[54] A. Dhall, R. Goecke, S. Lucey, and T. Gedeon, “Static facial expression analysis in tough conditions: Data, evaluation protocol and benchmark,” in 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops), Nov 2011, pp. 2106–2112.
[55] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in Proceedings of the thirteenth international conference on artificial intelligence and statistics. JMLR Workshop and Conference Proceedings, 2010, pp. 249–256.
[56] J. Goldberger and E. Ben-Reuven, “Training deep neural-networks using a noise adaptation layer,” in Proceedings of International Conference of Learning Representation (ICLR), 2017.
[57] Y. Wen, K. Zhang, Z. Li, and Y. Qiao, “A discriminative feature learning approach for deep face recognition,” in European Conference on Computer Vision. Springer, 2016, pp. 499–515.
[58] S. Chen, J. Wang, Y. Chen, Z. Shi, X. Geng, and Y. Rui, “Label distribution learning on auxiliary label space graphs for facial expression recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 4321–4330.