Purbayan, Vishal, Naoyuki, Pankaj Wasnik, Vineeth
Sony Research India, Bangalore, India
Indian Institute of Technology, Hyderabad, India
Fiducial Focus Augmentation for Facial Landmark Detection
Abstract
Deep learning methods have led to significant improvements in performance on the facial landmark detection (FLD) task. However, detecting landmarks in challenging settings, such as head pose changes, exaggerated expressions, or uneven illumination, remains difficult due to high variability and insufficient samples. This inadequacy can be attributed to the model’s inability to effectively acquire appropriate facial structure information from the input images. To address this, we propose a novel image augmentation technique specifically designed for the FLD task to enhance the model’s understanding of facial structures. To effectively utilize the newly proposed augmentation technique, we employ a Siamese architecture-based training mechanism with a Deep Canonical Correlation Analysis (DCCA)-based loss to achieve collective learning of high-level feature representations from two different views of the input images. Furthermore, we employ a Transformer + CNN-based network with a custom hourglass module as the robust backbone for the Siamese framework. Extensive experiments show that our approach outperforms multiple state-of-the-art approaches across various benchmark datasets.
1 Introduction
Facial Landmark Detection (FLD) aims to detect the coordinates of predefined landmarks on a given facial image. The rich geometric information provided by landmarks with distinct semantic significance, such as the eye corners, nose tip, or jawline, can be helpful in various tasks like 3D face reconstruction [Kittler et al.(2016)Kittler, Huber, Feng, Hu, and Christmas, Koppen et al.(2018)Koppen, Feng, Kittler, Awais, Christmas, Wu, and Yin, Roth et al.(2016)Roth, Tong, and Liu], face identification [Masi et al.(2016)Masi, Rawls, Medioni, and Natarajan, Taigman et al.(2014)Taigman, Yang, Ranzato, and Wolf, Yang et al.(2017)Yang, Ren, Zhang, Chen, Wen, Li, and Hua], emotion recognition [Fabian Benitez-Quiroz et al.(2016)Fabian Benitez-Quiroz, Srinivasan, and Martinez, Li et al.(2017)Li, Deng, and Du, Walecki et al.(2016)Walecki, Rudovic, Pavlovic, and Pantic], and face morphing [Hassner et al.(2015)Hassner, Harel, Paz, and Enbar]. Several FLD algorithms, based either on coordinate regression [Sun et al.(2013)Sun, Wang, and Tang, Toshev and Szegedy(2014), Trigeorgis et al.(2016)Trigeorgis, Snape, Nicolaou, Antonakos, and Zafeiriou, Lv et al.(2017)Lv, Shao, Xing, Cheng, and Zhou, Zhang et al.(2014)Zhang, Shan, Kan, and Chen, Zhou et al.(2013a)Zhou, Fan, Cao, Jiang, and Yin] or heatmap regression [Zheng et al.(2022)Zheng, Yang, Zhang, Bao, Chen, Huang, Yuan, Chen, Zeng, and Wen, Bulat et al.(2021)Bulat, Sanchez, and Tzimiropoulos, Huang et al.(2020)Huang, Deng, Shen, Zhang, and Ye, Li et al.(2022)Li, Guo, Rhee, Han, and Han, Wen et al.(2022)Wen, Ding, Yao, Wang, and Qian, Lan et al.(2021)Lan, Hu, Chen, Xue, and Cheng], have emerged in recent years with promising performance on various datasets. However, landmark detection remains a challenging task due to the high variability in poses, lighting and expressions. Despite the various existing FLD methodologies, none have focused on robust image augmentation techniques to solve these challenges.
This study illustrates that meticulously designed image augmentations can considerably enhance the FLD performance.
But why do sophisticated deep neural network (DNN) architectures struggle to detect landmarks accurately in challenging scenarios? The reason is that the DNN is unable to learn the facial structure information as accurately as required. If a DNN model can accurately capture features that encode the facial structure, it can predict the landmarks more accurately even from obscured facial regions, like occluded areas. To learn facial structures effectively, we propose a new augmentation technique called Fiducial Focus Augmentation (FiFA), which leverages the ground truth landmark coordinates as an inductive bias for facial structure. To this end, we introduce black patches around the landmark locations in the training images, gradually reducing their size over the epochs and then removing them completely for the rest of the training, as illustrated in Fig. 1. Since the patches cover key semantic regions of the face, e.g., eyes, nose, lips and jawline, when the model learns to predict these patches, it is able to learn the entire facial structure significantly better than an architecture without this inductive bias. One could view this augmentation technique as similar to Curriculum Learning (CL) [Hacohen and Weinshall(2019)], a strategy that trains a machine learning model from simpler data to more difficult data, mimicking the meaningful order found in human-designed learning curricula.
Drawing inspiration from [Bulat et al.(2021)Bulat, Sanchez, and Tzimiropoulos], we leverage the Siamese architecture to acquire a comprehensive understanding of reliable landmark predictions across various image augmentations. However, our method employs Deep Canonical Correlation Analysis (DCCA) [Andrew et al.(2013)Andrew, Arora, Bilmes, and Livescu] as the loss function in the Siamese architecture to amplify the efficacy of the learning process between distinctively augmented views. This loss function assists in the extraction of features that are correlated across views, while simultaneously eliminating uncorrelated noise. To design a robust backbone for the Siamese architecture, we adopt the Vision Transformer (ViT) [Dosovitskiy et al.(2020)Dosovitskiy, Beyer, Kolesnikov, Weissenborn, Zhai, Unterthiner, Dehghani, Minderer, Heigold, Gelly, et al.]. We further improve its performance and efficiency by incorporating a Convolutional Neural Network (CNN)-based hourglass module in between the transformer layers of the ViT. Although CNNs are usually considered to be shift-invariant, downsampling layers can break this property; we hence use an Anti-aliased CNN [Zhang(2019)] inside the hourglass module to retain this benefit. We summarize the contributions of this paper as follows.
- To the best of our knowledge, this is the first effort in the literature to propose a new patch-based augmentation technique for the FLD task to learn facial semantic structures effectively.
- We employ a Siamese-based training scheme utilising a DCCA loss between the feature representations of two different views of the same image, which enforces consistent landmark predictions for the two views. To incorporate the virtues of both a Transformer and a CNN, we design a robust Transformer + CNN-based backbone in our proposed framework.
- We perform extensive experiments on various benchmark datasets, showing significant improvements over prior work. We also conduct ablation studies on our framework components and additional empirical analysis to study the usefulness of the proposed method.
2 Related Works
Earlier efforts on the FLD task, especially those in recent years, can broadly be categorized into network architecture enhancements for heatmap generation and loss function improvements.
Network architecture enhancements: Coordinate regression-based methods [Sun et al.(2013)Sun, Wang, and Tang, Toshev and Szegedy(2014), Trigeorgis et al.(2016)Trigeorgis, Snape, Nicolaou, Antonakos, and Zafeiriou, Lv et al.(2017)Lv, Shao, Xing, Cheng, and Zhou, Zhang et al.(2014)Zhang, Shan, Kan, and Chen, Zhou et al.(2013a)Zhou, Fan, Cao, Jiang, and Yin] directly perform regression on landmark coordinate vectors through a fully connected output layer, which disregards the spatial correlations of features and results in limited landmark detection accuracy. On the other hand, heatmap regression-based methods [Bulat et al.(2021)Bulat, Sanchez, and Tzimiropoulos, Bulat and Tzimiropoulos(2017), Zheng et al.(2022)Zheng, Yang, Zhang, Bao, Chen, Huang, Yuan, Chen, Zeng, and Wen, Huang et al.(2021)Huang, Yang, Li, Kim, and Wei, Huang et al.(2020)Huang, Deng, Shen, Zhang, and Ye, Lan et al.(2021)Lan, Hu, Chen, Xue, and Cheng, Xia et al.(2022)Xia, Qu, Huang, Zhang, Wang, and Xu, Wen et al.(2022)Wen, Ding, Yao, Wang, and Qian, Li et al.(2022)Li, Guo, Rhee, Han, and Han] predict landmark coordinates by creating heatmaps. By doing so, they effectively maintain the original spatial relationships between pixels and achieve promising landmark detection accuracy. Therefore, heatmap regression has become the de facto choice for the FLD task. In [Bulat and Tzimiropoulos(2017)], Bulat et al. proposed an encoder-decoder based framework with heatmap regression for FLD. Their network incorporates hourglass and hierarchical blocks. Several works [Sun et al.(2019)Sun, Zhao, Jiang, Cheng, Xiao, Liu, Mu, Wang, Liu, and Wang, Wang et al.(2020)Wang, Sun, Cheng, Jiang, Deng, Zhao, Liu, Mu, Tan, Wang, et al., Xiao et al.(2018)Xiao, Wu, and Wei] build on the ResNet [He et al.(2016)He, Zhang, Ren, and Sun] architecture and modify the network for dense pixel-wise landmark prediction.
Recently, the Vision Transformer (ViT) [Dosovitskiy et al.(2020)Dosovitskiy, Beyer, Kolesnikov, Weissenborn, Zhai, Unterthiner, Dehghani, Minderer, Heigold, Gelly, et al.] has been incorporated into the FLD task by Zheng et al. [Zheng et al.(2022)Zheng, Yang, Zhang, Bao, Chen, Huang, Yuan, Chen, Zeng, and Wen] and has produced remarkable results. In our proposed framework, we also use ViT as the backbone network and improve its performance by introducing CNN layers in between transformer layers. This allows us to combine the best of both designs.
Loss function improvements: A pixel-wise L2 or L1 loss is the conventional loss generally applied to heatmap regression-based methods [Zhou et al.(2013b)Zhou, Fan, Cao, Jiang, and Yin, Deng et al.(2019)Deng, Trigeorgis, Zhou, and Zafeiriou, Dong et al.(2018)Dong, Yan, Ouyang, and Yang, Newell et al.(2016)Newell, Yang, and Deng, Wei et al.(2016)Wei, Ramakrishna, Kanade, and Sheikh]. To emphasize the importance of small- and medium-range errors during the training process, Feng et al. [Feng et al.(2018)Feng, Kittler, Awais, Huber, and Wu] introduced the Wing loss, which modifies the L1 loss by using a logarithmic function to amplify the impact of errors within a specific range. Additionally, Wang et al. [Wang et al.(2019)Wang, Bo, and Fuxin] developed the Adaptive Wing Loss, which can adjust its curvature based on the ground truth pixels. In [Kumar et al.(2020)Kumar, Marks, Mou, Wang, Jones, Cherian, Koike-Akino, Liu, and Feng], Kumar et al. proposed the LUVLi loss that optimizes the position of the keypoints, the uncertainty, and the likelihood of visibility. Recently, the authors of [Huang et al.(2020)Huang, Deng, Shen, Zhang, and Ye] proposed the Focal Wing Loss, which is used to mine and emphasize difficult samples under in-the-wild conditions.
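To make the Wing loss description concrete, the following is a minimal sketch of its piecewise form for a single coordinate error, following the definition in Feng et al. The function name and the default values `w=10.0` and `epsilon=2.0` are illustrative assumptions and are not settings used in this paper.

```python
import math

def wing_loss(error, w=10.0, epsilon=2.0):
    """Wing loss for one coordinate error (sketch after Feng et al., 2018).

    w and epsilon are assumed example values, not settings from this paper.
    """
    abs_err = abs(error)
    # Constant that joins the logarithmic and linear pieces at |error| = w.
    c = w - w * math.log(1.0 + w / epsilon)
    if abs_err < w:
        # Logarithmic regime: amplifies small- and medium-range errors.
        return w * math.log(1.0 + abs_err / epsilon)
    # Linear regime: behaves like L1 (shifted by c) for large errors.
    return abs_err - c
```

For an error of 1 pixel the logarithmic branch returns a value larger than the plain L1 loss would, which is the amplification effect described above.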
In this work, we use the standard Binary Cross Entropy (BCE) and L1 losses for heatmap and coordinate regression, respectively. In addition, we employ the DCCA loss [Andrew et al.(2013)Andrew, Arora, Bilmes, and Livescu], which suits our framework and, to our knowledge, has not been used before for the FLD task. These simple losses help the proposed framework set a new benchmark. Our study of the literature revealed that well-designed image augmentations are largely ignored for the FLD task. This paper attends to this very issue and introduces a new augmentation technique called FiFA that accounts for our impressive results.
3 Proposed Framework
3.1 Problem Statement & Notations
Given an input image , FLD aims to detect , the coordinates of predefined landmarks. To this end, we propose a heatmap-based approach to regress the facial landmarks. During training, it encodes the target ground truth coordinates as a series of heatmaps with a 2D Gaussian curve centered on them:
(1) |
where and are the spatial coordinates of the point, while and are their scaled, quantized version obtained by scaling factor and rounding operator , i.e.
(2) |
As shown in Eq. (1), we use a Gaussian with variance around each coordinate from to generate the corresponding heatmap . Finally, the pixels with maximum intensity of the heatmap are selected to get the final landmarks in the FLD task.
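As a rough illustration of the encoding and decoding steps described above, the sketch below builds one Gaussian heatmap for a single landmark and recovers the landmark from the maximum-intensity pixel. The function names, the unnormalised Gaussian, and the example values (stride 4, sigma 1.5, a 16 × 16 heatmap) are assumptions made for illustration and are not taken from the paper.

```python
import math

def encode_heatmap(x, y, size, stride, sigma):
    """Return a size x size heatmap (list of rows) for a landmark at image coords (x, y)."""
    # Scale to heatmap resolution and quantize with a rounding operator.
    cx = round(x / stride)
    cy = round(y / stride)
    heatmap = []
    for v in range(size):
        row = []
        for u in range(size):
            d2 = (u - cx) ** 2 + (v - cy) ** 2
            # 2D Gaussian centered on the quantized landmark location.
            row.append(math.exp(-d2 / (2.0 * sigma ** 2)))
        heatmap.append(row)
    return heatmap

def decode_heatmap(heatmap, stride):
    """Select the maximum-intensity pixel and map it back to image coordinates."""
    best_val, best_u, best_v = -1.0, 0, 0
    for v, row in enumerate(heatmap):
        for u, val in enumerate(row):
            if val > best_val:
                best_val, best_u, best_v = val, u, v
    return best_u * stride, best_v * stride

# A landmark at (21, 41) in a 64 x 64 image, encoded at stride 4.
hm = encode_heatmap(21.0, 41.0, size=16, stride=4, sigma=1.5)
print(decode_heatmap(hm, stride=4))  # (20, 40)
```

The decoded point differs from the input by up to half a stride in each direction, which is the quantization error introduced by the scaling and rounding step.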
To attain precise facial landmarks, we propose a novel augmentation technique called Fiducial Focus Augmentation (FiFA) that helps the network to learn facial structures in the provided images, along with a Siamese network with a robust backbone and the DCCA loss to ensure consistent predictions between different augmented views. Detailed explanations of these modules are provided in the subsequent subsections.
3.2 Fiducial Focus Augmentation
We seek to explore the potential of carefully designed image augmentations for the FLD task in this section. To this end, we propose an augmentation for input training images, where . Here, can be any standard image augmentations used in the FLD task [Zheng et al.(2022)Zheng, Yang, Zhang, Bao, Chen, Huang, Yuan, Chen, Zeng, and Wen, Bulat et al.(2021)Bulat, Sanchez, and Tzimiropoulos, Huang et al.(2020)Huang, Deng, Shen, Zhang, and Ye, Xia et al.(2022)Xia, Qu, Huang, Zhang, Wang, and Xu, Wen et al.(2022)Wen, Ding, Yao, Wang, and Qian] and is the proposed Fiducial Focus Augmentation (FiFA).
First, we take the original input image and apply standard image augmentation to get the augmented image (). Mathematically, this can be expressed as:
(3) |
To get the final augmented image , is passed through the proposed augmentation operation i.e., (as described in Alg. 1), i.e.
(4) |
Here, we aim to incorporate the available facial structure ground truth information into the augmented image, in order to aptly utilize the underlying facial structure. To achieve this, we construct black square patches of dimensions , where while retaining the landmarks as the intersection points of the two diagonals of the square patches (see Figure 1 (a)). Each patch is defined by four corner coordinates, which can be expressed as:
(5) |
Here, we start with a bigger patch size of for a certain number of epoch intervals . After every such interval, we reduce the patch size by 1 pixel; eventually, these patches are removed from the images and the rest of the training goes on with augmentation only. So the final augmented image is (where is the total number of epochs):
(6) |
The proposed FiFA helps the backbone network learn the underlying facial structure and address difficult test samples, since the patches cover the entire face uniformly over the different joints (eyes, lips, nose and jawline). At the beginning of training, the model is exposed to larger patches as low-confidence regions to concentrate on the joints and eventually, as the model learns progressively with each epoch, smaller patches are introduced as high-confidence regions around the joints. When the patches are removed completely, the model tries to predict the joints with the inductive bias provided by earlier training steps in our augmentation process. Since the patches can be used with any facial variations (such as pose or expression), their integration into the images as augmentations enables the model to learn the inherent facial structures.
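The patch schedule and the patch placement described in this subsection can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' Alg. 1: the function names, the grayscale list-of-rows image representation, and the example values (initial side of 5 pixels, interval of 10 epochs) are hypothetical.

```python
def patch_size_for_epoch(epoch, initial_size, interval):
    """Patch side length at a 0-based epoch: shrink by 1 pixel per interval, never below 0."""
    return max(initial_size - epoch // interval, 0)

def apply_fifa(image, landmarks, patch_size):
    """Return a copy of a grayscale image with black squares centered on each landmark.

    image is a list of rows; landmarks is a list of (x, y) pixel coordinates.
    For odd patch sizes the landmark is exactly the center of the square.
    """
    out = [row[:] for row in image]
    if patch_size <= 0:
        return out  # patches fully removed: only the standard augmentation remains
    height, width = len(image), len(image[0])
    half = patch_size // 2
    for (x, y) in landmarks:
        x0 = int(round(x)) - half
        y0 = int(round(y)) - half
        # Clip the square to the image borders before blacking it out.
        for v in range(max(y0, 0), min(y0 + patch_size, height)):
            for u in range(max(x0, 0), min(x0 + patch_size, width)):
                out[v][u] = 0
    return out
```

With `initial_size=5` and `interval=10`, the patch side is 5 pixels for epochs 0 to 9, shrinks to 1 pixel for epochs 40 to 49, and the patches disappear from epoch 50 onward.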
3.3 Matching Two Views
Earlier work on the task of FLD has seen limited exploration of Siamese architecture-based training, with the exception of [Bulat et al.(2021)Bulat, Sanchez, and Tzimiropoulos]. In this paper, we propose a Siamese architecture-based framework as illustrated in Fig. 2. The network takes the two input images and generated using two different augmentations and . This training scheme using augmentations holds a notable advantage, as CNNs may not be invariant under arbitrary affine transformations. Therefore, even minor variations within the input space may produce significant changes in the output. By optimizing jointly using the Siamese architecture and combining the two predictions, we enhance the robustness and consistency of the predictions (under such variations).
To maximize the correlation between two different augmented views, we employ the Deep Canonical Correlation Analysis (DCCA) loss [Andrew et al.(2013)Andrew, Arora, Bilmes, and Livescu] between the high-level representation mappings and , where . The correlation between these two mappings can be expressed as below:
(7) |
The DCCA loss (i.e., ) is then computed as:
(8) |
The use of the DCCA loss presents three key advantages: (i) correlated representations partially reconstruct the information in the second view when it is unavailable; (ii) it has the potential to eliminate noise that is uncorrelated across the two views; and (iii) if the two mappings capture features that are correlated across the views, they may represent latent aspects of the face. This, in turn, helps the backbone network capture the facial structure in the images.
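For reference, the correlation objective of DCCA as formulated by Andrew et al. can be written as below. The symbols ($H_1, H_2$ for the two views' feature matrices with $o$ features and $m$ samples, $r_1, r_2$ for regularization constants) are assumed notation and may differ from the notation of Eqs. (7) and (8).

```latex
\begin{align}
\bar{H}_i &= H_i - \tfrac{1}{m} H_i \mathbf{1}\mathbf{1}^{\top},
  \qquad H_i \in \mathbb{R}^{o \times m},\; i \in \{1, 2\} \\
\hat{\Sigma}_{12} &= \tfrac{1}{m-1}\, \bar{H}_1 \bar{H}_2^{\top}, \qquad
\hat{\Sigma}_{ii} = \tfrac{1}{m-1}\, \bar{H}_i \bar{H}_i^{\top} + r_i I \\
T &= \hat{\Sigma}_{11}^{-1/2}\, \hat{\Sigma}_{12}\, \hat{\Sigma}_{22}^{-1/2} \\
\operatorname{corr}(H_1, H_2) &= \lVert T \rVert_{\mathrm{tr}}
  = \operatorname{tr}\!\big( (T^{\top} T)^{1/2} \big) \\
\mathcal{L}_{\mathrm{DCCA}} &= -\operatorname{corr}(H_1, H_2)
\end{align}
```

Minimizing the negative trace norm of $T$ maximizes the sum of canonical correlations between the two views, which is what drives the two branches toward correlated features.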
3.4 Architectural Details
In the proposed framework, we employ a transformer-based architecture (a pre-trained ViT-B/16 [Zheng et al.(2022)Zheng, Yang, Zhang, Bao, Chen, Huang, Yuan, Chen, Zeng, and Wen] consisting of 12 layers and a width of 768) as a backbone. To enhance its performance further, we incorporated three custom CNN-based hourglass modules after every four layers of the transformer network. The purpose of this module is to introduce desirable properties of CNNs, such as shift, scale, and distortion invariance, into the ViT architecture, while still retaining the characteristics of transformers, i.e., dynamic attention, global context, and better generalization. This results in a robust backbone network (Transformer + CNN) which learns facial structures effectively.
The utilization of pooling layers in CNNs often provides a certain degree of shift invariance in the model. However, in our task, it is imperative to avoid the loss of structural information caused by pooling layers. We therefore adopt the Anti-aliased CNN [Zhang(2019)] into our hourglass modules, hereafter known as Anti-aliased Hourglass. The combination of these components significantly enhances the caliber of our network towards high-quality heatmap generation. Nevertheless, the upsampling + concatenation (U+A) operation in the hourglass modules may introduce some high-frequency noise. To mitigate this negative impact and filter the features in the Fourier space, we integrate a FF-Parser layer [Wu et al.(2022)Wu, Fang, Zhang, Yang, and Xu] after each U+A operation in the hourglass modules. We provide ablation studies on these components in our results to demonstrate their usefulness.
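The idea behind anti-aliased downsampling [Zhang(2019)] is to low-pass filter a signal before subsampling it. The following 1-D sketch illustrates that idea only; the binomial kernel, the replicate padding, and the function name are assumptions, and the sketch does not reproduce the layer configuration of the Anti-aliased Hourglass.

```python
def blur_pool_1d(signal, stride=2):
    """Blur with a [1, 2, 1] / 4 binomial kernel, then subsample by `stride`."""
    kernel = (0.25, 0.5, 0.25)
    n = len(signal)
    blurred = []
    for i in range(n):
        left = signal[max(i - 1, 0)]        # replicate padding at the left border
        right = signal[min(i + 1, n - 1)]   # replicate padding at the right border
        blurred.append(kernel[0] * left + kernel[1] * signal[i] + kernel[2] * right)
    return blurred[::stride]

signal = [0, 1, 0, 1, 0, 1, 0, 1]
shifted = [1, 0, 1, 0, 1, 0, 1, 0]   # the same pattern shifted by one sample

# Naive subsampling gives completely different outputs for the two inputs.
print(signal[::2], shifted[::2])      # [0, 0, 0, 0] [1, 1, 1, 1]

# After blurring, the interior outputs agree; only the border sample differs.
print(blur_pool_1d(signal))           # [0.25, 0.5, 0.5, 0.5]
print(blur_pool_1d(shifted))          # [0.75, 0.5, 0.5, 0.5]
```

In this toy case a one-sample shift flips every output of naive subsampling, while the blurred version changes only at the padded border.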
4 Experiments and Results
This section discusses the implementation details, comparison with SOTA methods on benchmark datasets and ablation analysis of the introduced components of the proposed method.
Implementation Details: The proposed method is trained and tested on various benchmark datasets, i.e., WFLW [WFL()], 300W [Sagonas et al.(2016)Sagonas, Antonakos, Tzimiropoulos, Zafeiriou, and Pantic], COFW [Burgos-Artizzu et al.(2013)Burgos-Artizzu, Perona, and Dollár] and AFLW [Köstinger et al.(2011)Köstinger, Wohlhart, Roth, and Bischof]. Details of these datasets are discussed in the Supplementary material. During the training phase, the input image is cropped and resized to . The output feature map size of every hourglass module is set to , which is smaller than the input image size. The ground truth heatmaps are generated by a Gaussian with and radius . We used AdamW [Loshchilov and Hutter(2017)] to optimize our network with an initial learning rate of and trained for 250 epochs. Apart from the proposed augmentation (FiFA), other standard data augmentations () are employed at training time, such as random masking, bilinear interpolation, random occlusion, random gray, random gamma, random blur, and noise fusion. For effective learning, along with the DCCA loss (i.e., ), we also employ the standard BCE loss (i.e., ) and mean absolute error loss (i.e., ) for heatmap and coordinate regression, respectively, with equal weights (i.e., 1.0). For evaluation, we used the standard evaluation metrics, i.e., Normalized Mean Error (NME) variants (i.e., , , ), Failure Rate (FR), and Area Under the Curve (AUC). Detailed definitions of these metrics are given in the Supplementary material.
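Since the metric definitions are deferred to the Supplementary material, the sketch below shows the commonly used forms of the Normalized Mean Error and the Failure Rate. The function names, the choice of normalizing distance, and the 10% failure threshold are assumptions for illustration and may differ from the exact protocol used in the experiments.

```python
import math

def nme(pred, gt, norm_dist):
    """Mean Euclidean landmark error divided by a normalizing distance.

    pred and gt are lists of (x, y) points; norm_dist is, for example, the
    inter-ocular distance or the bounding-box diagonal (assumed choices).
    """
    total = sum(math.hypot(px - gx, py - gy)
                for (px, py), (gx, gy) in zip(pred, gt))
    return total / (len(gt) * norm_dist)

def failure_rate(nme_values, threshold=0.10):
    """Fraction of images whose NME exceeds the threshold (10% assumed here)."""
    failures = sum(1 for value in nme_values if value > threshold)
    return failures / len(nme_values)

# Two landmarks, one of them off by 5 pixels, normalized by a distance of 10.
print(nme([(3, 4), (0, 0)], [(0, 0), (0, 0)], norm_dist=10.0))  # 0.25
print(failure_rate([0.05, 0.25, 0.08, 0.12]))                   # 0.5
```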
For comparison, we choose recent baselines such as FaRL [Zheng et al.(2022)Zheng, Yang, Zhang, Bao, Chen, Huang, Yuan, Chen, Zeng, and Wen], ADNet [Huang et al.(2021)Huang, Yang, Li, Kim, and Wei], SH-FAN [Bulat et al.(2021)Bulat, Sanchez, and Tzimiropoulos], PropNet [Huang et al.(2020)Huang, Deng, Shen, Zhang, and Ye], HIH [Lan et al.(2021)Lan, Hu, Chen, Xue, and Cheng], SLPT [Xia et al.(2022)Xia, Qu, Huang, Zhang, Wang, and Xu], PicassoNet [Wen et al.(2022)Wen, Ding, Yao, Wang, and Qian] and DTLD [Li et al.(2022)Li, Guo, Rhee, Han, and Han]. All the experiments were implemented using PyTorch and the network was trained on 4 GPUs (40GB NVIDIA A100), with batch size 5 per GPU.
4.1 Result Analysis
Comparison on COFW: In Table 1, we present a comparison of the proposed FiFA approach with existing SOTA methods on the COFW testset, which is a well-known benchmark for heavy occlusion and a wide range of head pose variation. It is noteworthy that the proposed FiFA model outperforms the existing SOTA methods. The leading and demonstrate its robustness against extreme situations.
Comparison on 300W: On the 300W dataset, our approach exhibits superior performance in comparison to SOTA methods in terms of , as shown in Table 1. On the challenge set, the proposed approach performs slightly worse than the PropNet [Huang et al.(2020)Huang, Deng, Shen, Zhang, and Ye] and SH-FAN [Bulat et al.(2021)Bulat, Sanchez, and Tzimiropoulos] methods. However, it achieves SOTA results in the other scenarios (i.e., full set and common set), which suggests that our method makes plausible predictions even in difficult situations.
Comparison on AFLW: The results on AFLW testset are presented in Table 1. Adhering to the evaluation protocol adopted in [Zheng et al.(2022)Zheng, Yang, Zhang, Bao, Chen, Huang, Yuan, Chen, Zeng, and Wen], we report comparisons in terms of , and . This table clearly indicates that our approach has outperformed the SOTA results, despite the fact that the dataset is almost saturated.
Comparison on WFLW: In Table 2, we compare results in terms of , and . Here, it is observed that the proposed FiFA approach obtains better for Pose, Illumination and Make Up subsets. Additionally, in comparison on , the proposed approach achieves higher performance in all subsets i.e., Pose, Expression, Illumination, Make Up, Occlusion, Blur by 44%, 41%, 23%, 1%, 34%, 37.4%, respectively over the previous best performing SOTA methods. These results show that our method improves the accuracy in challenging scenarios while also reducing the overall failure ratio for difficult images. Moreover, Fig. 3 visually conveys that the proposed approach delivers significantly more precise landmarks in challenging scenarios.
4.2 Ablation Studies & Analysis
This section presents the ablation analysis carried out to establish the efficacy of the proposed framework. To ensure a fair comparison, all experiments were performed on the COFW dataset.
Effects of method’s components: Herein, we investigate the impact of each component of the proposed framework. The results, presented in Table 5, reveal that the baseline network, i.e., Vanilla backbone (ViT-B/16), attains an NME of 3.11 when trained solely with standard augmentations, i.e., . When anti-aliased CNN-based hourglass modules are incorporated into the baseline, the NME improves to 3.07. By employing the proposed augmentation, , on the input images during training, a remarkable performance boost is achieved, with an NME of 3.00. The best NME of 2.96 is attained when incorporating the Siamese training approach with DCCA loss on both and augmented images. This finding demonstrates that training the backbone with all the proposed components gives the best performance.
Effects of fiducial mask sizes: We conducted a series of experiments to determine the optimal initial patch size for the proposed FiFA. As shown in Table 5, a patch size of yields the best NME of 2.96, while deviating from this size leads to a deterioration in performance. This can be attributed to the fact that during the initial stages of training, when the network weights are not yet sufficiently tuned, a patch size that is either too large or too small will result in a confidence region that is either too broad or too narrow for the network to focus on the landmarks. This, in turn, has an adverse effect on the learning process and ultimately on the performance of the network.
Effect of FiFA over standard augmentations: Several experiments were conducted to prove the effectiveness of our proposed FiFA over other standard augmentations. Due to the availability of only one view of augmented images, all these experiments were performed without a Siamese-based training mechanism. Table 5 displays the results obtained in terms of on the COFW testset. One can notice that the inclusion of our proposed FiFA in standard augmentation techniques leads to a notable improvement in the value.
Comparison with other losses in Siamese training: We employ DCCA loss [Andrew et al.(2013)Andrew, Arora, Bilmes, and Livescu] in Siamese training to maximize the correlation between different views. To demonstrate the efficacy of DCCA loss, we conducted several experiments with different losses (i.e., L2, L1, Smooth L1, and Wing loss [Feng et al.(2018)Feng, Kittler, Awais, Huber, and Wu]), and the corresponding results are presented in Table 6. One can observe that the DCCA loss helps to obtain better , exhibiting a 3% increase as compared to previous best-performing Wing loss.
Effectiveness of the proposed components on other SOTA methods: To validate the effectiveness of the proposed components, we conducted a series of experiments wherein the proposed FiFA augmentation and the Siamese network-based DCCA loss were applied to other baseline methods such as HRNet [Wang et al.(2020)Wang, Sun, Cheng, Jiang, Deng, Zhao, Liu, Mu, Tan, Wang, et al.], ADNet [Huang et al.(2021)Huang, Yang, Li, Kim, and Wei], SH-FAN backbone [Bulat et al.(2021)Bulat, Sanchez, and Tzimiropoulos], FaRL [Zheng et al.(2022)Zheng, Yang, Zhang, Bao, Chen, Huang, Yuan, Chen, Zeng, and Wen], and SLPT [Xia et al.(2022)Xia, Qu, Huang, Zhang, Wang, and Xu]; the corresponding results are summarized in Table 7. The proposed FiFA augmentation technique improved the performance of the baseline methods. Additionally, the Siamese network-based DCCA loss further improved the NME score. This indicates the generalization capability of our method.
5 Conclusion & Future Work
In this paper, we proposed a simple yet effective image augmentation technique called Fiducial Focus Augmentation (FiFA) for the facial landmark detection task. The integration of FiFA during training significantly enhanced the accuracy of the proposed approach on testing benchmarks without extreme modifications to its backbone network and the loss function. Our findings suggest that employing FiFA as an image augmentation technique, in conjunction with Siamese-based training with the DCCA loss, results in state-of-the-art performance. Additionally, we employed an anti-aliased CNN-based hourglass network with ViT as our backbone network to address shift invariance and noise. We performed extensive experimentation and ablation studies to validate the effectiveness of the proposed approach. In future work, FiFA can be studied further to extend it to other face-related tasks.
References
- [WFL()] Look at boundary: A boundary-aware face alignment algorithm. https://wywu.github.io/projects/LAB/WFLW.html.
- [Andrew et al.(2013)Andrew, Arora, Bilmes, and Livescu] Galen Andrew, Raman Arora, Jeff Bilmes, and Karen Livescu. Deep canonical correlation analysis. In International conference on machine learning, pages 1247–1255. PMLR, 2013.
- [Bulat and Tzimiropoulos(2017)] Adrian Bulat and Georgios Tzimiropoulos. How far are we from solving the 2d & 3d face alignment problem? (and a dataset of 230,000 3d facial landmarks). In Proceedings of the IEEE international conference on computer vision, pages 1021–1030, 2017.
- [Bulat et al.(2021)Bulat, Sanchez, and Tzimiropoulos] Adrian Bulat, Enrique Sanchez, and Georgios Tzimiropoulos. Subpixel heatmap regression for facial landmark localization. arXiv preprint arXiv:2111.02360, 2021.
- [Burgos-Artizzu et al.(2013)Burgos-Artizzu, Perona, and Dollár] Xavier P. Burgos-Artizzu, Pietro Perona, and Piotr Dollár. Robust face landmark estimation under occlusion. In 2013 IEEE International Conference on Computer Vision, pages 1513–1520, 2013. 10.1109/ICCV.2013.191.
- [Deng et al.(2019)Deng, Trigeorgis, Zhou, and Zafeiriou] Jiankang Deng, George Trigeorgis, Yuxiang Zhou, and Stefanos Zafeiriou. Joint multi-view face alignment in the wild. IEEE Transactions on Image Processing, 28(7):3636–3648, 2019.
- [Dong et al.(2018)Dong, Yan, Ouyang, and Yang] Xuanyi Dong, Yan Yan, Wanli Ouyang, and Yi Yang. Style aggregated network for facial landmark detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 379–388, 2018.
- [Dosovitskiy et al.(2020)Dosovitskiy, Beyer, Kolesnikov, Weissenborn, Zhai, Unterthiner, Dehghani, Minderer, Heigold, Gelly, et al.] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
- [Fabian Benitez-Quiroz et al.(2016)Fabian Benitez-Quiroz, Srinivasan, and Martinez] C Fabian Benitez-Quiroz, Ramprakash Srinivasan, and Aleix M Martinez. Emotionet: An accurate, real-time algorithm for the automatic annotation of a million facial expressions in the wild. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5562–5570, 2016.
- [Feng et al.(2018)Feng, Kittler, Awais, Huber, and Wu] Zhen-Hua Feng, Josef Kittler, Muhammad Awais, Patrik Huber, and Xiao-Jun Wu. Wing loss for robust facial landmark localisation with convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2235–2245, 2018.
- [Hacohen and Weinshall(2019)] Guy Hacohen and Daphna Weinshall. On the power of curriculum learning in training deep networks. In International Conference on Machine Learning, pages 2535–2544. PMLR, 2019.
- [Hassner et al.(2015)Hassner, Harel, Paz, and Enbar] Tal Hassner, Shai Harel, Eran Paz, and Roee Enbar. Effective face frontalization in unconstrained images. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4295–4304, 2015.
- [He et al.(2016)He, Zhang, Ren, and Sun] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
- [Huang et al.(2020)Huang, Deng, Shen, Zhang, and Ye] Xiehe Huang, Weihong Deng, Haifeng Shen, Xiubao Zhang, and Jieping Ye. Propagationnet: Propagate points to curve to learn structure information. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7265–7274, 2020.
- [Huang et al.(2021)Huang, Yang, Li, Kim, and Wei] Yangyu Huang, Hao Yang, Chong Li, Jongyoo Kim, and Fangyun Wei. Adnet: Leveraging error-bias towards normal direction in face alignment. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3080–3090, 2021.
- [Kittler et al.(2016)Kittler, Huber, Feng, Hu, and Christmas] Josef Kittler, Patrik Huber, Zhen-Hua Feng, Guosheng Hu, and William Christmas. 3d morphable face models and their applications. In Articulated Motion and Deformable Objects: 9th International Conference, AMDO 2016, Palma de Mallorca, Spain, July 13-15, 2016, Proceedings 9, pages 185–206. Springer, 2016.
- [Koppen et al.(2018)Koppen, Feng, Kittler, Awais, Christmas, Wu, and Yin] Paul Koppen, Zhen-Hua Feng, Josef Kittler, Muhammad Awais, William Christmas, Xiao-Jun Wu, and He-Feng Yin. Gaussian mixture 3d morphable face model. Pattern Recognition, 74:617–628, 2018.
- [Kumar et al.(2020)Kumar, Marks, Mou, Wang, Jones, Cherian, Koike-Akino, Liu, and Feng] Abhinav Kumar, Tim K Marks, Wenxuan Mou, Ye Wang, Michael Jones, Anoop Cherian, Toshiaki Koike-Akino, Xiaoming Liu, and Chen Feng. Luvli face alignment: Estimating landmarks’ location, uncertainty, and visibility likelihood. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8236–8246, 2020.
- [Köstinger et al.(2011)Köstinger, Wohlhart, Roth, and Bischof] Martin Köstinger, Paul Wohlhart, Peter M. Roth, and Horst Bischof. Annotated facial landmarks in the wild: A large-scale, real-world database for facial landmark localization. In 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops), pages 2144–2151, 2011. doi: 10.1109/ICCVW.2011.6130513.
- [Lan et al.(2021)Lan, Hu, Chen, Xue, and Cheng] Xing Lan, Qinghao Hu, Qiang Chen, Jian Xue, and Jian Cheng. Hih: Towards more accurate face alignment via heatmap in heatmap. arXiv preprint arXiv:2104.03100, 2021.
- [Li et al.(2022)Li, Guo, Rhee, Han, and Han] Hui Li, Zidong Guo, Seon-Min Rhee, Seungju Han, and Jae-Joon Han. Towards accurate facial landmark detection via cascaded transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4176–4185, 2022.
- [Li et al.(2017)Li, Deng, and Du] Shan Li, Weihong Deng, and Junping Du. Reliable crowdsourcing and deep locality-preserving learning for expression recognition in the wild. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2852–2861, 2017.
- [Loshchilov and Hutter(2017)] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
- [Lv et al.(2017)Lv, Shao, Xing, Cheng, and Zhou] Jiangjing Lv, Xiaohu Shao, Junliang Xing, Cheng Cheng, and Xi Zhou. A deep regression architecture with two-stage re-initialization for high performance facial landmark detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3317–3326, 2017.
- [Masi et al.(2016)Masi, Rawls, Medioni, and Natarajan] Iacopo Masi, Stephen Rawls, Gérard Medioni, and Prem Natarajan. Pose-aware face recognition in the wild. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4838–4846, 2016.
- [Newell et al.(2016)Newell, Yang, and Deng] Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked hourglass networks for human pose estimation. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part VIII 14, pages 483–499. Springer, 2016.
- [Roth et al.(2016)Roth, Tong, and Liu] Joseph Roth, Yiying Tong, and Xiaoming Liu. Adaptive 3d face reconstruction from unconstrained photo collections. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4197–4206, 2016.
- [Sagonas et al.(2016)Sagonas, Antonakos, Tzimiropoulos, Zafeiriou, and Pantic] Christos Sagonas, Epameinondas Antonakos, Georgios Tzimiropoulos, Stefanos Zafeiriou, and Maja Pantic. 300 faces in-the-wild challenge: Database and results. Image and vision computing, 47:3–18, 2016.
- [Sun et al.(2019)Sun, Zhao, Jiang, Cheng, Xiao, Liu, Mu, Wang, Liu, and Wang] Ke Sun, Yang Zhao, Borui Jiang, Tianheng Cheng, Bin Xiao, Dong Liu, Yadong Mu, Xinggang Wang, Wenyu Liu, and Jingdong Wang. High-resolution representations for labeling pixels and regions. arXiv preprint arXiv:1904.04514, 2019.
- [Sun et al.(2013)Sun, Wang, and Tang] Yi Sun, Xiaogang Wang, and Xiaoou Tang. Deep convolutional network cascade for facial point detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3476–3483, 2013.
- [Taigman et al.(2014)Taigman, Yang, Ranzato, and Wolf] Yaniv Taigman, Ming Yang, Marc’Aurelio Ranzato, and Lior Wolf. Deepface: Closing the gap to human-level performance in face verification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1701–1708, 2014.
- [Toshev and Szegedy(2014)] Alexander Toshev and Christian Szegedy. Deeppose: Human pose estimation via deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1653–1660, 2014.
- [Trigeorgis et al.(2016)Trigeorgis, Snape, Nicolaou, Antonakos, and Zafeiriou] George Trigeorgis, Patrick Snape, Mihalis A Nicolaou, Epameinondas Antonakos, and Stefanos Zafeiriou. Mnemonic descent method: A recurrent process applied for end-to-end face alignment. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4177–4187, 2016.
- [Walecki et al.(2016)Walecki, Rudovic, Pavlovic, and Pantic] Robert Walecki, Ognjen Rudovic, Vladimir Pavlovic, and Maja Pantic. Copula ordinal regression for joint estimation of facial action unit intensity. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4902–4910, 2016.
- [Wang et al.(2020)Wang, Sun, Cheng, Jiang, Deng, Zhao, Liu, Mu, Tan, Wang, et al.] Jingdong Wang, Ke Sun, Tianheng Cheng, Borui Jiang, Chaorui Deng, Yang Zhao, Dong Liu, Yadong Mu, Mingkui Tan, Xinggang Wang, et al. Deep high-resolution representation learning for visual recognition. IEEE transactions on pattern analysis and machine intelligence, 43(10):3349–3364, 2020.
- [Wang et al.(2019)Wang, Bo, and Fuxin] Xinyao Wang, Liefeng Bo, and Li Fuxin. Adaptive wing loss for robust face alignment via heatmap regression. In Proceedings of the IEEE International Conference on Computer Vision, pages 6971–6981, 2019.
- [Wei et al.(2016)Wei, Ramakrishna, Kanade, and Sheikh] Shih-En Wei, Varun Ramakrishna, Takeo Kanade, and Yaser Sheikh. Convolutional pose machines. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 4724–4732, 2016.
- [Wen et al.(2022)Wen, Ding, Yao, Wang, and Qian] Tiancheng Wen, Zhonggan Ding, Yongqiang Yao, Yaxiong Wang, and Xueming Qian. Picassonet: Searching adaptive architecture for efficient facial landmark localization. IEEE Transactions on Neural Networks and Learning Systems, 2022.
- [Wu et al.(2022)Wu, Fang, Zhang, Yang, and Xu] Junde Wu, Huihui Fang, Yu Zhang, Yehui Yang, and Yanwu Xu. Medsegdiff: Medical image segmentation with diffusion probabilistic model. arXiv preprint arXiv:2211.00611, 2022.
- [Xia et al.(2022)Xia, Qu, Huang, Zhang, Wang, and Xu] Jiahao Xia, Weiwei Qu, Wenjian Huang, Jianguo Zhang, Xi Wang, and Min Xu. Sparse local patch transformer for robust face alignment and landmarks inherent relation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4052–4061, 2022.
- [Xiao et al.(2018)Xiao, Wu, and Wei] Bin Xiao, Haiping Wu, and Yichen Wei. Simple baselines for human pose estimation and tracking. In Proceedings of the European conference on computer vision (ECCV), pages 466–481, 2018.
- [Yang et al.(2017)Yang, Ren, Zhang, Chen, Wen, Li, and Hua] Jiaolong Yang, Peiran Ren, Dongqing Zhang, Dong Chen, Fang Wen, Hongdong Li, and Gang Hua. Neural aggregation network for video face recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4362–4371, 2017.
- [Zhang et al.(2014)Zhang, Shan, Kan, and Chen] Jie Zhang, Shiguang Shan, Meina Kan, and Xilin Chen. Coarse-to-fine auto-encoder networks (cfan) for real-time face alignment. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part II 13, pages 1–16. Springer, 2014.
- [Zhang(2019)] Richard Zhang. Making convolutional networks shift-invariant again. In International conference on machine learning, pages 7324–7334. PMLR, 2019.
- [Zheng et al.(2022)Zheng, Yang, Zhang, Bao, Chen, Huang, Yuan, Chen, Zeng, and Wen] Yinglin Zheng, Hao Yang, Ting Zhang, Jianmin Bao, Dongdong Chen, Yangyu Huang, Lu Yuan, Dong Chen, Ming Zeng, and Fang Wen. General facial representation learning in a visual-linguistic manner. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18697–18709, 2022.
- [Zhou et al.(2013a)Zhou, Fan, Cao, Jiang, and Yin] Erjin Zhou, Haoqiang Fan, Zhimin Cao, Yuning Jiang, and Qi Yin. Extensive facial landmark localization with coarse-to-fine convolutional network cascade. In Proceedings of the IEEE international conference on computer vision workshops, pages 386–391, 2013a.
- [Zhou et al.(2013b)Zhou, Fan, Cao, Jiang, and Yin] Erjin Zhou, Haoqiang Fan, Zhimin Cao, Yuning Jiang, and Qi Yin. Extensive facial landmark localization with coarse-to-fine convolutional network cascade. In Proceedings of the IEEE international conference on computer vision workshops, pages 386–391, 2013b.