License: arXiv.org perpetual non-exclusive license
arXiv:2402.15044v1 [cs.CV] 23 Feb 2024
Purbayan, Vishal, Naoyuki, Pankaj Wasnik†, Vineeth
Sony Research India, Bangalore, India
Indian Institute of Technology, Hyderabad, India

Fiducial Focus Augmentation for Facial Landmark Detection

Abstract

Deep learning methods have led to significant improvements in performance on the facial landmark detection (FLD) task. However, detecting landmarks in challenging settings, such as head pose changes, exaggerated expressions, or uneven illumination, remains a challenge due to high variability and insufficient samples. This inadequacy can be attributed to the model's inability to effectively acquire appropriate facial structure information from the input images. To address this, we propose a novel image augmentation technique specifically designed for the FLD task to enhance the model's understanding of facial structures. To effectively utilize the newly proposed augmentation technique, we employ a Siamese architecture-based training mechanism with a Deep Canonical Correlation Analysis (DCCA)-based loss to achieve collective learning of high-level feature representations from two different views of the input images. Furthermore, we employ a Transformer + CNN-based network with a custom hourglass module as the robust backbone for the Siamese framework. Extensive experiments show that our approach outperforms multiple state-of-the-art approaches across various benchmark datasets.

1 Introduction

Facial Landmark Detection (FLD) aims to detect the coordinates of predefined landmarks on a given facial image. The rich geometric information provided by landmarks with distinct semantic significance, such as eye corners, the nose tip, or the jawline, can be helpful in various tasks like 3D face reconstruction [Kittler et al., 2016; Koppen et al., 2018; Roth et al., 2016], face identification [Masi et al., 2016; Taigman et al., 2014; Yang et al., 2017], emotion recognition [Fabian Benitez-Quiroz et al., 2016; Li et al., 2017; Walecki et al., 2016], and face morphing [Hassner et al., 2015]. Several FLD algorithms, based either on coordinate regression [Sun et al., 2013; Toshev and Szegedy, 2014; Trigeorgis et al., 2016; Lv et al., 2017; Zhang et al., 2014; Zhou et al., 2013a] or heatmap regression [Zheng et al., 2022; Bulat et al., 2021; Huang et al., 2020; Li et al., 2022; Wen et al., 2022; Lan et al., 2021], have emerged in recent years with promising performance on various datasets. However, landmark detection remains a challenging task due to the high variability in poses, lighting, and expressions. Despite the various existing FLD methodologies, none have focused on robust image augmentation techniques to solve these challenges. This study illustrates that meticulously designed image augmentations can considerably enhance FLD performance.

Figure 1: Illustration of the proposed Fiducial Focus Augmentation (FiFA). In row (a), $5 \times 5$ black patches are created around the landmark joints (along with other standard augmentations) in the initial epochs and reduced over the epochs. Rows (b) and (c) show the corresponding GradCAM-based saliency maps of the network's last layer with and without FiFA, respectively. The activations are more prominent around the desired landmarks when FiFA is used as an additional augmentation.

But why do sophisticated deep neural network (DNN) architectures struggle to detect landmarks accurately in challenging scenarios? The reason is that the DNN is unable to learn the facial structure information as accurately as required. If a DNN model can accurately capture features that encode the facial structure, it can predict the landmarks more accurately even from obscured facial regions, like occluded areas. To learn facial structures effectively, we propose a new augmentation technique called Fiducial Focus Augmentation (FiFA), which leverages the ground truth landmark coordinates as an inductive bias for facial structure. To this end, we introduce $n \times n$ black patches around the landmark locations in the training images, gradually reducing them over the epochs and then removing them completely for the rest of the training, as illustrated in Fig. 1. Since the patches cover key semantic regions of the face, e.g., eyes, nose, lips, and jawline, when the model learns to predict these patches, it is able to learn the entire facial structure significantly better than an architecture without this inductive bias. One could view this augmentation technique as similar to Curriculum Learning (CL) [Hacohen and Weinshall, 2019], a strategy that trains a machine learning model from simpler data to more difficult data, mimicking the meaningful order found in human-designed learning curricula.

Drawing inspiration from [Bulat et al., 2021], we leverage the Siamese architecture to acquire a comprehensive understanding of reliable landmark predictions across various image augmentations. However, our method employs Deep Canonical Correlation Analysis (DCCA) [Andrew et al., 2013] as the loss function in the Siamese architecture to amplify the efficacy of the learning process between distinctively augmented views. This loss function assists in the extraction of features that are correlated across views, while simultaneously eliminating uncorrelated noise. To design a robust backbone for the Siamese architecture, we adopt the Vision Transformer (ViT) [Dosovitskiy et al., 2020]. We further improve its performance and efficiency by incorporating a Convolutional Neural Network (CNN)-based hourglass module in between the transformer layers of the ViT. CNNs are usually considered to be shift-invariant, but aliasing introduced by strided downsampling can weaken this property [Zhang, 2019]; we hence use an Anti-aliased CNN inside the hourglass module to retain it. We summarize the contributions of this paper as follows.

  • To the best of our knowledge, this is the first effort in the literature to propose a new patch-based augmentation technique for the FLD task to learn facial semantic structures effectively.

  • We employ a Siamese-based training scheme utilising a DCCA loss between the feature representations of two different views of the same image, which enforces consistent landmark predictions for the two views. To incorporate the virtues of both a Transformer and a CNN, we design a robust Transformer + CNN-based backbone in our proposed framework.

  • We performed extensive experiments on various benchmark datasets showing significant improvements over prior work. We also conducted ablation studies on our framework components and additional empirical analysis to study the usefulness of the proposed method.

2 Related Works

Earlier efforts on the FLD task, especially those in recent years, can broadly be categorized into network architecture enhancements for heatmap generation and loss function improvements.

Network architecture enhancements: Coordinate regression-based methods [Sun et al., 2013; Toshev and Szegedy, 2014; Trigeorgis et al., 2016; Lv et al., 2017; Zhang et al., 2014; Zhou et al., 2013a] directly regress landmark coordinate vectors through a fully connected output layer, which disregards the spatial correlations of features and results in limited landmark detection accuracy. On the other hand, heatmap regression-based methods [Bulat et al., 2021; Bulat and Tzimiropoulos, 2017; Zheng et al., 2022; Huang et al., 2021; Huang et al., 2020; Lan et al., 2021; Xia et al., 2022; Wen et al., 2022; Li et al., 2022] predict landmark coordinates by creating heatmaps. By doing so, they effectively maintain the original spatial relationships between pixels and achieve promising landmark detection accuracy. Therefore, heatmap regression has become the de facto choice for the FLD task. Bulat and Tzimiropoulos [Bulat and Tzimiropoulos, 2017] proposed an encoder-decoder based framework with heatmap regression for FLD. Their network incorporates hourglass and hierarchical blocks. Several works [Sun et al., 2019; Wang et al., 2020; Xiao et al., 2018] build on the ResNet [He et al., 2016] architecture and modify it for dense pixel-wise landmark predictions. Recently, the Vision Transformer (ViT) [Dosovitskiy et al., 2020] has been incorporated into the FLD task by Zheng et al. [Zheng et al., 2022] and has produced remarkable results. In our proposed framework, we also use ViT as the backbone network and improve its performance by introducing CNN layers in between the transformer layers. This allows us to combine the best of both designs.

Loss function improvements: A pixel-wise $L2$ or $L1$ loss is the conventional loss generally applied to heatmap regression-based methods [Zhou et al., 2013b; Deng et al., 2019; Dong et al., 2018; Newell et al., 2016; Wei et al., 2016]. To emphasize the importance of small and medium range errors during the training process, Feng et al. [Feng et al., 2018] introduced the Wing loss, which modifies the $L1$ loss by using a logarithmic function to amplify the impact of errors within a specific range. Additionally, Wang et al. [Wang et al., 2019] developed the Adaptive Wing Loss, which can adjust its curvature based on the ground truth pixels. Kumar et al. [Kumar et al., 2020] proposed the LUVLi loss, which optimizes the position of the keypoints, the uncertainty, and the likelihood of visibility. Recently, Huang et al. [Huang et al., 2020] proposed the Focal Wing Loss, which is used to mine and emphasize difficult samples under in-the-wild conditions.
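To make the logarithmic reshaping of the $L1$ loss concrete, the following is a minimal NumPy sketch of the Wing loss as commonly stated in Feng et al.: $w \ln(1 + |x|/\epsilon)$ for $|x| < w$ and $|x| - C$ otherwise, with $C = w - w \ln(1 + w/\epsilon)$ chosen so that the two pieces meet. The function name `wing_loss` and the default values of `w` and `eps` are illustrative assumptions, not settings used in this paper.

```python
import numpy as np

def wing_loss(pred, target, w=10.0, eps=2.0):
    """Mean Wing loss between predicted and target coordinates.

    w and eps are illustrative defaults (assumptions), not values from this paper.
    """
    x = np.abs(np.asarray(pred, dtype=np.float64) - np.asarray(target, dtype=np.float64))
    # Constant that makes the logarithmic and linear pieces continuous at |x| = w.
    c = w - w * np.log(1.0 + w / eps)
    # Logarithmic branch amplifies small/medium errors; linear branch handles large ones.
    per_element = np.where(x < w, w * np.log(1.0 + x / eps), x - c)
    return per_element.mean()
```

Compared with a plain $L1$ loss, the gradient of the logarithmic branch grows as the error shrinks, which is how small and medium errors receive more weight.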

In this work, we use the standard Binary Cross Entropy (BCE) and $L2$ losses for heatmap and coordinate regression, respectively. In addition, we employ the DCCA loss [Andrew et al., 2013], which suits our framework and, to the best of our knowledge, has not been used before for the FLD task. These simple losses help the proposed framework set a new benchmark. Our study of the literature revealed that well-designed image augmentations are largely ignored for the FLD task. This paper addresses this issue and introduces a new augmentation technique called FiFA, which accounts for our results.

Figure 2: An overview of the proposed Siamese-based framework. PPE = Patch + Position Embeddings; RB = Residual Block; MHA = Multi-Head Attention; MLP = Multi-Layer Perceptron; CBP = Convolution+BlurPool; BU = Bilinear Upsampling; FFP = FF-Parser.

3 Proposed Framework

3.1 Problem Statement & Notations

Given an input image $I$, FLD aims to detect $\{x, y\} \in \mathbb{R}^{K \times 2}$, the coordinates of $K$ predefined landmarks. To this end, we propose a heatmap-based approach to regress the facial landmarks. During training, the target ground truth coordinates are encoded as a series of $K$ heatmaps, each with a 2D Gaussian centered on the corresponding landmark:

$$\Psi_{i,j,k} = \frac{1}{2\pi\sigma^{2}} e^{-\frac{1}{2\sigma^{2}}\left[(i-\bar{x}_{k})^{2} + (j-\bar{y}_{k})^{2}\right]} \tag{1}$$

where $x_{k}$ and $y_{k}$ are the spatial coordinates of the $k^{th}$ point, while $\bar{x}_{k}$ and $\bar{y}_{k}$ are their scaled, quantized versions obtained with the scaling factor $s$ and the rounding operator $\lfloor \cdot \rceil$, i.e.,

$$(\bar{x}_{k}, \bar{y}_{k}) = \left(\left\lfloor \tfrac{1}{s} x_{k} \right\rceil, \left\lfloor \tfrac{1}{s} y_{k} \right\rceil\right) \tag{2}$$

As shown in Eq. (1), we use a Gaussian with standard deviation $\sigma$ around each coordinate from $\{x, y\}$ to generate the corresponding heatmap $\mathbb{H} \in \mathbb{R}^{K \times W \times H}$. Finally, the pixel with maximum intensity in each channel of the heatmap $\mathbb{H}$ is selected to obtain the final $K$ landmarks in the FLD task.
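The encoding in Eqs. (1) and (2) and the arg-max decoding can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions: the function names `encode_heatmaps` and `decode_heatmaps`, the `(K, H, W)` array layout, and the default `sigma` are choices made for the example, and `np.rint` (which rounds exact halves to the nearest even integer) stands in for the rounding operator.

```python
import numpy as np

def encode_heatmaps(landmarks, height, width, s=1.0, sigma=1.5):
    """Render one 2D Gaussian heatmap per landmark, following Eqs. (1)-(2).

    landmarks: array-like of shape (K, 2) holding (x, y) in input-image pixels.
    Returns an array of shape (K, height, width) (layout is an assumption).
    """
    landmarks = np.asarray(landmarks, dtype=np.float64)
    centers = np.rint(landmarks / s)                 # Eq. (2): scale by 1/s, then round
    i = np.arange(width)[None, None, :]              # horizontal pixel index
    j = np.arange(height)[None, :, None]             # vertical pixel index
    dx2 = (i - centers[:, 0][:, None, None]) ** 2
    dy2 = (j - centers[:, 1][:, None, None]) ** 2
    # Eq. (1): normalised isotropic Gaussian centred on the quantized landmark.
    return np.exp(-(dx2 + dy2) / (2.0 * sigma ** 2)) / (2.0 * np.pi * sigma ** 2)

def decode_heatmaps(heatmaps, s=1.0):
    """Pick the maximum-intensity pixel of each channel and map it back to image scale."""
    num, height, width = heatmaps.shape
    flat_idx = heatmaps.reshape(num, -1).argmax(axis=1)
    ys, xs = np.unravel_index(flat_idx, (height, width))
    return np.stack([xs, ys], axis=1) * s
```

Because of the quantization in Eq. (2), decoding recovers each landmark only up to the heatmap resolution: with $s = 4$, a landmark at $(40.4, 8.6)$ is decoded as $(40, 8)$.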

To attain precise facial landmarks, we propose a novel augmentation technique called Fiducial Focus Augmentation (FiFA) that helps the network to learn facial structures in the provided images, along with a Siamese network with a robust backbone and the DCCA loss to ensure consistent predictions between different augmented views. Detailed explanations of these modules are provided in the subsequent subsections.

3.2 Fiducial Focus Augmentation

We seek to explore the potential of carefully designed image augmentations for the FLD task in this section. To this end, we propose an augmentation $f_{A}$ for input training images, where $f_{A} = f_{A_{2}} \circ f_{A_{1}}$. Here, $f_{A_{1}}$ can be any standard image augmentation used in the FLD task [Zheng et al., 2022; Bulat et al., 2021; Huang et al., 2020; Xia et al., 2022; Wen et al., 2022] and $f_{A_{2}}$ is the proposed Fiducial Focus Augmentation (FiFA).

First, we take the original input image $I$ and apply the standard image augmentation $f_{A_{1}}$ to get the augmented image $I'$. Mathematically, this can be expressed as:

$$I' = f_{A_{1}} \otimes I. \tag{3}$$

To get the final augmented image $I''$, $I'$ is passed through the proposed augmentation operation $f_{A_{2}}$ (as described in Alg. 1), i.e.,

$$I'' = f_{A_{2}} \otimes I' = \hat{I} \otimes I' = \hat{I} \otimes (f_{A_{1}} \otimes I). \tag{4}$$

Here, we aim to incorporate the available facial structure ground truth information into the augmented image $I'$ in order to aptly utilize the underlying facial structure. To achieve this, we construct black square patches of dimensions $h_{f} \times w_{f}$, where $h_{f}, w_{f} \in \{1, \cdots, n\}$, while retaining the landmarks as the intersection points of the two diagonals of the square patches (see Figure 1 (a)). Each patch is defined by four corner coordinates, which can be expressed as:

$$\{(x_{i}-w_{f}, y_{i}+h_{f}), (x_{i}+w_{f}, y_{i}+h_{f}), (x_{i}+w_{f}, y_{i}-h_{f}), (x_{i}-w_{f}, y_{i}-h_{f})\}\ \forall\ \{x_{i}, y_{i}\} \in L. \tag{5}$$

Here, we start with a larger patch size of $n \times n$ and keep it for an interval of $\mathcal{E}$ epochs. After every such interval, we reduce the patch size by 1 pixel; eventually, the patches are removed from the images and the rest of the training proceeds with augmentation $f_{A_{1}}$ only. The final augmented image is therefore (where $T_{n}$ is the total number of epochs):

$$\begin{cases} I'' & \text{when epoch no.} \leq n \cdot \mathcal{E} \\ I' & \text{when } n \cdot \mathcal{E} < \text{epoch no.} \leq T_{n}. \end{cases} \tag{6}$$
Initialize:
  $I'$: Augmented image, where $I' = f_{A_{1}} \otimes I$,
  $L_{n}$: Number of landmarks in $I$,
  $L$: Set of $L_{n}$ landmarks, where $L = \{(x_{i}, y_{i})\}$, $i \in \{1, ..., L_{n}\}$,
  $h_{f}, w_{f}$: Height and width of the patches $(S)$, where $h_{f}, w_{f} \in \{n, ..., 1\}$,
  $I_{in}$: Pixel intensity of $S$, where $I_{in} = (0, 0, 0)$,
  $\mathcal{E}$: Epoch interval, where $\mathcal{E} \in \{1, ..., n\} \land n <$ total number of epochs $(T_{n})$,
  $T_{n}$: Total number of epochs $= \sum_{i=1}^{n} \mathcal{E}_{i} + w$, where $w \in \mathbb{W}$
Procedure:
  for $i$ in range $T_{n}$ do
    $k \leftarrow \lfloor i / \mathcal{E} \rfloor$
    $h_{f}, w_{f} \leftarrow |n - k|$
    for $j$ in range $L_{n}$ do
      $C \leftarrow \{(x_{j}-w_{f}/2, y_{j}+h_{f}/2), (x_{j}+w_{f}/2, y_{j}+h_{f}/2), (x_{j}+w_{f}/2, y_{j}-h_{f}/2), (x_{j}-w_{f}/2, y_{j}-h_{f}/2)\}$
      Create patch $S$ with $C$ of $I_{in}$
      $I' \leftarrow S \otimes I'$
    end for
  end for
  $\hat{I} \leftarrow I'$
  return $\hat{I}$
Algorithm 1: Fiducial Focus Augmentation $(f_{A_{2}})$

The proposed FiFA helps the backbone network learn the underlying facial structure and address difficult test samples, since the patches cover the entire face uniformly over the different joints (eyes, lips, nose and jawline). At the beginning of training, the model is exposed to larger patches as low-confidence regions to concentrate on the joints and eventually, as the model learns progressively with each epoch, smaller patches are introduced as high-confidence regions around the joints. When the patches are removed completely, the model tries to predict the joints with the inductive bias provided by earlier training steps in our augmentation process. Since the patches can be used with any facial variations (such as pose or expression), their integration into the images as augmentations enables the model to learn the inherent facial structures.
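The patch schedule and the patch-drawing step can be sketched in NumPy as below. This is a minimal sketch under stated assumptions rather than the authors' implementation: the names `fifa_patch_size` and `apply_fifa`, the 0-based epoch counter, and the default values of `n` and `interval` are illustrative, and the patch size is clamped to zero once it has shrunk away (so that later epochs use $I'$ unchanged, as in Eq. (6)) instead of taking the absolute value used in Alg. 1.

```python
import numpy as np

def fifa_patch_size(epoch, n=5, interval=2):
    """Patch side length for a 0-based epoch: n, n-1, ..., 1, then 0 (no patch).

    n and interval are illustrative values (assumptions), not the paper's settings.
    """
    return max(n - epoch // interval, 0)

def apply_fifa(image, landmarks, patch):
    """Return a copy of `image` with black patch x patch squares centred on each landmark.

    image: array of shape (H, W) or (H, W, C); landmarks: iterable of (x, y) pixels.
    """
    out = image.copy()
    if patch <= 0:
        return out                                   # schedule finished: plain augmented image
    height, width = out.shape[:2]
    half = patch // 2
    for x, y in np.rint(np.asarray(landmarks, dtype=np.float64)).astype(int):
        # Clip the square to the image so landmarks near the border stay valid.
        x0, x1 = np.clip([x - half, x - half + patch], 0, width)
        y0, y1 = np.clip([y - half, y - half + patch], 0, height)
        out[y0:y1, x0:x1] = 0                        # pixel intensity (0, 0, 0)
    return out
```

In a training loop, one would call `fifa_patch_size(epoch)` once per epoch and pass the result to `apply_fifa` for the view that receives FiFA, after the standard augmentation has been applied to both the image and its landmarks.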

3.3 Matching Two Views

Earlier work on the FLD task has seen limited exploration of Siamese architecture-based training, with the exception of [Bulat et al., 2021]. In this paper, we propose a Siamese architecture-based framework, as illustrated in Fig. 2. The network $f$ takes the two input images $I'$ and $I''$, generated using the two different augmentations $f_{A_{1}}$ and $f_{A}$, respectively. This training scheme using augmentations holds a notable advantage, as CNNs may not be invariant under arbitrary affine transformations. Therefore, even minor variations within the input space may produce significant changes in the output. By optimizing jointly using the Siamese architecture and combining the two predictions, we enhance the robustness and consistency of the predictions under such variations.

To maximize the correlation between two different augmented views, we employ the Deep Canonical Correlation Analysis (DCCA) loss [Andrew et al., 2013] between the high-level representation mappings $f_{1}(I')$ and $f_{2}(I'')$, where $f_{1} = f_{2} = f$. The correlation between these two mappings can be expressed as:

$$\mathrm{corr}(f_{1}(I'), f_{2}(I'')) = \frac{\mathrm{cov}(f_{1}(I'), f_{2}(I''))}{\sqrt{\mathrm{var}(f_{1}(I')) \cdot \mathrm{var}(f_{2}(I''))}}. \tag{7}$$

The DCCA loss (i.e., DCCAsubscript𝐷𝐶𝐶𝐴\mathcal{L}_{DCCA}caligraphic_L start_POSTSUBSCRIPT italic_D italic_C italic_C italic_A end_POSTSUBSCRIPT) is then computed as:

$$\mathcal{L}_{DCCA}=-corr(f_{1}(I^{\prime}),f_{2}(I^{\prime\prime})). \tag{8}$$

The use of the DCCA loss presents three key advantages: (i) correlated representations can partially reconstruct the information in the second view when it is unavailable; (ii) it has the potential to eliminate noise that is uncorrelated across the two views; and (iii) if $f_{1},f_{2}$ capture features that are correlated across the views, they may represent latent aspects of the face. This, in turn, helps the backbone network capture the facial structure in the images.
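A minimal sketch of Eqs. (7) and (8) follows, assuming each feature map is flattened into a vector and that the covariance and variances are estimated over its entries. The function names, the `eps` stabilizer, and the toy inputs are illustrative assumptions, not the training code used in this work. The scalar form mirrors the two equations above rather than the full multi-dimensional objective of Andrew et al.

```python
import numpy as np


def correlation(a, b, eps=1e-8):
    """Scalar correlation of Eq. (7) between two feature arrays (flattened)."""
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    a_centered = a - a.mean()
    b_centered = b - b.mean()
    cov = np.mean(a_centered * b_centered)
    var_a = np.mean(a_centered ** 2)
    var_b = np.mean(b_centered ** 2)
    # eps (an assumption) guards against division by zero for constant inputs.
    return cov / np.sqrt(var_a * var_b + eps)


def dcca_loss(feat_view1, feat_view2):
    """Eq. (8): negative correlation, so minimizing it maximizes correlation."""
    return -correlation(feat_view1, feat_view2)


rng = np.random.default_rng(0)
f1 = rng.normal(size=(4, 16))        # stand-in for f(I')
f2_aligned = 2.0 * f1 + 1.0          # perfectly correlated stand-in for f(I'')
f2_opposed = -f1                     # perfectly anti-correlated stand-in

print(dcca_loss(f1, f2_aligned))     # close to -1
print(dcca_loss(f1, f2_opposed))     # close to +1
```

Two views whose features agree up to scale and offset push the loss towards -1, while anti-correlated features push it towards +1, which is the behaviour the minimization in Eq. (8) relies on.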

3.4 Architectural Details

In the proposed framework, we employ a transformer-based architecture (a pre-trained ViT-B/16 [Zheng et al.(2022)Zheng, Yang, Zhang, Bao, Chen, Huang, Yuan, Chen, Zeng, and Wen] consisting of 12 layers with a width of 768) as the backbone. To enhance its performance further, we incorporate three custom CNN-based hourglass modules after every four layers of the transformer network. The purpose of these modules is to introduce desirable properties of CNNs, such as shift, scale, and distortion invariance, into the ViT architecture while retaining the characteristics of transformers, i.e., dynamic attention, global context, and better generalization. This results in a robust backbone network (Transformer + CNN) that learns facial structures effectively.
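The placement of the hourglass modules can be illustrated with a small sketch. It assumes one reading of the description above: a 12-layer transformer with a single hourglass module inserted after layers 4, 8 and 12, for three modules in total. The function name and the string labels are hypothetical, and the sketch shows only the ordering of modules, not how features flow between them.

```python
def interleaved_schedule(num_layers=12, every=4):
    """List module names in execution order, with an hourglass after each group."""
    schedule = []
    hourglass_count = 0
    for idx in range(1, num_layers + 1):
        schedule.append(f"transformer_layer_{idx}")
        if idx % every == 0:
            # Assumed placement: one hourglass closes each block of `every` layers.
            hourglass_count += 1
            schedule.append(f"hourglass_{hourglass_count}")
    return schedule


for name in interleaved_schedule():
    print(name)
```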

Pooling layers in CNNs often provide a certain degree of shift invariance. However, in our task, it is imperative to avoid the loss of structural information caused by pooling layers. We therefore adopt the Anti-aliased CNN [Zhang(2019)] in our hourglass modules, hereafter referred to as the Anti-aliased Hourglass. The combination of these components improves the ability of our network to generate high-quality heatmaps. Nevertheless, the upsampling + concatenation (U+A) operation in the hourglass modules may introduce some high-frequency noise. To mitigate this negative impact and filter the features in the Fourier space, we integrate an FF-Parser layer [Wu et al.(2022)Wu, Fang, Zhang, Yang, and Xu] after each U+A operation in the hourglass modules. We provide ablation studies on these components in our results to demonstrate their usefulness.
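The two ideas in this paragraph, low-pass filtering before subsampling and filtering features in Fourier space, can be sketched on a single 2D map as below. This is a simplified illustration under stated assumptions, not the layers used in our network: the 3x3 binomial kernel, reflect padding, and function names are choices made here, and the frequency weights are a fixed hand-made mask, whereas a trainable layer would learn them.

```python
import numpy as np


def blur_pool_2d(x, stride=2):
    """Blur a 2D map with a 3x3 binomial kernel, then subsample by `stride`."""
    x = np.asarray(x, dtype=np.float64)
    k1d = np.array([1.0, 2.0, 1.0])
    kernel = np.outer(k1d, k1d)
    kernel /= kernel.sum()                      # weights sum to 1
    padded = np.pad(x, 1, mode="reflect")
    h, w = x.shape
    blurred = np.zeros((h, w), dtype=np.float64)
    for dy in range(3):
        for dx in range(3):
            blurred += kernel[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return blurred[::stride, ::stride]


def fourier_filter(x, weight):
    """Scale every 2D frequency component of x by `weight`, then invert."""
    spectrum = np.fft.fft2(x)
    return np.real(np.fft.ifft2(spectrum * weight))


feature = np.arange(64, dtype=np.float64).reshape(8, 8)
print(blur_pool_2d(feature).shape)              # (4, 4)

keep_dc_only = np.zeros((8, 8))
keep_dc_only[0, 0] = 1.0                        # keep only the zero-frequency term
smoothed = fourier_filter(feature, keep_dc_only)
print(np.allclose(smoothed, feature.mean()))    # True
```

Blurring before subsampling removes the frequencies that would otherwise alias, and the frequency mask shows how high-frequency content can be attenuated after an upsampling step.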

4 Experiments and Results

This section discusses the implementation details, comparison with SOTA methods on benchmark datasets and ablation analysis of the introduced components of the proposed method.

Implementation Details: The proposed method is trained and tested on several benchmark datasets, i.e., WFLW [WFL()], 300W [Sagonas et al.(2016)Sagonas, Antonakos, Tzimiropoulos, Zafeiriou, and Pantic], COFW [Burgos-Artizzu et al.(2013)Burgos-Artizzu, Perona, and Dollár] and AFLW [Köstinger et al.(2011)Köstinger, Wohlhart, Roth, and Bischof]. Details of these datasets are discussed in the Supplementary material. During the training phase, the input image is cropped and resized to $512\times 512$. The output feature map size of every hourglass module is set to $128\times 128$, which is $4\times$ smaller than the input image size. The ground truth heatmaps are generated by a Gaussian with $\sigma=1.5$ and radius $r=5$. During training, we use AdamW [Loshchilov and Hutter(2017)] to optimize our network with an initial learning rate of $1\times 10^{-4}$ and train for 250 epochs. Apart from the proposed augmentation (FiFA), other standard data augmentations ($f_{A_{1}}$) are employed at training time, such as random masking, bilinear interpolation, random occlusion, random gray, random gamma, random blur, and noise fusion. For effective learning, along with the DCCA loss (i.e., $\mathcal{L}_{DCCA}$), we also employ the standard BCE loss (i.e., $\mathcal{L}_{BCE}$) and mean absolute error loss (i.e., $\mathcal{L}_{L1}$) for heatmap and coordinate regression, respectively, with equal weights (i.e., 1.0).
For evaluation, we use the standard evaluation metrics, i.e., Normalized Mean Error ($NME$) variants (i.e., $NME_{ic}$, $NME_{box}$, $NME_{diag}$), Failure Rate ($FR^{10}_{ic}$), and Area Under the Curve ($AUC_{box}$). Detailed definitions of these metrics are given in the Supplementary material. For comparison, we choose recent baselines such as FaRL [Zheng et al.(2022)Zheng, Yang, Zhang, Bao, Chen, Huang, Yuan, Chen, Zeng, and Wen], ADNet [Huang et al.(2021)Huang, Yang, Li, Kim, and Wei], SH-FAN [Bulat et al.(2021)Bulat, Sanchez, and Tzimiropoulos], PropNet [Huang et al.(2020)Huang, Deng, Shen, Zhang, and Ye], HIH [Lan et al.(2021)Lan, Hu, Chen, Xue, and Cheng], SLPT [Xia et al.(2022)Xia, Qu, Huang, Zhang, Wang, and Xu], PicassoNet [Wen et al.(2022)Wen, Ding, Yao, Wang, and Qian] and DTLD [Li et al.(2022)Li, Guo, Rhee, Han, and Han]. All experiments were implemented in PyTorch, and the network was trained on 4 GPUs (40GB NVIDIA A100) with a batch size of 5 per GPU.
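The ground-truth heatmap construction described in the implementation details can be sketched as follows. Only the sizes (512 and 128), $\sigma=1.5$ and $r=5$ come from the text. The square truncation window, the unit peak value, the rounding of landmark coordinates, and the names `gaussian_heatmap` and `to_heatmap_coords` are assumptions made for illustration.

```python
import numpy as np


def to_heatmap_coords(x, y, input_size=512, heatmap_size=128):
    """Map a landmark from input-image pixels to heatmap pixels (4x smaller)."""
    scale = heatmap_size / input_size
    return int(round(x * scale)), int(round(y * scale))


def gaussian_heatmap(cx, cy, size=128, sigma=1.5, radius=5):
    """Render one landmark at integer pixel (cx, cy) as a truncated Gaussian."""
    heatmap = np.zeros((size, size), dtype=np.float64)
    if not (0 <= cx < size and 0 <= cy < size):
        return heatmap                          # landmark outside the map: leave empty
    y0, y1 = max(cy - radius, 0), min(cy + radius, size - 1)
    x0, x1 = max(cx - radius, 0), min(cx + radius, size - 1)
    ys, xs = np.mgrid[y0:y1 + 1, x0:x1 + 1]
    squared_dist = (xs - cx) ** 2 + (ys - cy) ** 2
    heatmap[y0:y1 + 1, x0:x1 + 1] = np.exp(-squared_dist / (2.0 * sigma ** 2))
    return heatmap


cx, cy = to_heatmap_coords(256, 100)            # landmark given in 512x512 pixels
hm = gaussian_heatmap(cx, cy)
print(hm.shape, hm[cy, cx])                     # (128, 128) 1.0
```

Pixels further than `radius` from the landmark along either axis stay at zero, so each landmark contributes a compact $11\times 11$ blob to its $128\times 128$ target.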

4.1 Result Analysis

Table 1: Comparison against the state-of-the-art on the COFW, 300W and AFLW datasets. Best result is bolded and second best result is underlined.

| Method | Remarks | COFW $NME_{ic}\downarrow$ | COFW $FR^{10}_{ic}\downarrow$ | 300W $NME_{ic}\downarrow$ (Full) | 300W $NME_{ic}\downarrow$ (Common) | 300W $NME_{ic}\downarrow$ (Challenge) | AFLW $NME_{diag}\downarrow$ (Full) | AFLW $NME_{diag}\downarrow$ (Frontal) | AFLW $NME_{box}\downarrow$ (Full) | AFLW $AUC_{box}\uparrow$ (Full) |
|---|---|---|---|---|---|---|---|---|---|---|
| FaRL [Zheng et al.(2022)Zheng, Yang, Zhang, Bao, Chen, Huang, Yuan, Chen, Zeng, and Wen] | CVPR ’22 | 3.11 | 0.12 | 2.93 | 2.56 | 4.45 | 0.94 | 0.82 | 1.33 | 81.3 |
| ADNet [Huang et al.(2021)Huang, Yang, Li, Kim, and Wei] | ICCV ’21 | 4.68 | 0.59 | 2.93 | 2.53 | 4.58 | - | - | - | - |
| SH-FAN [Bulat et al.(2021)Bulat, Sanchez, and Tzimiropoulos] | BMVC ’21 | 3.02 | 0.00 | 2.94 | 2.61 | 4.13 | 1.31 | 1.12 | 2.14 | 70.0 |
| PropNet [Huang et al.(2020)Huang, Deng, Shen, Zhang, and Ye] | CVPR ’20 | 3.71 | 0.20 | 2.93 | 2.67 | 3.99 | - | - | - | - |
| HIH [Lan et al.(2021)Lan, Hu, Chen, Xue, and Cheng] | ICCVW ’21 | 3.21 | 0.00 | 3.09 | 2.65 | 4.89 | - | - | - | - |
| SLPT [Xia et al.(2022)Xia, Qu, Huang, Zhang, Wang, and Xu] | CVPR ’22 | 3.32 | 0.59 | 3.17 | 2.75 | 4.90 | - | - | - | - |
| DTLD [Li et al.(2022)Li, Guo, Rhee, Han, and Han] | CVPR ’22 | 3.02 | - | 2.96 | 2.60 | 4.48 | 1.37 | - | - | - |
| PicassoNet [Wen et al.(2022)Wen, Ding, Yao, Wang, and Qian] | TNNLS ’22 | - | - | 3.58 | 3.03 | 5.81 | 1.59 | 1.30 | - | - |
| FiFA (Ours) | | 2.96 | 0.00 | 2.89 | 2.51 | 4.47 | 0.92 | 0.80 | 1.31 | 81.8 |

Table 2: Comparison against the state-of-the-art on WFLW testset. Best result is bolded and second best result is underlined.

$NME_{ic}$(%)$\downarrow$

| Model | Remarks | Fullset | Pose | Expression | Illumination | Make Up | Occlusion | Blur |
|---|---|---|---|---|---|---|---|---|
| FaRL [Zheng et al.(2022)Zheng, Yang, Zhang, Bao, Chen, Huang, Yuan, Chen, Zeng, and Wen] | CVPR’22 | 3.99 | 6.61 | 4.18 | 3.90 | 3.84 | 4.71 | 4.53 |
| ADNet [Huang et al.(2021)Huang, Yang, Li, Kim, and Wei] | ICCV’21 | 4.14 | 6.96 | 4.38 | 4.09 | 4.05 | 5.06 | 4.79 |
| SH-FAN [Bulat et al.(2021)Bulat, Sanchez, and Tzimiropoulos] | BMVC’21 | 3.72 | - | - | - | - | - | - |
| PropNet [Huang et al.(2020)Huang, Deng, Shen, Zhang, and Ye] | CVPR’20 | 4.05 | 6.92 | 3.87 | 4.07 | 3.76 | 4.58 | 4.36 |
| HIH [Lan et al.(2021)Lan, Hu, Chen, Xue, and Cheng] | ICCVW’21 | 4.08 | 6.87 | 4.06 | 4.34 | 3.85 | 4.85 | 4.66 |
| SLPT [Xia et al.(2022)Xia, Qu, Huang, Zhang, Wang, and Xu] | CVPR’22 | 4.14 | 6.96 | 4.45 | 4.05 | 4.00 | 5.06 | 4.79 |
| DTLD [Li et al.(2022)Li, Guo, Rhee, Han, and Han] | CVPR’22 | 4.05 | - | - | - | - | - | - |
| PicassoNet [Wen et al.(2022)Wen, Ding, Yao, Wang, and Qian] | TNNLS’22 | 4.82 | 8.61 | 5.14 | 4.73 | 4.68 | 5.91 | 5.56 |
| FiFA (Ours) | | 3.89 | 6.47 | 4.09 | 3.80 | 3.76 | 4.63 | 4.43 |

$FR_{ic}^{10}$(%)$\downarrow$

| Model | Remarks | Fullset | Pose | Expression | Illumination | Make Up | Occlusion | Blur |
|---|---|---|---|---|---|---|---|---|
| FaRL [Zheng et al.(2022)Zheng, Yang, Zhang, Bao, Chen, Huang, Yuan, Chen, Zeng, and Wen] | CVPR’22 | 1.76 | - | - | - | - | - | - |
| ADNet [Huang et al.(2021)Huang, Yang, Li, Kim, and Wei] | ICCV’21 | 2.72 | 12.72 | 2.15 | 2.44 | 1.94 | 5.79 | 3.54 |
| SH-FAN [Bulat et al.(2021)Bulat, Sanchez, and Tzimiropoulos] | BMVC’21 | 1.55 | - | - | - | - | - | - |
| PropNet [Huang et al.(2020)Huang, Deng, Shen, Zhang, and Ye] | CVPR’20 | 2.96 | 12.58 | 2.55 | 2.44 | 1.46 | 5.16 | 3.75 |
| HIH [Lan et al.(2021)Lan, Hu, Chen, Xue, and Cheng] | ICCVW’21 | 2.60 | 12.88 | 1.27 | 2.43 | 1.45 | 5.16 | 3.10 |
| SLPT [Xia et al.(2022)Xia, Qu, Huang, Zhang, Wang, and Xu] | CVPR’22 | 2.76 | 12.72 | 2.23 | 1.86 | 3.40 | 5.98 | 3.88 |
| DTLD [Li et al.(2022)Li, Guo, Rhee, Han, and Han] | CVPR’22 | 2.68 | - | - | - | - | - | - |
| PicassoNet [Wen et al.(2022)Wen, Ding, Yao, Wang, and Qian] | TNNLS’22 | 5.64 | 25.46 | 5.10 | 4.30 | 5.34 | 10.59 | 7.12 |
| FiFA (Ours) | | 1.60 | 7.05 | 1.27 | 1.43 | 1.45 | 3.39 | 1.94 |

Comparison on COFW: In Table 1, we present a comparison of the proposed FiFA approach with existing SOTA methods on the COFW testset, which is a well-known benchmark for heavy occlusion and a wide range of head pose variation. It is noteworthy that the proposed FiFA model outperforms the existing SOTA methods. The leading $NME_{ic}$ and $0\%$ $FR_{ic}^{10}$ demonstrate its robustness against extreme situations.
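For readers unfamiliar with the two COFW metrics quoted above, a hedged sketch of how they are commonly computed is given below. The exact definitions used in this work are in the Supplementary material. Here the normalizing distance is passed in explicitly rather than derived from specific landmarks, and the 10% failure threshold is inferred from the superscript in $FR_{ic}^{10}$. The function names are illustrative.

```python
import numpy as np


def nme_per_image(pred, gt, norm_dist):
    """Mean Euclidean landmark error of one image, divided by norm_dist."""
    pred = np.asarray(pred, dtype=np.float64)   # shape (num_landmarks, 2)
    gt = np.asarray(gt, dtype=np.float64)       # shape (num_landmarks, 2)
    errors = np.linalg.norm(pred - gt, axis=1)
    return errors.mean() / norm_dist


def failure_rate(nmes, threshold=0.10):
    """Fraction of images whose NME exceeds the threshold."""
    nmes = np.asarray(nmes, dtype=np.float64)
    return float(np.mean(nmes > threshold))


gt = [[0.0, 0.0], [10.0, 0.0]]
pred = [[0.0, 1.0], [10.0, 1.0]]                # each landmark is off by 1 pixel
print(nme_per_image(pred, gt, norm_dist=10.0))  # 0.1
print(failure_rate([0.03, 0.05, 0.25, 0.04]))   # 0.25
```

Under this definition, a failure rate of 0% means that no test image has a normalized error above the threshold.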

Comparison on 300W: On the 300W dataset, our approach exhibits superior performance in comparison to SOTA methods in terms of $NME_{ic}$, as shown in Table 1. On the challenge set, the proposed approach performs slightly worse than the PropNet [Huang et al.(2020)Huang, Deng, Shen, Zhang, and Ye] and SH-FAN [Bulat et al.(2021)Bulat, Sanchez, and Tzimiropoulos] methods. However, it achieves SOTA results in the other scenarios (i.e., full set and common set), which suggests that our method makes plausible predictions even in difficult conditions.

Comparison on AFLW: The results on the AFLW testset are presented in Table 1. Adhering to the evaluation protocol adopted in [Zheng et al.(2022)Zheng, Yang, Zhang, Bao, Chen, Huang, Yuan, Chen, Zeng, and Wen], we report comparisons in terms of $NME_{diag}$, $NME_{box}$ and $AUC_{box}^{7}$. The table indicates that our approach outperforms the previous SOTA results, even though the dataset is almost saturated.

Figure 3: Qualitative results on WFLW testset. Landmarks shown in green are produced by our method, while the ones in red by the state-of-the-art approach of [Zheng et al.(2022)Zheng, Yang, Zhang, Bao, Chen, Huang, Yuan, Chen, Zeng, and Wen].

Comparison on WFLW: In Table 2, we compare results in terms of $NME_{ic}$ and $FR_{ic}^{10}$. Here, it is observed that the proposed FiFA approach obtains better $NME_{ic}$ for the Pose, Illumination and Make Up subsets. Additionally, in terms of $FR_{ic}^{10}$, the proposed approach achieves better performance in all subsets, i.e., Pose, Expression, Illumination, Make Up, Occlusion, and Blur, by 44%, 41%, 23%, 1%, 34%, and 37.4%, respectively, over the previous best-performing SOTA methods. These results show that our method improves the accuracy in challenging scenarios while also reducing the overall failure rate for difficult images. Moreover, Fig. 3 visually conveys that the proposed approach delivers more precise landmarks in challenging scenarios.
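The percentage gains quoted for $FR_{ic}^{10}$ appear consistent with a relative reduction in failure rate with respect to the best competing entry in Table 2. The sketch below reproduces four of them under that assumption. The helper name is illustrative, and the pairs of values are read directly from the Pose, Illumination, Occlusion and Blur columns of Table 2.

```python
def relative_reduction(previous_best, ours):
    """Relative reduction (in percent) of a lower-is-better metric."""
    return 100.0 * (previous_best - ours) / previous_best


# (best competing FR, FiFA FR) taken from Table 2.
examples = {
    "Pose": (12.58, 7.05),
    "Illumination": (1.86, 1.43),
    "Occlusion": (5.16, 3.39),
    "Blur": (3.10, 1.94),
}

for subset, (previous_best, ours) in examples.items():
    print(f"{subset}: {relative_reduction(previous_best, ours):.1f}%")
```

For example, the Pose failure rate drops from 12.58 to 7.05, a relative reduction of about 44%.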

4.2 Ablation Studies & Analysis

This section presents the ablation analysis carried out to establish the efficacy of the proposed framework. To ensure a fair comparison, all experiments were performed on the COFW dataset.

Effects of method’s components: Here, we investigate the impact of each component of the proposed framework. The results, presented in Table 3, reveal that the baseline network, i.e., the vanilla backbone (ViT-B/16), attains an $NME_{ic}$ of 3.11 when trained solely with standard augmentations, i.e., $f_{A_{1}}$. When anti-aliased CNN-based hourglass modules are incorporated into the baseline, $NME_{ic}$ improves to 3.07. By employing the proposed augmentation, $f_{A_{2}}$, on the input images during training, a further performance boost is achieved, with an $NME_{ic}$ of 3.00. The best (lowest) $NME_{ic}$ of 2.96 is attained when incorporating the Siamese training approach with DCCA loss on both $f_{A_{1}}$ and $f_{A_{2}}$ augmented images. This finding demonstrates that training the backbone with all proposed components gives the best performance.

Table 3: Effect of method’s components on COFW.

| Method | $NME_{ic}$(%)$\downarrow$ |
|---|---|
| Vanilla backbone (ViT-B/16) | 3.11 |
| + anti-aliased CNN-based hourglass | 3.07 |
| + Fiducial Focus Augmentation | 3.00 |
| + Siamese training (w DCCA) | 2.96 |

Table 4: Effect of patch sizes in FiFA on COFW.

| FiFA patch progression | $NME_{ic}$(%)$\downarrow$ |
|---|---|
| $3\times 3\rightarrow\cdots\rightarrow 1\times 1\rightarrow$ no patch | 3.05 |
| $4\times 4\rightarrow\cdots\rightarrow 1\times 1\rightarrow$ no patch | 3.00 |
| $5\times 5\rightarrow\cdots\rightarrow 1\times 1\rightarrow$ no patch | 2.96 |
| $6\times 6\rightarrow\cdots\rightarrow 1\times 1\rightarrow$ no patch | 2.99 |
| $7\times 7\rightarrow\cdots\rightarrow 1\times 1\rightarrow$ no patch | 3.02 |

Table 5: Effect of FiFA over standard augmentations on COFW. BI = Bilinear Interpolation; RM = Random Masking; RO = Random Occlusion; RGr = Random Gray; RGm = Random Gamma; RB = Random Blur; NF = noise fusion.

| Augmentations | $NME_{ic}$(%)$\downarrow$ |
|---|---|
| RM + RO | 3.15 |
| + FiFA | 3.08 |
| RM + {RO, RGr} | 3.12 |
| + FiFA | 3.07 |
| RM + {RO, RGr, RGm} | 3.10 |
| + FiFA | 3.04 |
| RM + {RO, RGr, RGm, RB} | 3.10 |
| + FiFA | 3.04 |
| RM + BI + {RO, RGr, RGm, RB} | 3.08 |
| + FiFA | 3.03 |
| RM + BI + NF + {RO, RGr, RGm, RB} | 3.07 |
| + FiFA | 3.00 |

Effects of fiducial mask sizes: We conducted a series of experiments to determine the optimal initial patch size for the proposed FiFA. As shown in Table 4, a patch size of $5\times 5$ yields the best $NME_{ic}$ of 2.96, while deviating from this size leads to a deterioration in performance. This can be attributed to the fact that during the initial stages of training, when the network weights are not yet sufficiently tuned, a patch size that is either too large or too small will result in a confidence region that is either too broad or too narrow for the network to focus on the landmarks. This, in turn, has an adverse effect on the learning process and ultimately on the performance of the network.

Effect of FiFA over standard augmentations: Several experiments were conducted to demonstrate the effectiveness of our proposed FiFA over other standard augmentations. Because only one view of the augmented images is available in this setting, all these experiments were performed without the Siamese-based training mechanism. Table 5 displays the results obtained in terms of $NME_{ic}$ on the COFW testset. The inclusion of our proposed FiFA alongside standard augmentation techniques consistently lowers the $NME_{ic}$ value, by 0.05 to 0.07 in every configuration.

Comparison with other losses in Siamese training: We employ the DCCA loss [Andrew et al.(2013)Andrew, Arora, Bilmes, and Livescu] in Siamese training to maximize the correlation between different views. To demonstrate the efficacy of the DCCA loss, we conducted several experiments with different losses (i.e., L2, L1, Smooth L1, and Wing loss [Feng et al.(2018)Feng, Kittler, Awais, Huber, and Wu]), and the corresponding results are presented in Table 6. One can observe that the DCCA loss yields a better $NME_{ic}$, an improvement of approximately 3% (3.05 to 2.96) over the next-best Wing loss.

Table 6: Effect of different losses in Siamese training on COFW.

| Loss | L2 | L1 | Smooth L1 | Wing [Feng et al.(2018)Feng, Kittler, Awais, Huber, and Wu] | DCCA [Andrew et al.(2013)Andrew, Arora, Bilmes, and Livescu] |
|---|---|---|---|---|---|
| $NME_{ic}$(%)$\downarrow$ | 3.14 | 3.09 | 3.11 | 3.05 | 2.96 |

Effectiveness of the proposed components to other SOTA methods: To validate the effectiveness of the proposed components, we conducted a series of experiments in which the proposed FiFA augmentation and the Siamese network based DCCA loss were applied to other baseline methods such as HRNet [Wang et al.(2020)Wang, Sun, Cheng, Jiang, Deng, Zhao, Liu, Mu, Tan, Wang, et al.], ADNet [Huang et al.(2021)Huang, Yang, Li, Kim, and Wei], SH-FAN backbone [Bulat et al.(2021)Bulat, Sanchez, and Tzimiropoulos], FaRL [Zheng et al.(2022)Zheng, Yang, Zhang, Bao, Chen, Huang, Yuan, Chen, Zeng, and Wen], and SLPT [Xia et al.(2022)Xia, Qu, Huang, Zhang, Wang, and Xu]. The corresponding results are summarized in Table 7. The proposed FiFA augmentation technique improved the performance of every baseline method, and the Siamese network based DCCA loss improved the NME score further. This indicates the generalization capability of our method.

Table 7: Effect of proposed FiFA augmentation technique and Siamese-based DCCA loss on baseline methods on COFW testset.

| Methods | Remarks | Baseline | + FiFA | + FiFA + Siamese training (w DCCA) |
|---|---|---|---|---|
| HRNet [Wang et al.(2020)Wang, Sun, Cheng, Jiang, Deng, Zhao, Liu, Mu, Tan, Wang, et al.] | $ICCV_{21}$ | 3.45 | 3.32 | 3.28 |
| ADNet [Huang et al.(2021)Huang, Yang, Li, Kim, and Wei] | $ICCV_{21}$ | 4.68 | 4.51 | 4.45 |
| SH-FAN Backbone [Bulat et al.(2021)Bulat, Sanchez, and Tzimiropoulos] | $BMVC_{21}$ | 3.25 | 3.12 | 3.07 |
| FaRL [Zheng et al.(2022)Zheng, Yang, Zhang, Bao, Chen, Huang, Yuan, Chen, Zeng, and Wen] | $CVPR_{22}$ | 3.11 | 3.04 | 3.01 |
| SLPT [Xia et al.(2022)Xia, Qu, Huang, Zhang, Wang, and Xu] | $CVPR_{22}$ | 3.32 | 3.15 | 3.10 |

5 Conclusion & Future Work

In this paper, we proposed a simple yet effective image augmentation technique called Fiducial Focus Augmentation (FiFA) for the facial landmark detection task. Integrating FiFA during training enhanced the accuracy of the proposed approach on testing benchmarks without extreme modifications to its backbone network or loss function. Our findings suggest that employing FiFA as an image augmentation technique, in conjunction with Siamese-based training with the DCCA loss, results in state-of-the-art performance. Additionally, we employed an anti-aliased CNN-based hourglass network with ViT as our backbone network to address shift invariance and noise. We performed extensive experimentation and ablation studies to validate the effectiveness of the proposed approach. In future work, FiFA can be studied further and extended to other face-related tasks.

References

  • [WFL()] Look at boundary: A boundary-aware face alignment algorithm. https://wywu.github.io/projects/LAB/WFLW.html.
  • [Andrew et al.(2013)Andrew, Arora, Bilmes, and Livescu] Galen Andrew, Raman Arora, Jeff Bilmes, and Karen Livescu. Deep canonical correlation analysis. In International conference on machine learning, pages 1247–1255. PMLR, 2013.
  • [Bulat and Tzimiropoulos(2017)] Adrian Bulat and Georgios Tzimiropoulos. How far are we from solving the 2d & 3d face alignment problem? (and a dataset of 230,000 3d facial landmarks). In Proceedings of the IEEE international conference on computer vision, pages 1021–1030, 2017.
  • [Bulat et al.(2021)Bulat, Sanchez, and Tzimiropoulos] Adrian Bulat, Enrique Sanchez, and Georgios Tzimiropoulos. Subpixel heatmap regression for facial landmark localization. arXiv preprint arXiv:2111.02360, 2021.
  • [Burgos-Artizzu et al.(2013)Burgos-Artizzu, Perona, and Dollár] Xavier P. Burgos-Artizzu, Pietro Perona, and Piotr Dollár. Robust face landmark estimation under occlusion. In 2013 IEEE International Conference on Computer Vision, pages 1513–1520, 2013. 10.1109/ICCV.2013.191.
  • [Deng et al.(2019)Deng, Trigeorgis, Zhou, and Zafeiriou] Jiankang Deng, George Trigeorgis, Yuxiang Zhou, and Stefanos Zafeiriou. Joint multi-view face alignment in the wild. IEEE Transactions on Image Processing, 28(7):3636–3648, 2019.
  • [Dong et al.(2018)Dong, Yan, Ouyang, and Yang] Xuanyi Dong, Yan Yan, Wanli Ouyang, and Yi Yang. Style aggregated network for facial landmark detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 379–388, 2018.
  • [Dosovitskiy et al.(2020)Dosovitskiy, Beyer, Kolesnikov, Weissenborn, Zhai, Unterthiner, Dehghani, Minderer, Heigold, Gelly, et al.] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
  • [Fabian Benitez-Quiroz et al.(2016)Fabian Benitez-Quiroz, Srinivasan, and Martinez] C Fabian Benitez-Quiroz, Ramprakash Srinivasan, and Aleix M Martinez. Emotionet: An accurate, real-time algorithm for the automatic annotation of a million facial expressions in the wild. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5562–5570, 2016.
  • [Feng et al.(2018)Feng, Kittler, Awais, Huber, and Wu] Zhen-Hua Feng, Josef Kittler, Muhammad Awais, Patrik Huber, and Xiao-Jun Wu. Wing loss for robust facial landmark localisation with convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2235–2245, 2018.
  • [Hacohen and Weinshall(2019)] Guy Hacohen and Daphna Weinshall. On the power of curriculum learning in training deep networks. In International Conference on Machine Learning, pages 2535–2544. PMLR, 2019.
  • [Hassner et al.(2015)Hassner, Harel, Paz, and Enbar] Tal Hassner, Shai Harel, Eran Paz, and Roee Enbar. Effective face frontalization in unconstrained images. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4295–4304, 2015.
  • [He et al.(2016)He, Zhang, Ren, and Sun] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016. arXiv preprint arXiv:1512.03385.
  • [Huang et al.(2020)Huang, Deng, Shen, Zhang, and Ye] Xiehe Huang, Weihong Deng, Haifeng Shen, Xiubao Zhang, and Jieping Ye. Propagationnet: Propagate points to curve to learn structure information. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7265–7274, 2020.
  • [Huang et al.(2021)Huang, Yang, Li, Kim, and Wei] Yangyu Huang, Hao Yang, Chong Li, Jongyoo Kim, and Fangyun Wei. Adnet: Leveraging error-bias towards normal direction in face alignment. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3080–3090, 2021.
  • [Kittler et al.(2016)Kittler, Huber, Feng, Hu, and Christmas] Josef Kittler, Patrik Huber, Zhen-Hua Feng, Guosheng Hu, and William Christmas. 3d morphable face models and their applications. In Articulated Motion and Deformable Objects: 9th International Conference, AMDO 2016, Palma de Mallorca, Spain, July 13-15, 2016, Proceedings 9, pages 185–206. Springer, 2016.
  • [Koppen et al.(2018)Koppen, Feng, Kittler, Awais, Christmas, Wu, and Yin] Paul Koppen, Zhen-Hua Feng, Josef Kittler, Muhammad Awais, William Christmas, Xiao-Jun Wu, and He-Feng Yin. Gaussian mixture 3d morphable face model. Pattern Recognition, 74:617–628, 2018.
  • [Kumar et al.(2020)Kumar, Marks, Mou, Wang, Jones, Cherian, Koike-Akino, Liu, and Feng] Abhinav Kumar, Tim K Marks, Wenxuan Mou, Ye Wang, Michael Jones, Anoop Cherian, Toshiaki Koike-Akino, Xiaoming Liu, and Chen Feng. Luvli face alignment: Estimating landmarks’ location, uncertainty, and visibility likelihood. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8236–8246, 2020.
  • [Köstinger et al.(2011)Köstinger, Wohlhart, Roth, and Bischof] Martin Köstinger, Paul Wohlhart, Peter M. Roth, and Horst Bischof. Annotated facial landmarks in the wild: A large-scale, real-world database for facial landmark localization. In 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops), pages 2144–2151, 2011. 10.1109/ICCVW.2011.6130513.
  • [Lan et al.(2021)Lan, Hu, Chen, Xue, and Cheng] Xing Lan, Qinghao Hu, Qiang Chen, Jian Xue, and Jian Cheng. Hih: Towards more accurate face alignment via heatmap in heatmap. arXiv preprint arXiv:2104.03100, 2021.
  • [Li et al.(2022)Li, Guo, Rhee, Han, and Han] Hui Li, Zidong Guo, Seon-Min Rhee, Seungju Han, and Jae-Joon Han. Towards accurate facial landmark detection via cascaded transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4176–4185, 2022.
  • [Li et al.(2017)Li, Deng, and Du] Shan Li, Weihong Deng, and Junping Du. Reliable crowdsourcing and deep locality-preserving learning for expression recognition in the wild. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2852–2861, 2017.
  • [Loshchilov and Hutter(2017)] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
  • [Lv et al.(2017)Lv, Shao, Xing, Cheng, and Zhou] Jiangjing Lv, Xiaohu Shao, Junliang Xing, Cheng Cheng, and Xi Zhou. A deep regression architecture with two-stage re-initialization for high performance facial landmark detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3317–3326, 2017.
  • [Masi et al.(2016)Masi, Rawls, Medioni, and Natarajan] Iacopo Masi, Stephen Rawls, Gérard Medioni, and Prem Natarajan. Pose-aware face recognition in the wild. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4838–4846, 2016.
  • [Newell et al.(2016)Newell, Yang, and Deng] Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked hourglass networks for human pose estimation. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part VIII 14, pages 483–499. Springer, 2016.
  • [Roth et al.(2016)Roth, Tong, and Liu] Joseph Roth, Yiying Tong, and Xiaoming Liu. Adaptive 3d face reconstruction from unconstrained photo collections. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4197–4206, 2016.
  • [Sagonas et al.(2016)Sagonas, Antonakos, Tzimiropoulos, Zafeiriou, and Pantic] Christos Sagonas, Epameinondas Antonakos, Georgios Tzimiropoulos, Stefanos Zafeiriou, and Maja Pantic. 300 faces in-the-wild challenge: Database and results. Image and vision computing, 47:3–18, 2016.
  • [Sun et al.(2019)Sun, Zhao, Jiang, Cheng, Xiao, Liu, Mu, Wang, Liu, and Wang] Ke Sun, Yang Zhao, Borui Jiang, Tianheng Cheng, Bin Xiao, Dong Liu, Yadong Mu, Xinggang Wang, Wenyu Liu, and Jingdong Wang. High-resolution representations for labeling pixels and regions. arXiv preprint arXiv:1904.04514, 2019.
  • [Sun et al.(2013)Sun, Wang, and Tang] Yi Sun, Xiaogang Wang, and Xiaoou Tang. Deep convolutional network cascade for facial point detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3476–3483, 2013.
  • [Taigman et al.(2014)Taigman, Yang, Ranzato, and Wolf] Yaniv Taigman, Ming Yang, Marc’Aurelio Ranzato, and Lior Wolf. Deepface: Closing the gap to human-level performance in face verification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1701–1708, 2014.
  • [Toshev and Szegedy(2014)] Alexander Toshev and Christian Szegedy. Deeppose: Human pose estimation via deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1653–1660, 2014.
  • [Trigeorgis et al.(2016)Trigeorgis, Snape, Nicolaou, Antonakos, and Zafeiriou] George Trigeorgis, Patrick Snape, Mihalis A Nicolaou, Epameinondas Antonakos, and Stefanos Zafeiriou. Mnemonic descent method: A recurrent process applied for end-to-end face alignment. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4177–4187, 2016.
  • [Walecki et al.(2016)Walecki, Rudovic, Pavlovic, and Pantic] Robert Walecki, Ognjen Rudovic, Vladimir Pavlovic, and Maja Pantic. Copula ordinal regression for joint estimation of facial action unit intensity. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4902–4910, 2016.
  • [Wang et al.(2020)Wang, Sun, Cheng, Jiang, Deng, Zhao, Liu, Mu, Tan, Wang, et al.] Jingdong Wang, Ke Sun, Tianheng Cheng, Borui Jiang, Chaorui Deng, Yang Zhao, Dong Liu, Yadong Mu, Mingkui Tan, Xinggang Wang, et al. Deep high-resolution representation learning for visual recognition. IEEE transactions on pattern analysis and machine intelligence, 43(10):3349–3364, 2020.
  • [Wang et al.(2019)Wang, Bo, and Fuxin] Xinyao Wang, Liefeng Bo, and Li Fuxin. Adaptive wing loss for robust face alignment via heatmap regression. In Proceedings of the IEEE International Conference on Computer Vision, pages 6971–6981, 2019.
  • [Wei et al.(2016)Wei, Ramakrishna, Kanade, and Sheikh] Shih-En Wei, Varun Ramakrishna, Takeo Kanade, and Yaser Sheikh. Convolutional pose machines. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 4724–4732, 2016.
  • [Wen et al.(2022)Wen, Ding, Yao, Wang, and Qian] Tiancheng Wen, Zhonggan Ding, Yongqiang Yao, Yaxiong Wang, and Xueming Qian. Picassonet: Searching adaptive architecture for efficient facial landmark localization. IEEE Transactions on Neural Networks and Learning Systems, 2022.
  • [Wu et al.(2022)Wu, Fang, Zhang, Yang, and Xu] Junde Wu, Huihui Fang, Yu Zhang, Yehui Yang, and Yanwu Xu. Medsegdiff: Medical image segmentation with diffusion probabilistic model. arXiv preprint arXiv:2211.00611, 2022.
  • [Xia et al.(2022)Xia, Qu, Huang, Zhang, Wang, and Xu] Jiahao Xia, Weiwei Qu, Wenjian Huang, Jianguo Zhang, Xi Wang, and Min Xu. Sparse local patch transformer for robust face alignment and landmarks inherent relation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4052–4061, 2022.
  • [Xiao et al.(2018)Xiao, Wu, and Wei] Bin Xiao, Haiping Wu, and Yichen Wei. Simple baselines for human pose estimation and tracking. In Proceedings of the European conference on computer vision (ECCV), pages 466–481, 2018.
  • [Yang et al.(2017)Yang, Ren, Zhang, Chen, Wen, Li, and Hua] Jiaolong Yang, Peiran Ren, Dongqing Zhang, Dong Chen, Fang Wen, Hongdong Li, and Gang Hua. Neural aggregation network for video face recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4362–4371, 2017.
  • [Zhang et al.(2014)Zhang, Shan, Kan, and Chen] Jie Zhang, Shiguang Shan, Meina Kan, and Xilin Chen. Coarse-to-fine auto-encoder networks (cfan) for real-time face alignment. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part II 13, pages 1–16. Springer, 2014.
  • [Zhang(2019)] Richard Zhang. Making convolutional networks shift-invariant again. In International conference on machine learning, pages 7324–7334. PMLR, 2019.
  • [Zheng et al.(2022)Zheng, Yang, Zhang, Bao, Chen, Huang, Yuan, Chen, Zeng, and Wen] Yinglin Zheng, Hao Yang, Ting Zhang, Jianmin Bao, Dongdong Chen, Yangyu Huang, Lu Yuan, Dong Chen, Ming Zeng, and Fang Wen. General facial representation learning in a visual-linguistic manner. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18697–18709, 2022.
  • [Zhou et al.(2013a)Zhou, Fan, Cao, Jiang, and Yin] Erjin Zhou, Haoqiang Fan, Zhimin Cao, Yuning Jiang, and Qi Yin. Extensive facial landmark localization with coarse-to-fine convolutional network cascade. In Proceedings of the IEEE international conference on computer vision workshops, pages 386–391, 2013a.
  • [Zhou et al.(2013b)Zhou, Fan, Cao, Jiang, and Yin] Erjin Zhou, Haoqiang Fan, Zhimin Cao, Yuning Jiang, and Qi Yin. Extensive facial landmark localization with coarse-to-fine convolutional network cascade. In Proceedings of the IEEE international conference on computer vision workshops, pages 386–391, 2013b.