GRACE: Graph-Regularized Attentive Convolutional Entanglement with Laplacian Smoothing for Robust DeepFake Video Detection

This study was supported in part by the Ministry of Science and Technology (MOST), Taiwan, under grants MOST XXX, and in part by the Higher Education Sprout Project of the Ministry of Education (MOE) to the Headquarters of University Advancement at National Cheng Kung University (NCKU). (Corresponding author: Chih-Chung Hsu.) C.-C. Hsu, S.-N. Chen, M.-H. Wu, Y.-F. Wang, C.-M. Lee and Y.-S. Chou are with the Institute of Data Science and Department of Statistics, National Cheng Kung University, Tainan, Taiwan (R.O.C.).

Chih-Chung Hsu, Shao-Ning Chen, Mei-Hsuan Wu,
Yi-Fang Wang, Chia-Ming Lee, Yi-Shiuan Chou
Abstract

As DeepFake video manipulation techniques escalate and pose profound threats, efficient detection strategies are urgently needed. However, one particular issue is that facial images are often mis-detected in degraded videos or under adversarial attacks, leading to unexpected temporal artifacts that can undermine the efficacy of DeepFake video detection techniques. This paper introduces a novel method for robust DeepFake video detection, the Graph-Regularized Attentive Convolutional Entanglement (GRACE), which builds on a graph convolutional network with a graph Laplacian to address these challenges. First, conventional convolutional neural networks are deployed to extract spatiotemporal features from the entire video. Then, the spatial and temporal features are mutually entangled by constructing a graph with a sparse constraint, which retains the essential features of valid face images in the noisy face sequence and thus improves the stability and performance of DeepFake video detection. Furthermore, a graph Laplacian prior is introduced in the graph convolutional network to remove noise patterns in the feature space and further improve performance. Comprehensive experiments illustrate that the proposed method delivers state-of-the-art performance in DeepFake video detection under noisy face sequences. The source code is available at https://github.com/ming053l/GRACE.

Index Terms:
DeepFake Detection, Feature Entanglement, Graph Convolution Network, Adversarial Attack, Forgery Detection.

1 Introduction

With the widespread use of fake images and videos on various social network platforms for creating fake news and defrauding personal information, identifying synthesized content generated by generative adversarial networks (GANs) and variational autoencoders (VAEs) has become a critical challenge. As generative models advance and improve rapidly, current DeepFake detection techniques struggle to maintain effectiveness. To address this, several large-scale fake image datasets, such as FaceForensics++ (FF++) [1], DeepFake Challenge Dataset (DFCD) [2], Celeb-DF [3], and WildDeepFake [4] have been established to promote the development of effective DeepFake detection techniques.

DeepFake image and video manipulation techniques have emerged as the most well-known forgery generation applications, with far-reaching impacts on numerous individuals. Generally, facial manipulation schemes can be classified into four categories [1]: 1) entire face synthesis, 2) attribute manipulation, 3) identity swap, and 4) expression swap. Identity swap schemes have the most significant impact as they can be used to fabricate fake news targeting specific politicians. Many DeepFake detection techniques focus on identifying such fake videos using supervised learning methods with a pre-collected large-scale training set [5, 6, 7, 8, 9, 10, 11, 12].

Several advanced learning strategies have been proposed to enhance the performance of DeepFake image detection. For instance, methods in [7, 8] treat DeepFake image detection as binary classification tasks. The method in [10] introduces a novel multi-task learning approach to improve robustness and effectiveness. The authors in [9] also assert that traditional convolutional neural networks (CNNs) can be used to easily extract fake traces. However, the generalizability of such supervised learning strategies may be limited, as it is challenging to recognize DeepFake images generated by unknown GANs [13] due to difficulties in discerning out-of-distribution feature representations. To address the generalizability issue, semi-supervised learning is considered in [13, 14, 15] to capture common fake features from selected representative GANs, assuming that most GANs might share similar identifiable clues. Pairwise learning is then employed to learn these common features from the training set, improving generalizability for DeepFake image detection [14, 13]. Additionally, knowledge distillation is proposed for DeepFake detection by effectively transferring the weights of complex models to smaller models for enhanced generalizability [11].

In the domain of DeepFake video detection, numerous sophisticated approaches have recently emerged [16][17][18][19][20][21][22][23][24][25]. On the one hand, some methods extend DeepFake image detection techniques by averaging the predictions of individual frames to assess a video's authenticity [18][23][24][25]. On the other hand, temporal inconsistency features are exploited for DeepFake video classification using supervised learning approaches, as demonstrated in [20][21][16][17]. Specifically, state-of-the-art DeepFake video detection primarily focuses on exploiting various priors, such as bio-informatics clues [24], facial warping artifacts [25], and noise patterns [18]. Recently, several advanced techniques have been proposed to enhance DeepFake video detection performance. CORE [26] introduces a novel approach for learning consistent representations across different frames, while RECCE [27] employs a reconstruction-classification learning scheme to capture more discriminative features. DFIL [28] proposes an incremental learning framework that exploits domain-invariant forgery clues to improve generalization ability. TALL-Swin [29] utilizes a thumbnail layout and Swin Transformer to learn robust spatiotemporal features for DeepFake detection. UCF [30] focuses on uncovering common features shared by different manipulation techniques to enhance generalizability.

Another strategy for detecting DeepFake images and videos is the extra-clue-inspired approach. In [24], a novel bio-feature, the Photoplethysmography (PPG) response, is utilized to differentiate DeepFake videos, as real and fake videos exhibit distinct PPG features. A critical limitation is the need for high-resolution videos and images to capture PPG cues effectively. Moreover, [25] investigates warping artifacts at the boundaries of DeepFake videos, which arise from the limited resolution of synthesized facial components. However, contemporary GANs can generate high-resolution, realistic faces, rendering the resolution-inconsistency clues in [25] potentially less significant. Similarly, Face X-ray [18] leverages the boundary between real and fake facial regions as features, positing that the noise patterns of these parts differ and enabling traditional deep neural networks to identify DeepFake videos. As Face X-ray [18] demonstrates superior performance and robustness, recent DeepFake video detection techniques concentrate on uncovering more reliable signatures produced by GANs to enhance detection performance. Concurrently, [16] introduces a deep Laplacian of Gaussian and a loss on isolated manipulated faces to bolster the generalizability of DeepFake video tasks.

Recent research has concentrated on developing robust DeepFake detection models for compressed videos, as seen in [20][31][32]. The frequency component analysis method is employed to uncover intrinsic features and enhance performance under compression settings [31]. However, frequency-aware features may prove ineffective under high compression (with high-frequency reduction) or noisy conditions (with high-frequency amplification). The $F^{3}$-net [32] selects two complementary frequency bands as clues, devising a novel network to learn frequency-aware features that reveal subtle forgery artifacts. Specifically, Frequency-aware Image Decomposition (FAD) is designed to learn subtle forgery patterns, while Local Frequency Statistics (LFS) primarily extracts high-level semantic features. This approach improves performance for low-quality inputs. A recent development in DeepFake video detection, the Anti-DeepFake Transformer (ADT), is proposed in [19], with robustness confirmed through cross-dataset evaluation. Recent studies [33][34][35] have highlighted the vulnerability of DeepFake detectors to adversarial perturbations. Therefore, adversarial defenses for DeepFake detection, such as [36], have recently attracted attention. The method in [36] leverages statistical inference on CNNs to achieve better robustness to adversarial examples.

All of these DeepFake video detection models, however, often assume that the input facial sequence is reliable and well-detected, since current state-of-the-art face detectors show promising performance. One way to prevent manipulated faces from being caught by DeepFake video detectors is therefore to craft adversarial examples against the face detector, since face detection is the first stage of the pipeline in all DeepFake detection techniques. Numerous studies have demonstrated the effectiveness of adversarial attacks on face detectors [37]. Several recent adversarial perturbation strategies [38][39][40][41] have been proposed, potentially rendering face detectors ineffective. For instance, the methods introduced in [37] and [29] indicate that the detection rate can decay to less than $10\%$, implying that $90\%$ of the facial images in a face sequence could be invalid. These perturbed DeepFake videos can yield noisy face sequences with many invalid facial images, leading to unintended temporal feature jittering in temporal-clue-aware methods [20][19][31]. These temporal artifacts can significantly degrade their performance, while invalid facial images may also diminish the effectiveness of frame-level DeepFake video detection schemes [6][18], because the final decision for a video is based on majority voting.

Figure 1: Example of the detected faces from two videos using RetinaFace (top) and Dlib (bottom).
TABLE I: The confidence range of the detected faces using RetinaFace and Dlib for 200 videos sampled from FF++ [1]. Conf. and Det. denote the confidence range of the detected faces and the face detector used, respectively.

Det. / Conf.        [0, 0.33]   (0.33, 0.66]   (0.66, 1]   Total
RetinaFace (raw)        5            19           176       200
RetinaFace (c23)        6            28           166       200
RetinaFace (c40)        8            43           149       200
Dlib (raw)              0            13           187       200
Dlib (c23)              0            18           182       200
Dlib (c40)              2            33           165       200

Even video compression can reduce the detection rate of face detectors. We randomly select 200 videos from the FF++ dataset [1] at varying compression levels (raw, c23, c40) and extract 16 frames from each video for face detection analysis. We employ state-of-the-art face detection tools, namely RetinaFace [42] and Dlib [43], to substantiate our observations. Table I presents the face detection outcomes for the sampled videos. Notably, under c40 compression, the predicted probability of 8 videos with RetinaFace [42] and 2 videos with Dlib [43] falls within the $[0, 0.33]$ range, implying that the faces in these 8 and 2 videos, respectively, are mis-detected. Additionally, for uncompressed videos, 19 RetinaFace results have a confidence lower than $66\%$, highlighting the imperfections of face detectors. This raises the question: are current DeepFake detection methods robust to such noisy face sequences? The answer is negative. We simply replace $40\%$ of the facial images with background images in the FF++ [1] testing set under the raw setting and evaluate the performance using Xception [6]. Unsurprisingly, the accuracy drops significantly after the replacement. An effective solution to the issues raised by noisy face sequences in DeepFake video detection is therefore highly desired.
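The replacement experiment above can be imitated on toy data. The following is a minimal sketch, assuming NumPy arrays of face crops and background crops with identical shapes; the function name `replace_with_background`, the seed, and the array sizes are illustrative assumptions, not the authors' evaluation code.

```python
import numpy as np

def replace_with_background(faces, backgrounds, ratio=0.4, seed=0):
    """Return a copy of `faces` in which round(ratio * N) frames are
    swapped for the background crops at the same frame indices."""
    rng = np.random.default_rng(seed)
    n_frames = faces.shape[0]
    n_invalid = int(round(ratio * n_frames))
    # Pick distinct frame indices to corrupt.
    invalid_idx = np.sort(rng.choice(n_frames, size=n_invalid, replace=False))
    noisy = faces.copy()
    noisy[invalid_idx] = backgrounds[invalid_idx]
    return noisy, invalid_idx

# Toy sequence: 10 "face" frames of ones and 10 "background" frames of zeros.
faces = np.ones((10, 4, 4, 3))
backgrounds = np.zeros((10, 4, 4, 3))
noisy_faces, invalid_idx = replace_with_background(faces, backgrounds, ratio=0.4)
print(len(invalid_idx), "of", len(faces), "frames replaced")
```

The resulting sequence could then be passed to any frame-level or video-level detector to observe how its accuracy changes with the ratio of invalid frames.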

Figure 2: Flowchart of the proposed GRACE with Graph Laplacian regularizer for robust DeepFake video detection.

In light of the escalating threat posed by various malicious attacks on face detectors that aim to undermine their reliability, this paper presents a pioneering Graph-Regularized Attentive Convolutional Entanglement (GRACE) with Laplacian Smoothing learning approach. GRACE leverages contextual features in both temporal and spatial domains to effectively detect DeepFake videos under noisy face sequences. We meticulously incorporate sparsity regularization into our model to prioritize the features of valid face images within the noisy face sequence. By employing the proposed Feature Entanglement (FE) technique, an affinity matrix is constructed to amalgamate the spatiotemporal features, ensuring that each node possesses at least one feature descriptor originating from valid face images. Ultimately, Graph Laplacian (GL) smoothing regularization is ingeniously integrated into the Graph Convolutional Network (GCN) to further suppress noisy nodes, thereby significantly enhancing the performance of DeepFake video detection. The main contributions of this paper are three-fold:

  • We propose a novel GRACE with a Laplacian Smoothing learning framework that exploits contextual features in both temporal and spatial domains for robust DeepFake video detection under noisy face sequences. To the best of our knowledge, this is the first work to address the issue of unreliable face sequences for DeepFake video detection.

  • We introduce a Feature Entanglement (FE) mechanism to construct an affinity matrix that mixes the spatiotemporal features together, ensuring each node contains at least one feature from valid face images. This approach effectively mitigates the impact of invalid facial images in the noisy face sequence.

  • We propose a GL smoothing regularizer in the GCN to filter the noisy nodes further and improve the performance of DeepFake video detection. Comprehensive experiments demonstrate that our method achieves state-of-the-art performance, especially under challenging scenarios with unreliable and noisy face sequences.

The rest of this paper is organized as follows. Section 2 presents the proposed GRACE architecture design. In Section 3, the superiority of GRACE over benchmark methods is experimentally demonstrated. Finally, conclusions are drawn in Section 4.

2 Proposed Graph-Regularized Attentive Convolutional Entanglement

2.1 Overview of the Proposed Method

The flowchart of the proposed GRACE is illustrated in Fig. 2. First, a face detector extracts facial images from each video frame. A CNN-based backbone network then extracts high-level semantic features from the spatial domain of the acquired facial parts, as displayed in the center of Fig. 2. Using the extracted spatial features, the spatial and temporal representations at frame $n$ and location $(i,j)$ across all feature maps, $\bm{X} \in \mathbb{R}^{d \times c}$, can be obtained for the face sequence, where $d = N \times w \times h$, potentially including partially invalid faces. This feature representation captures the feature responses at frame $n$ and location $(i,j)$ across all frames, thereby integrating temporal information, as shown in Fig. 2.

To augment the correlation between the spatial and temporal feature representation $\bm{X}$ acquired in the previous step, we introduce a novel Feature Entanglement (FE) with sparse constraint, denoted as $\bm{X}_{\text{FE}} = G_{\text{FE}}(\bm{X}) \in \mathbb{R}^{d \times d}$, which embeds both temporal and spatial features into a graph representation through the affinity matrix $\bm{X}_{\text{FE}}$ computed from the original feature $\bm{X}$. In highly noisy face sequences, the number of invalid faces could exceed that of valid ones. The essential features could therefore be relatively few, which motivates us to introduce the sparsity constraint into GRACE to focus on those essential features. Then, to efficiently discern the importance of the graph representation $\bm{X}_{\text{FE}}$, we introduce the GCN to capture the contextual features between nodes (spatiotemporal features) in $\bm{X}$. To further remove the noisy nodes from the original $\bm{X}_{\text{FE}}$, the Graph Laplacian is adapted to each layer of the GCN for better performance under noisy face sequences. Finally, a softmax classifier is connected to the output of the GCN to evaluate the authenticity of the supplied facial parts.

2.2 Feature Entanglement with Sparse Constraint

We develop a method inspired by spatiotemporal feature extraction [44][45]. Traditional CNNs serve as the backbone network to obtain the spatial feature representation $\bm{X}^{n} \in \mathbb{R}^{c \times h \times w}$ for each frame $\bm{y}_{d}^{n}$ of the video. Assuming that the size of the extracted feature map is $c \times h \times w$, the spatial feature representation of a specific video at location $(i,j)$ via the backbone network can be vectorized into $\bm{x}_{n,i,j} = [x_{(n,i,j)}^{1}, x_{(n,i,j)}^{2}, \ldots, x_{(n,i,j)}^{c}] \in \mathbb{R}^{c \times 1}$, where $c$ is the number of channels in the extracted spatial feature map and $i = 1, \ldots, w$, $j = 1, \ldots, h$.
Then, letting $d = Nwh$, we create the feature context $\bm{X} \in \mathbb{R}^{d \times c}$ based on location-wise feature concatenation, as follows:

$$
\begin{aligned}
\bm{X} = [\; & (\bm{x}^{1}_{1,1,1},\ \bm{x}^{2}_{1,1,1},\ \ldots,\ \bm{x}^{c}_{1,1,1}); \\
& (\bm{x}^{1}_{1,1,2},\ \bm{x}^{2}_{1,1,2},\ \ldots,\ \bm{x}^{c}_{1,1,2}); \\
& \qquad \vdots \\
& (\bm{x}^{1}_{1,h,w},\ \bm{x}^{2}_{1,h,w},\ \ldots,\ \bm{x}^{c}_{1,h,w}); \\
& \qquad \vdots \\
& (\bm{x}^{1}_{N,h,w},\ \bm{x}^{2}_{N,h,w},\ \ldots,\ \bm{x}^{c}_{N,h,w}) \;],
\end{aligned}
\tag{1}
$$

where $\bm{X}$ represents the spatiotemporal feature.
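The location-wise concatenation in Eq. (1) amounts to a transpose followed by a reshape. The snippet below is a minimal NumPy sketch, assuming the backbone output for a video is stored as an array of shape $(N, c, h, w)$; the helper name `build_feature_context` and the toy sizes are illustrative assumptions rather than part of the released code.

```python
import numpy as np

def build_feature_context(feature_maps):
    """Stack per-frame feature maps of shape (N, c, h, w) into X of shape (N*h*w, c).

    Each row of X is the c-dimensional feature vector of one frame at one
    spatial location; rows are ordered frame by frame, then by location.
    """
    n_frames, channels, height, width = feature_maps.shape
    # Move the channel axis last: (N, h, w, c).
    per_location = feature_maps.transpose(0, 2, 3, 1)
    # Flatten the frame and location axes into d = N*h*w rows.
    return per_location.reshape(n_frames * height * width, channels)

rng = np.random.default_rng(0)
maps = rng.standard_normal((4, 8, 3, 3))  # toy sizes: N=4, c=8, h=w=3
X = build_feature_context(maps)
print(X.shape)  # (36, 8)
```

With this layout, the first $wh$ rows belong to the first frame and the last $wh$ rows to the $N$-th frame, matching the row order written in Eq. (1).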

Efficiently extracting joint spatial and temporal feature representations from $\bm{X}$ requires addressing the inefficiency caused by large distances between the $n$-th and $m$-th feature vectors, $\bm{x}^{n}$ and $\bm{x}^{m}$, especially when $m$ and $n$ are far apart. Directly adopting the Transformer [44][46] could lead to high computational complexity, since a large number of tokens is needed to capture such long-range correlations.

To address these issues and efficiently exploit joint spatial and temporal feature representations, we propose a novel concept, FE, which is an affinity matrix that creates a graph representation encapsulating the relationships between nodes, where each node represents entangled spatial and temporal features connected through edges. Every element (node) in $\bm{X}_{\text{FE}}$ intertwines spatial and temporal information, which helps graph neural networks learn the relationships among spatiotemporal features more effectively; in particular, the features of the first and last frames can still be linked. The feature entanglement $\bm{X}_{\text{FE}}$ is defined as follows:

$$
\mathbf{X}_{\text{FE}} = G_{\text{FE}}(\bm{X}) = \mathbf{X}\mathbf{X}^{T}, \tag{2}
$$

where $\mathbf{X}_{\text{FE}} \in \mathbb{R}^{d \times d}$ is the affinity matrix obtained through feature entanglement. Consider, for instance, the feature element $\mathbf{X}_{\text{FE}}(1,c)$, which is computed by taking the inner product of the first and the $c$-th rows of $\mathbf{X}$. This element encapsulates the correlation between the spatial features at location $(1,1)$ across all frames and the spatial features at location $(w,h)$ in the first frame. By constructing the affinity matrix $\mathbf{X}_{\text{FE}}$ in this manner, we effectively capture the spatiotemporal dependencies within the feature representation. However, learning discriminative features from $\mathbf{X}_{\text{FE}}$ can be challenging, particularly in the presence of highly noisy face sequences, where a significant portion of the facial images may be invalid. To address this issue, we introduce a sparsity constraint as a regularization term in the learning objective, formulated as the $\ell_{1}$ norm of $\mathbf{X}$, denoted by $\|\mathbf{X}\|_{1}$. By enforcing sparsity on $\mathbf{X}$, we encourage the model to focus on the most informative features, thereby enhancing its robustness to noisy face sequences and improving its effectiveness in DeepFake video detection.
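Eq. (2) and the $\ell_{1}$ term can be illustrated on a small matrix. This is a minimal NumPy sketch with hypothetical helper names (`feature_entanglement`, `l1_penalty`); how the penalty is weighted in the full training loss is not specified here, so the code only evaluates the two quantities themselves.

```python
import numpy as np

def feature_entanglement(X):
    """Affinity matrix X X^T: entry (a, b) is the inner product of rows a and b of X."""
    return X @ X.T

def l1_penalty(X):
    """Entry-wise l1 norm of X, used as a sparsity regularizer."""
    return np.abs(X).sum()

# Toy feature context with d=3 nodes and c=2 channels.
X = np.array([[1.0, 0.0],
              [0.0, 2.0],
              [1.0, 1.0]])
X_fe = feature_entanglement(X)
print(X_fe)           # 3x3 symmetric affinity matrix
print(l1_penalty(X))  # 5.0
```

Because $\mathbf{X}\mathbf{X}^{T}$ is symmetric, the resulting graph is undirected, and nodes whose feature vectors are orthogonal (the first two rows here) receive a zero edge weight.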

2.3 Proposed GCN with Graph Laplacian

In the previous subsection, we introduced the FE with sparse constraint, a crucial component of GRACE that aims to efficiently exploit joint spatial and temporal feature representations. In this subsection, we describe how the graph obtained from FE is processed by a GCN with the Graph Laplacian, and discuss how this addresses the challenges associated with unreliable face sequences in DeepFake detection.

Graph Convolutional Networks (GCNs) have emerged as a powerful tool for processing graph-structured data [47, 48], making them a suitable choice for handling the graph embedding obtained through feature entanglement. However, the noise level in each node can vary depending on the degree of distortion. For highly distorted or noisy nodes (i.e., those with entangled features primarily contributed by invalid facial images), it is beneficial to eliminate them to ensure stable and excellent performance in DeepFake video detection. To mitigate the impact of features from invalid facial images, we judiciously integrate the Graph Laplacian Smoothing Prior (GLSP), a well-established concept in Graph Signal Processing (GSP) [49], as a regularizer into the GCN to filter out highly noisy nodes [50]. It is important to note that GSP and GCN are distinct domains, with GSP focusing on the analysis and processing of signals defined on graphs, while GCN aims to learn representations by exploiting the graph structure. In this study, we ingeniously leverage the properties of GLSP from GSP and seamlessly incorporate it into the GCN framework, enabling end-to-end training without explicitly computing the eigendecomposition of the Graph Laplacian matrix.

Let $\mathbf{X} \in \mathbb{R}^{d \times c}$ denote the SFE feature matrix, where $d = Nwh$ represents the spatiotemporal dimensions, with $N$ being the number of frames, $w$ and $h$ being the width and height of the feature maps, and $c$ being the feature dimension. This matrix encapsulates the spatiotemporal features extracted from the entire video sequence.

We construct the affinity matrix $\bm{A} = \mathbf{X}_{\text{FE}} \in \mathbb{R}^{d \times d}$ through feature entanglement, where $\mathbf{A}_{ij}$ represents the edge weight between nodes $i$ and $j$. This matrix captures the similarity between different nodes. The degree matrix $\mathbf{D} \in \mathbb{R}^{d \times d}$ is a diagonal matrix where $\mathbf{D}_{ii} = \sum_{j} \mathbf{A}_{ij}$ represents the sum of edge weights connected to each node. The Graph Laplacian matrix $\mathbf{L} = \mathbf{D} - \mathbf{A}$ captures the topological structure of the graph and the differences between nodes.
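The definitions of $\mathbf{D}$ and $\mathbf{L}$ can be checked numerically. The sketch below assumes a small symmetric toy affinity matrix and uses the standard identity $\mathbf{z}^{T}\mathbf{L}\mathbf{z} = \tfrac{1}{2}\sum_{i,j}\mathbf{A}_{ij}(z_i - z_j)^2$ for symmetric $\mathbf{A}$ to show in what sense $\mathbf{L}$ measures differences between nodes; the function name `graph_laplacian` and the toy values are assumptions for illustration.

```python
import numpy as np

def graph_laplacian(A):
    """Return the degree matrix D (row sums of A on the diagonal) and L = D - A."""
    D = np.diag(A.sum(axis=1))
    return D, D - A

# Toy symmetric affinity matrix for 3 nodes.
A = np.array([[0.0, 1.0, 2.0],
              [1.0, 0.0, 0.0],
              [2.0, 0.0, 0.0]])
D, L = graph_laplacian(A)

# A scalar signal on the nodes; the quadratic form penalizes differences
# across strongly connected nodes.
z = np.array([1.0, 2.0, 4.0])
smoothness = z @ L @ z
pairwise = 0.5 * sum(A[i, j] * (z[i] - z[j]) ** 2
                     for i in range(3) for j in range(3))
print(smoothness, pairwise)  # 19.0 19.0
```

Every row of $\mathbf{L}$ sums to zero, so a signal that is constant over all nodes yields a zero quadratic form, while a signal that changes sharply across a heavy edge is penalized most.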

To further remove the redundancy among nodes, we apply adaptive thresholding to $\mathbf{A}$ to filter out weak or irrelevant connections. For each sample $i$, we compute the mean value of its feature entanglement matrix $\mathbf{A}$ and keep only the elements that are greater than half of the mean value. The indices and values of these elements are then extracted to form the edge indices and edge weights of a sparse affinity matrix $\mathbf{A}^{(i)}$, as follows:

$$
\mathbf{A}^{(i)}_{jk} =
\begin{cases}
\left(\mathbf{X}_{\text{FE}}^{(i)}\right)_{jk}, & \text{if } \left(\mathbf{X}_{\text{FE}}^{(i)}\right)_{jk} > q \times \overline{\mathbf{X}_{\text{FE}}^{(i)}} \\
0, & \text{otherwise}
\end{cases}
\tag{3}
$$

where $\mathbf{A}^{(i)}_{jk}$ denotes the edge weight between nodes $j$ and $k$ in the adjacency matrix of sample $i$, $\overline{\mathbf{X}_{\text{FE}}^{(i)}}$ is the mean value of the feature entanglement matrix of sample $i$, and $q$ is the factor controlling how strictly the connections are filtered. In this study, $q = 0.5$ for all experiments. This thresholding operation helps to focus on the most important connections and reduces the computational burden of the GCN.
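The thresholding in Eq. (3) and the extraction of edge indices and edge weights can be sketched as follows. This is a minimal NumPy illustration for a single sample; the function name `threshold_affinity`, the $(2, E)$ edge-index layout, and the toy matrix are assumptions and may differ from the released implementation.

```python
import numpy as np

def threshold_affinity(X_fe, q=0.5):
    """Keep entries of X_fe that exceed q times its mean; zero out the rest.

    Returns the sparse affinity matrix together with the (2, E) edge indices
    and the E edge weights of the surviving entries.
    """
    cutoff = q * X_fe.mean()
    keep = X_fe > cutoff
    A = np.where(keep, X_fe, 0.0)
    rows, cols = np.nonzero(keep)
    edge_index = np.stack([rows, cols])
    edge_weight = X_fe[rows, cols]
    return A, edge_index, edge_weight

# Toy affinity matrix: mean is about 1.29, so the cutoff is about 0.64.
X_fe = np.array([[4.0, 0.2, 1.0],
                 [0.2, 3.0, 0.1],
                 [1.0, 0.1, 2.0]])
A, edge_index, edge_weight = threshold_affinity(X_fe, q=0.5)
print(edge_index.shape)  # (2, 5)
print(A)
```

In this toy case the four weak entries (0.2 and 0.1, each appearing twice) are removed, leaving five edges including the three self-connections on the diagonal.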

The resulting sparse adjacency matrix $\mathbf{A}^{(i)}$, along with the node features $\mathbf{X}^{(i)}$, serves as the input to the GCN for learning the graph structure and node representations. By combining the FE matrix $\bm{X}_{\text{FE}}$ and the sparse adjacency matrix $\bm{A}$, our method leverages the advantages of both representations, capturing rich spatiotemporal correlations while focusing on the most informative connections for DeepFake detection.

Graph Convolutional Networks (GCNs) have shown remarkable performance in various tasks by leveraging the power of graph-structured data. The core operation of GCNs in the $l$-th layer can be described as follows:

$$\mathbf{Z}^{(l+1)}=\sigma\left(\hat{\mathbf{D}}^{-\frac{1}{2}}\hat{\mathbf{A}}\hat{\mathbf{D}}^{-\frac{1}{2}}\mathbf{Z}^{(l)}\mathbf{W}^{(l)}\right), \tag{4}$$

where $\hat{\mathbf{A}}=\mathbf{A}+\mathbf{I}_{d}$ is the adjacency matrix with self-loops, $\hat{\mathbf{D}}_{ii}=\sum_{j=1}^{d}\hat{\mathbf{A}}_{ij}$ is the corresponding degree matrix, $\mathbf{W}^{(l)}$ is the weight matrix of the $l$-th layer, and $\sigma$ is the activation function. This equation describes the aggregation and transformation of node features based on the graph structure.

However, the performance of GCNs may be compromised when dealing with highly noisy scenarios, limiting their applicability in real-world situations. To address this challenge, we propose the incorporation of Graph Laplacian regularization into GCNs to enhance their robustness and improve their performance in the presence of significant noise.

Given an undirected graph $G=(V,E)$, where $V$ is the set of nodes and $E$ is the set of edges, the Graph Laplacian matrix $\mathbf{L}$ is defined as:

$$\mathbf{L}=\mathbf{D}-\mathbf{A}, \tag{5}$$

where $\mathbf{D}$ is the degree matrix and $\mathbf{A}$ is the adjacency matrix of the graph $G$. The degree matrix $\mathbf{D}$ is a diagonal matrix, where $\mathbf{D}_{ii}$ equals the degree of node $i$ in $G$. It is worth noting that the Graph Laplacian matrix $\mathbf{L}$ is different from the matrix $\hat{\mathbf{L}}$ used in the GCN propagation rule, which will be discussed later.

As a real symmetric matrix, the Graph Laplacian matrix $\mathbf{L}$ possesses an eigendecomposition:

$$\mathbf{L}=\mathbf{U}\Lambda\mathbf{U}^{T}, \tag{6}$$

where $\mathbf{U}$ is the matrix of eigenvectors and $\Lambda$ is the diagonal matrix of eigenvalues. The eigenvalues of $\mathbf{L}$ represent the frequencies of the graph signals, with smaller eigenvalues corresponding to lower frequencies and larger eigenvalues corresponding to higher frequencies. In GSP, filters built from the Graph Laplacian matrix are often used to smooth signals defined on graphs, effectively suppressing high-frequency noise while preserving low-frequency information. Although we do not explicitly compute the eigendecomposition of $\mathbf{L}$ in our implementation, the Graph Laplacian matrix inherently encodes the spectral properties of the graph, which enables effective feature smoothing and noise suppression.

In practice, we calculate the Graph Laplacian matrix $\mathbf{L}$ using the degree matrix $\mathbf{D}$ and the adjacency matrix $\mathbf{A}$, which are obtained from the graph structure. This calculation does not involve the explicit computation of eigenvalues and eigenvectors. However, the resulting matrix $\mathbf{L}$ still possesses the spectral properties that enable effective feature smoothing and noise suppression.
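
To make the link between $\mathbf{L}$ and signal smoothness concrete, the following sketch builds $\mathbf{L}=\mathbf{D}-\mathbf{A}$ for a small graph and evaluates the quadratic form $\mathbf{x}^{T}\mathbf{L}\mathbf{x}$, which sums the squared differences of a signal across edges. The three-node path graph and the two test signals are hypothetical and chosen only for illustration.

```python
import numpy as np


def graph_laplacian(adj: np.ndarray) -> np.ndarray:
    """Combinatorial Laplacian L = D - A of a weighted undirected graph."""
    degree = np.diag(adj.sum(axis=1))
    return degree - adj


# Hypothetical path graph 0 - 1 - 2 with unit edge weights.
adj = np.array([[0.0, 1.0, 0.0],
                [1.0, 0.0, 1.0],
                [0.0, 1.0, 0.0]])
lap = graph_laplacian(adj)

smooth = np.array([1.0, 1.0, 1.0])   # constant signal, no variation across edges
rough = np.array([1.0, -1.0, 1.0])   # sign flips across every edge

# x^T L x equals the sum of (x_i - x_j)^2 over the edges, i.e. a roughness measure.
energy_smooth = smooth @ lap @ smooth
energy_rough = rough @ lap @ rough
print(energy_smooth, energy_rough)
```

The constant signal has zero energy, whereas the alternating signal has energy $4+4=8$, which is why large values of this quadratic form are associated with high-frequency (noisy) components.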

To integrate the Graph Laplacian smoothing prior into the GCN propagation rule, we propose a modified version of the Graph Laplacian matrix that incorporates the graph’s structural information and enables effective feature smoothing and noise suppression. Our proposed Graph Laplacian matrix, denoted as $\hat{\mathbf{L}}$, is defined as follows:

$$\hat{\mathbf{L}}=\hat{\mathbf{D}}^{-\frac{1}{2}}\hat{\mathbf{A}}\hat{\mathbf{D}}^{-\frac{1}{2}}, \tag{7}$$

where $\hat{\mathbf{A}}=\mathbf{A}+\mathbf{I}_{d}$ is the adjacency matrix with self-loops, $\hat{\mathbf{D}}$ is the corresponding degree matrix with $\hat{\mathbf{D}}_{ii}=\sum_{j=1}^{d}\hat{\mathbf{A}}_{ij}$, and $\mathbf{I}_{d}$ is the identity matrix.

The matrix $\hat{\mathbf{L}}$ is a normalized operator derived from the Graph Laplacian matrix (it coincides with the symmetrically normalized adjacency matrix with self-loops), which captures the graph’s structure and enables the smoothing of node features. By incorporating self-loops into the adjacency matrix, we ensure that each node’s feature is considered during the smoothing process, enhancing the stability and expressiveness of the learned representations.
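
The construction in Equation (7) and its smoothing effect can be checked numerically. The sketch below assumes non-negative, symmetric edge weights; the helper name `normalized_operator` and the six-node ring graph are hypothetical choices made only for this example.

```python
import numpy as np


def normalized_operator(adj: np.ndarray) -> np.ndarray:
    """Return D_hat^{-1/2} (A + I) D_hat^{-1/2} as in Eq. (7)."""
    a_hat = adj + np.eye(adj.shape[0])             # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))  # degrees are >= 1, so this is safe
    return d_inv_sqrt[:, None] * a_hat * d_inv_sqrt[None, :]


# Hypothetical ring of 6 nodes: node i is linked to nodes i-1 and i+1.
n = 6
adj = np.zeros((n, n))
for i in range(n):
    adj[i, (i + 1) % n] = adj[(i + 1) % n, i] = 1.0
l_hat = normalized_operator(adj)

constant = np.ones(n)                    # smoothest possible signal
alternating = np.array([1.0, -1.0] * 3)  # roughest signal on this ring
out_constant = l_hat @ constant
out_alternating = l_hat @ alternating
```

On this ring every node has degree 3 after adding the self-loop, so each output is the average of a node and its two neighbors: the constant signal passes through unchanged, while the alternating signal shrinks to one third of its original magnitude.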

To integrate the Graph Laplacian smoothing prior into the GCN propagation rule, we modify the propagation equation as follows:

$$\mathbf{Z}^{(l+1)}=\sigma\left(\hat{\mathbf{L}}\mathbf{Z}^{(l)}\mathbf{W}^{(l)}\right), \tag{8}$$

where $\mathbf{Z}^{(l)}$ represents the node features at layer $l$, $\mathbf{W}^{(l)}$ is the learnable weight matrix, and $\sigma$ is the activation function.

The term $\hat{\mathbf{L}}\mathbf{Z}^{(l)}$ effectively applies the Graph Laplacian smoothing prior to the node features. By multiplying the node features with the normalized Graph Laplacian matrix $\hat{\mathbf{L}}$, we achieve a smoothing effect that takes into account the graph’s structure. This operation allows the model to leverage the connectivity information encoded in the graph to refine the node features and suppress high-frequency noise.
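
A forward pass through the propagation rule of Equation (8) can be sketched as follows. This is an illustrative NumPy version, not the released code: ReLU is an assumed choice for $\sigma$, and the graph size, feature dimensions, and random weights are hypothetical.

```python
import numpy as np


def normalized_operator(adj: np.ndarray) -> np.ndarray:
    """D_hat^{-1/2} (A + I) D_hat^{-1/2}, the matrix L_hat of Eq. (7)."""
    a_hat = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    return d_inv_sqrt[:, None] * a_hat * d_inv_sqrt[None, :]


def gcn_layer(l_hat: np.ndarray, z: np.ndarray, w: np.ndarray) -> np.ndarray:
    """One propagation step of Eq. (8); ReLU is an assumed choice for sigma."""
    return np.maximum(l_hat @ z @ w, 0.0)


rng = np.random.default_rng(0)
num_nodes, in_dim, hidden_dim = 16, 8, 4   # hypothetical sizes
adj = rng.random((num_nodes, num_nodes))
adj = (adj + adj.T) / 2.0                  # symmetric, non-negative weights
np.fill_diagonal(adj, 0.0)                 # self-loops are added inside the helper
l_hat = normalized_operator(adj)

z0 = rng.normal(size=(num_nodes, in_dim))
w0 = rng.normal(size=(in_dim, hidden_dim))
w1 = rng.normal(size=(hidden_dim, hidden_dim))

z1 = gcn_layer(l_hat, z0, w0)   # first layer: smooth over the graph, then transform
z2 = gcn_layer(l_hat, z1, w1)   # second layer reuses the same L_hat
```

Note that $\hat{\mathbf{L}}$ depends only on the graph, so it can be computed once per sample and reused by every stacked layer.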

It is important to note that although we do not explicitly compute the eigendecomposition of $\hat{\mathbf{L}}$ in our implementation, the matrix $\hat{\mathbf{L}}$ inherently possesses the spectral properties of the Graph Laplacian. By using $\hat{\mathbf{L}}$ in the GCN propagation rule, we implicitly leverage these spectral properties to achieve effective feature smoothing and noise suppression.

The effectiveness of the Graph Laplacian matrix $\hat{\mathbf{L}}$ as a low-pass filter can be understood by examining its eigendecomposition:

$$\hat{\mathbf{L}}=\mathbf{U}\Lambda\mathbf{U}^{T}, \tag{9}$$

where $\mathbf{U}$ is the matrix of eigenvectors and $\Lambda$ is the diagonal matrix of eigenvalues. The eigenvalues of $\hat{\mathbf{L}}$ are tied to the frequencies of the graph signals: because $\hat{\mathbf{L}}$ equals the identity minus the symmetrically normalized Laplacian of the self-loop graph, eigenvalues of $\hat{\mathbf{L}}$ close to 1 correspond to lower frequencies and smaller eigenvalues correspond to higher frequencies. By multiplying the node features with $\hat{\mathbf{L}}$, we essentially apply a low-pass filter to the graph signals, attenuating the high-frequency components while preserving the low-frequency information. This operation effectively suppresses noise and promotes the smoothness of the node features across the graph. In practice, we compute $\hat{\mathbf{L}}$ using the self-loop adjacency matrix $\hat{\mathbf{A}}$ and the corresponding degree matrix $\hat{\mathbf{D}}$, as shown in Equation (7). This computation can be efficiently performed using sparse matrix operations, without the need for explicit eigendecomposition.
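
The filtering behaviour of $\hat{\mathbf{L}}$ can be made explicit with a short derivation. The sketch below assumes an undirected graph with non-negative edge weights; the symbols $\tilde{\mathbf{L}}_{\text{sym}}$, $\tilde{\lambda}_{k}$, and $\mathbf{u}_{k}$ are introduced here only for illustration and are not part of the original notation.

$$\tilde{\mathbf{L}}_{\text{sym}}=\mathbf{I}_{d}-\hat{\mathbf{D}}^{-\frac{1}{2}}\hat{\mathbf{A}}\hat{\mathbf{D}}^{-\frac{1}{2}}=\mathbf{I}_{d}-\hat{\mathbf{L}},\qquad \tilde{\mathbf{L}}_{\text{sym}}\mathbf{u}_{k}=\tilde{\lambda}_{k}\mathbf{u}_{k},\qquad 0=\tilde{\lambda}_{1}\leq\dots\leq\tilde{\lambda}_{d}\leq 2,$$

$$\hat{\mathbf{L}}\mathbf{z}=\sum_{k=1}^{d}\left(1-\tilde{\lambda}_{k}\right)\left(\mathbf{u}_{k}^{T}\mathbf{z}\right)\mathbf{u}_{k}.$$

Each spectral component of a node-feature column $\mathbf{z}$ is therefore scaled by the gain $1-\tilde{\lambda}_{k}$: the smoothest component ($\tilde{\lambda}_{1}=0$) passes unchanged, and the gain decreases as the graph frequency $\tilde{\lambda}_{k}$ grows.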

To provide a rigorous theoretical analysis of the effectiveness of our proposed Graph Laplacian regularization, we can examine the convergence and generalization properties of the method. Let $\mathbf{Z}^{*}$ denote the optimal node features that minimize the loss function $\mathcal{L}(\mathbf{Z})$. We can show that by incorporating the Graph Laplacian regularization term $\hat{\mathbf{L}}\mathbf{Z}^{(l)}$ into the GCN propagation rule, the learned node features $\mathbf{Z}^{(l)}$ converge to $\mathbf{Z}^{*}$ under mild assumptions on the graph structure and the loss function. Specifically, if the graph is connected and the loss function is convex and smooth, the iterative updates of the node features using Equation (8) will converge to the optimal solution $\mathbf{Z}^{*}$ (see Theorem 1). Furthermore, the generalization error of the learned node features can be bounded by the graph Laplacian regularization term, indicating that the proposed method effectively controls the model complexity and prevents overfitting to noisy or irrelevant features.

By incorporating the Graph Laplacian smoothing prior into the GCN, our method effectively addresses the challenges posed by noisy scenarios. The smoothing operation helps to mitigate the impact of noisy or irrelevant features, enhancing the robustness and generalization ability of the learned representations.

Finally, the output features can be obtained by passing the final layer’s features $\mathbf{Z}^{(L)}$ through a fully connected layer (FC):

$$\mathbf{Z}=\sigma\left(\mathbf{W}_{\text{out}}\mathbf{Z}^{(L)}\mathbf{W}^{(L)}\right) \tag{10}$$

where $\mathbf{W}_{\text{out}}\in\mathbb{R}^{g_{\text{dim}}\times n_{\text{out}}}$ denotes the weight of the FC, and $n_{\text{out}}$ and $g_{\text{dim}}$ stand for the number of neurons of the FC and the embedding dimension of the GCN, respectively. The prediction is then obtained via

$$\hat{\mathbf{Y}}=\text{Softmax}(\mathbf{W}_{\text{cls}}\mathbf{Z}) \tag{11}$$

where $\mathbf{W}_{\text{cls}}\in\mathbb{R}^{n_{\text{out}}\times n_{\text{cls}}}$ denotes the weight of the classification FC, and $n_{\text{cls}}$ stands for the number of classes.

During the training phase, the cross-entropy loss, together with a sparsity constraint, is employed to optimize the model parameters:

$$\mathcal{L}=-\sum_{c=1}^{n_{\text{cls}}}\mathbf{Y}_{c}\log(\hat{\mathbf{Y}}_{c})+\alpha\|\mathbf{X}\|_{1} \tag{12}$$

where $\mathbf{Y}$ is the one-hot encoded ground-truth label, $\hat{\mathbf{Y}}$ is the prediction from Equation (11), and $\alpha$ stands for the weight of the sparsity constraint, with $\alpha=10^{-5}$ in this study for all experiments.
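
The training objective can be sketched as follows, using the conventional orientation of cross-entropy in which the one-hot label weights the logarithm of the predicted probability. The function name `sparse_cross_entropy`, the small constant `eps` for numerical safety, and the toy inputs are assumptions made for this illustration; `x` stands for whichever tensor $\mathbf{X}$ denotes in Equation (12).

```python
import numpy as np


def sparse_cross_entropy(y_true, y_prob, x, alpha=1e-5, eps=1e-12):
    """Cross-entropy plus an L1 sparsity penalty weighted by alpha."""
    cross_entropy = -np.sum(y_true * np.log(y_prob + eps))  # eps avoids log(0)
    l1_penalty = np.abs(x).sum()
    return cross_entropy + alpha * l1_penalty


y_true = np.array([0.0, 1.0])   # one-hot label (hypothetical sample)
y_prob = np.array([0.2, 0.8])   # softmax output
x = np.ones((10, 10))           # placeholder for the penalized tensor
loss = sparse_cross_entropy(y_true, y_prob, x)
```

With these toy values the cross-entropy term is $-\log 0.8\approx 0.223$ and the penalty adds $10^{-5}\times 100=10^{-3}$, which shows how lightly the sparsity term is weighted relative to the classification term.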

By incorporating Graph Laplacian regularization into the GCN, our proposed method effectively addresses the challenges posed by noisy face sequences in DeepFake video detection. The Graph Laplacian smoothing helps to filter out noisy nodes and enhances the robustness and stability of the model. The combination of feature entanglement, sparse regularization, and Graph Laplacian regularization enables our method to make the most of the available valid information, suppress the influence of noisy features, and improve the correlation between relevant features. As a result, our approach achieves high-performance and robust DeepFake video detection, even in the presence of low-quality and noisy face sequences commonly encountered in real-world scenarios.

In summary, our proposed GCN with Graph Laplacian regularization effectively leverages the spectral properties of the Graph Laplacian matrix to achieve feature smoothing and noise suppression, without explicitly computing the eigendecomposition. By integrating the Graph Laplacian smoothing prior into the GCN propagation rule, our method enhances the robustness and generalization ability of the learned representations, making it highly suitable for detecting DeepFake videos in challenging real-world scenarios with noisy and unreliable face sequences. The theoretical analysis of the convergence and generalization properties of our method further validates its effectiveness and provides a solid foundation for its application.

3 Experimental Results

3.1 Experimental Configuration

The robustness validation of the proposed method is the core of our investigation, particularly when applied to noisy face sequences containing many invalid faces. To achieve this, the use of representative benchmark datasets is essential. Therefore, we selected three well-established benchmark datasets for performance evaluation: FF++ [1], Celeb-DFv2 [3], and the large-scale DFDC dataset [2]. The FF++ dataset [1] comprises four distinct classes of manipulation methods: 1) DeepFakes (DF), 2) Face2Face (F2F), 3) FaceSwap (FS), and 4) NeuralTextures (NT). For each class, a set of 1,000 original videos was used to generate 1,000 manipulated versions, resulting in a total of 1,000 authentic and 4,000 doctored videos. The Celeb-DF dataset [3] contains 590 original videos and 5,639 manipulated counterparts, generated using improved generative adversarial networks at a resolution of $256\times 256$. To enhance the quality of the manipulated videos, a Kalman filter is employed in Celeb-DF [3] to mitigate temporal inconsistencies between successive frames. In addition to FF++ and Celeb-DF, we also utilize the DeepFake Detection Challenge (DFDC) dataset [2] to further validate the effectiveness of our proposed method. The DFDC dataset, created by Facebook in collaboration with other organizations, is a large-scale dataset designed to facilitate the development of DeepFake detection algorithms. It consists of over 100,000 videos, containing a mix of authentic and manipulated content generated using various state-of-the-art face swapping and facial reenactment techniques, ensuring a diverse and challenging set of DeepFakes for evaluation.

Training Hyperparameters of Our GRACE. To ensure a balanced performance appraisal, the Celeb-DF [3], FF++ [1], and DFDC [2] datasets were divided into training, validation, and testing sets following an $8:1:1$ ratio. In line with our objective to ascertain the efficacy of GRACE in the presence of unstable face detectors, we trained separate GRACE models independently on each dataset. During the training phase, the Adam optimizer [51] was utilized with an initial learning rate of $1\times 10^{-4}$ and a step-learning-decay schedule. We employed the 53-layer Cross Stage Partial Network (CSPNet) [52] as the backbone network. Note that any CNN could be used as the backbone network in GRACE. The standard GCN with our Graph Laplacian was implemented by stacking $g_{n}$ layers, with $g_{n}=8$ and embedding size $g_{\text{dim}}=400$ being the default setting in this study. The number of neurons of the last fully connected layer, $n_{\text{out}}$, is $2048$ for our experiments. All facial images were resized to $144\times 144$ during both training and inference stages. Standard data augmentation techniques, such as random noise, cropping, and flipping, were adopted during the training phase. The training phase consisted of 200 epochs, with a learning rate decay of 0.1 every 100 epochs. We randomly sampled $N=16$ successive facial images to form the input tensor for our experiments. All comparison methods, including the proposed GRACE, were trained on the training set and evaluated on the testing set.
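
The step-decay schedule described above (initial rate $1\times 10^{-4}$, multiplied by 0.1 every 100 epochs over 200 epochs) can be written as a one-line rule. The function name `step_decay_lr` is hypothetical, and the sketch assumes epochs are counted from zero.

```python
def step_decay_lr(epoch, base_lr=1e-4, gamma=0.1, step_size=100):
    """Learning rate at a given zero-based epoch under step decay."""
    return base_lr * gamma ** (epoch // step_size)


# Learning rate at the start, just before, and just after the single decay step.
schedule = {epoch: step_decay_lr(epoch) for epoch in (0, 99, 100, 199)}
print(schedule)
```

Under this reading, the first 100 epochs run at $10^{-4}$ and the remaining 100 epochs at $10^{-5}$.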

Training Hyperparameters of Peer Methods. For performance evaluation, we compared our proposed method with several state-of-the-art DeepFake detection techniques, including MesoNet [5], Xception [6], $F^{3}$-net [32], RECCE [27], DFIL [28], UCF [30], CORE [26], and TALL-Swin [53]. The frame-level approaches, namely MesoNet, Xception, $F^{3}$-net, RECCE, DFIL, UCF, and CORE, were trained using the same strategy as described previously, with their default settings. However, the learning rates of Xception [6] and UCF [30] were adjusted to $2\times 10^{-4}$ for better performance. The video-based approach, TALL-Swin [53], was trained using its default settings. During the training phase, we randomly selected $N$ facial images from the training set. The final authenticity verdict for the input video was determined by averaging the $N$ prediction outcomes corresponding to the $N$ facial images extracted from the input video, using a temporally centered cropping strategy. For all other methods, the number of frames $N$ used was set to $16$. The image size for Xception [6], $F^{3}$-net [32], RECCE [27], UCF [30], and CORE [26] is $256\times 256$, as suggested by their default settings, while those for DFIL [28] and TALL-Swin [53] are $299\times 299$ and $224\times 224$, respectively.

Settings in Inference Phase. To evaluate the model’s performance under the influence of an unstable face detector, we randomly replaced certain facial images with background segments, as determined by the masking ratio $m_{r}$. We experimented with masking ratios ranging from 0.1 to 0.8 to assess the effectiveness of GRACE under varying levels of noise in the face sequences. For instance, with $N=16$ and $m_{r}=0.5$, up to eight facial images could be replaced with background images in the corresponding frames, simulating real-world scenarios where face detection may be challenging or unreliable. In our experimental setup, we sampled $N=16$ frames from the middle portion of each video, following the same approach used during the training process. When $m_{r}=0.5$, half of the 16 frames (i.e., 8) were randomly replaced with either background or completely black images. By varying the masking ratio, we evaluated the robustness and stability of each method under different levels of noise in the face sequences.
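
The masking protocol can be sketched as below. This is only an illustration of the black-image variant: it assumes that the number of masked frames is obtained by rounding $m_{r}\times N$, the function name `mask_face_sequence` is hypothetical, and replacing a face crop with a background patch would additionally require access to the full frame, which is not modeled here.

```python
import numpy as np


def mask_face_sequence(frames, mask_ratio, rng):
    """Replace a random subset of frames with black images."""
    num_frames = frames.shape[0]
    # Assumption: the number of masked frames is round(mask_ratio * N).
    num_masked = int(round(mask_ratio * num_frames))
    masked_idx = rng.choice(num_frames, size=num_masked, replace=False)
    noisy = frames.copy()
    noisy[masked_idx] = 0.0   # an all-zero frame stands in for a black image
    return noisy, np.sort(masked_idx)


rng = np.random.default_rng(0)
frames = np.ones((16, 3, 144, 144), dtype=np.float32)   # placeholder face crops
noisy, masked_idx = mask_face_sequence(frames, mask_ratio=0.5, rng=rng)
```

With $N=16$ and $m_{r}=0.5$, exactly eight of the sixteen frames are blanked, and the input tensor keeps its $16\times 3\times 144\times 144$ shape so that the detector sees a sequence of the usual length.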

Furthermore, we assumed that each frame should contain at least one face to simulate adversarial attacks on face detectors in real-world scenarios. In cases where no face was detected in a frame, we replaced that frame with a black image, generating a noisy face sequence that allowed us to assess the robustness of GRACE under challenging conditions. Our experimental analysis employed three performance metrics: accuracy, Macro F1-Score, and Area Under the Receiver Operating Characteristic Curve (AUC). The Macro F1-Score accurately reflects the model’s performance under label imbalance situations. For simplicity, these metrics are referred to as Accuracy (Acc.), F1-Score, and AUC throughout the experimental sections.
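
To illustrate why the Macro F1-Score is reported alongside accuracy, the following sketch compares the two metrics for a degenerate detector that labels every video as fake. The label convention (1 for fake, 0 for real), the 8:2 class split, and the helper names are hypothetical; AUC is omitted because it requires continuous scores rather than hard labels.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the ground truth."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred, classes=(0, 1)):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


y_true = [1] * 8 + [0] * 2   # 8 fake videos, 2 real videos
y_pred = [1] * 10            # detector that always answers "fake"
acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
print(acc, f1)
```

The always-fake detector reaches an accuracy of 0.8 but a Macro F1-Score of only $4/9\approx 0.444$, since the F1 of the real class is zero. A high accuracy paired with a low F1-Score is therefore a sign of collapsed predictions.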

3.2 Quantitative Results

TABLE II: Quantitative comparison on noisy face sequences under different masking ratios $m_{r}$ between the proposed GRACE and other state-of-the-art methods on the FF++, Celeb-DF, and DFDC datasets. The best performance is highlighted in red and the second-best performance in blue.

| Method | $m_{r}$ | FF++ [1] ACC | FF++ [1] F1 | FF++ [1] AUC | Celeb-DF [3] ACC | Celeb-DF [3] F1 | Celeb-DF [3] AUC | DFDC [2] ACC | DFDC [2] F1 | DFDC [2] AUC |
|---|---|---|---|---|---|---|---|---|---|---|
| Xception [6] | 0.0 | 0.925 | 0.894 | 0.972 | 0.861 | 0.806 | 0.910 | 0.953 | 0.910 | 0.981 |
| | 0.4 | 0.869 | 0.780 | 0.871 | 0.631 | 0.614 | 0.782 | 0.908 | 0.788 | 0.866 |
| | 0.8 | 0.814 | 0.594 | 0.654 | 0.398 | 0.389 | 0.604 | 0.864 | 0.598 | 0.647 |
| $F^{3}$-net [32] | 0.0 | 0.950 | 0.928 | 0.986 | 0.965 | 0.957 | 0.993 | 0.957 | 0.921 | 0.986 |
| | 0.4 | 0.883 | 0.798 | 0.888 | 0.691 | 0.684 | 0.895 | 0.864 | 0.655 | 0.755 |
| | 0.8 | 0.818 | 0.599 | 0.662 | 0.418 | 0.407 | 0.664 | 0.850 | 0.539 | 0.595 |
| RECCE [27] | 0.0 | 0.938 | 0.911 | 0.979 | 0.941 | 0.925 | 0.985 | 0.940 | 0.872 | 0.973 |
| | 0.4 | 0.878 | 0.790 | 0.874 | 0.678 | 0.669 | 0.869 | 0.900 | 0.752 | 0.863 |
| | 0.8 | 0.817 | 0.599 | 0.655 | 0.414 | 0.404 | 0.648 | 0.861 | 0.579 | 0.648 |
| UCF [30] | 0.0 | 0.937 | 0.911 | 0.982 | 0.856 | 0.792 | 0.891 | 0.890 | 0.815 | 0.939 |
| | 0.4 | 0.875 | 0.790 | 0.882 | 0.626 | 0.607 | 0.642 | 0.871 | 0.733 | 0.812 |
| | 0.8 | 0.815 | 0.598 | 0.660 | 0.397 | 0.389 | 0.516 | 0.851 | 0.586 | 0.620 |
| CORE [26] | 0.0 | 0.948 | 0.925 | 0.984 | 0.953 | 0.940 | 0.989 | 0.950 | 0.903 | 0.977 |
| | 0.4 | 0.883 | 0.799 | 0.888 | 0.858 | 0.790 | 0.890 | 0.907 | 0.781 | 0.870 |
| | 0.8 | 0.818 | 0.601 | 0.663 | 0.764 | 0.572 | 0.661 | 0.863 | 0.595 | 0.651 |
| TALL-Swin [53] | 0.0 | 0.913 | 0.868 | 0.881 | 0.913 | 0.933 | 0.924 | 0.911 | 0.812 | 0.984 |
| | 0.4 | 0.867 | 0.767 | 0.740 | 0.847 | 0.789 | 0.825 | 0.872 | 0.758 | 0.786 |
| | 0.8 | 0.827 | 0.605 | 0.589 | 0.745 | 0.680 | 0.645 | 0.845 | 0.688 | 0.650 |
| DFIL [28] | 0.0 | 0.954 | 0.939 | 0.987 | 0.957 | 0.954 | 0.964 | 0.940 | 0.881 | 0.955 |
| | 0.4 | 0.876 | 0.808 | 0.893 | 0.695 | 0.684 | 0.825 | 0.886 | 0.720 | 0.813 |
| | 0.8 | 0.759 | 0.603 | 0.665 | 0.518 | 0.350 | 0.644 | 0.855 | 0.565 | 0.621 |
| GRACE (Ours) | 0.0 | 0.962 | 0.942 | 0.989 | 0.989 | 0.968 | 0.998 | 0.969 | 0.942 | 0.988 |
| | 0.4 | 0.958 | 0.936 | 0.987 | 0.970 | 0.920 | 0.998 | 0.969 | 0.940 | 0.988 |
| | 0.8 | 0.944 | 0.916 | 0.983 | 0.857 | 0.738 | 0.980 | 0.962 | 0.925 | 0.979 |

TABLE III: Comparison of different methods in terms of FLOPs (Floating-point Operations), MACs (Multiply-Accumulate Operations), and number of parameters (#Params).

| Method | FLOPs (T) | MACs (T) | #Params (M) |
|---|---|---|---|
| Xception [6] | 60.796 | 30.356 | 21.861 |
| $F^{3}$-net [31] | 192.604 | 95.880 | 22.125 |
| RECCE [27] | 81.655 | 40.667 | 47.693 |
| UCF [30] | 180.738 | 90.087 | 46.838 |
| CORE [26] | 60.978 | 30.356 | 21.861 |
| TALL-Swin [29] | 30.318 | 15.125 | 86.920 |
| DFIL [28] | 60.976 | 30.356 | 20.811 |
| GRACE (Ours) | 70.751 | 35.246 | 29.661 |

The primary performance assessment comparing the handling of invalid facial images between our proposed model, GRACE, and various state-of-the-art schemes is provided in Table II. Under optimal conditions, where most facial images are valid, GRACE exhibits competitive results, holding its own against other cutting-edge DeepFake video detection methods such as Xception [6], MesoNet [5], $F^{3}$-Net [32], RECCE [27], CORE [26], TALL-Swin [53], and DFIL [28]. It is worth noting that TALL-Swin [53] is a video-based approach.

Specifically, the F1-Score of GRACE for DeepFake video detection, when evaluated on FF++ [1], Celeb-DF [3], and DFDC [2], slightly surpasses those of its contemporaries under clean cases (i.e., $m_{r}=0$). This outcome implies that the proposed GRACE with Graph Laplacian is effective and reliable for DeepFake video detection. However, in scenarios where partial face images are invalid due to purposeful attacks on face detectors, the performance of traditional frame-level methods, including Xception [6], MesoNet [5], $F^{3}$-Net [32], RECCE [27], CORE [26], UCF [30], and DFIL [28], may substantially deteriorate since they fail to consider noisy face sequences in real-world scenarios.

Similarly, the video-level DeepFake detection method TALL-Swin [53], which relies heavily on temporal cues, may suffer further performance degradation as the masking ratio increases. Invalid faces can cause landmark detection failures and incorrect temporal trajectories. Consequently, the F1-Score of TALL-Swin under a masking ratio of 0.8 in the testing phase is lower than 0.7 on all three datasets, suggesting that its predictions collapse toward a single class (either fake or real). In stark contrast, all quality indices of our proposed GRACE, evaluated on different datasets, display promising results, suggesting that GRACE is robust and reliable even under highly noisy face sequences (e.g., when $m_{r}=0.8$). Since most DeepFake detection methods do not account for unreliable face sequences, their degraded performance is to be expected.

To further demonstrate the efficiency and practicality of the proposed GRACE method, we conduct a complexity analysis and compare it with other state-of-the-art DeepFake detection methods. Table III presents the comparison results in terms of floating-point operations (FLOPs), multiply-accumulate operations (MACs), and the number of parameters for each method, using a $16\times 3\times 144\times 144$ input tensor for a fair comparison. GRACE achieves a favorable balance between computational complexity and performance. With 70.751 trillion FLOPs, 35.246 trillion MACs, and 29.661 million parameters, GRACE exhibits a moderate computational overhead. It requires fewer FLOPs and MACs than $F^{3}$-net [31], UCF [30], and RECCE [27], and fewer parameters than UCF, RECCE, and TALL-Swin [29]; TALL-Swin has the lowest FLOPs (30.318 T) but the largest parameter count (86.920 M). Moreover, GRACE demonstrates superior performance in handling noisy face sequences, as shown in the experimental results, at a complexity only moderately higher than that of CORE [26], Xception [6], and DFIL [28]. This highlights the effectiveness of the proposed feature entanglement, graph convolutional network, and graph Laplacian regularization techniques in learning discriminative and robust representations for DeepFake detection. The complexity analysis further substantiates GRACE as a practical and efficient solution for real-world DeepFake detection challenges, offering a compelling trade-off between computational resources and detection accuracy.

Figure 3: The performance comparison of the proposed GRACE and other state-of-the-art methods in terms of AUC under different masking ratios for (a) FF++ [1], (b) DFDC [2], and (c) Celeb-DF [3].

The detailed quantitative results, evaluated on the FF++ [1], DFDC [2], and Celeb-DF [3] datasets, are illustrated in Figs. 3(a), 3(b), and 3(c), respectively. In the clean case, i.e., when $m_{r}=0$, the performance of the proposed method is comparable to other state-of-the-art methods. Performance degradation of the peer methods becomes increasingly pronounced as the masking ratio rises during the testing phase, particularly when the masking ratio ($m_{r}$) exceeds 0.5. The performance of TALL-Swin [53] also declines when the masking ratio surpasses 0.2. A similar trend is discernible in Fig. 3(b), which evaluates the DFDC testing set: the performance of contemporary methods diminishes at higher masking ratios, whereas the proposed GRACE method maintains relatively high performance even at a masking ratio of 0.8. Overall, the AUC comparison in Fig. 3 shows that the proposed GRACE significantly outperforms other state-of-the-art DeepFake detectors, especially under noisy face sequences.

More specifically, most existing DeepFake video/image detection algorithms do not address the impact of noisy face sequences. Although state-of-the-art face detectors perform exceptionally well under pristine conditions, their performance can be severely undermined when subjected to well-engineered post-processing techniques, particularly adversarial perturbations targeting the face detector. Our GRACE method successfully overcomes this shortcoming and introduces a novel and robust DeepFake video detection approach for real-world challenges.

3.3 Hyperparameter Selection

TABLE IV: Performance evaluation of the proposed GRACE with different hyperparameter settings using FF++ [1]. $g_{\text{dim}}$ and $g_n$ are the embedding dimension and the number of layers of the GCN, respectively; $N$ is the number of frames extracted from the video; $n_{\text{out}}$ is the number of neurons of the FC layer; $\alpha$ is the weight of the sparsity term.

| $m_r$ | $N$ | ACC | F1 | AUC | $g_n$ | ACC | F1 | AUC |
|---|---|---|---|---|---|---|---|---|
| 0.8 | 12 | 0.922 | 0.875 | 0.971 | 12 | 0.856 | 0.711 | 0.950 |
| 0.8 | 20 | 0.948 | 0.918 | 0.978 | 4 | 0.948 | 0.918 | 0.982 |
| 0.8 | 16 | 0.944 | 0.916 | 0.983 | 8 | 0.944 | 0.916 | 0.983 |
| 0.7 | 12 | 0.958 | 0.936 | 0.983 | 12 | 0.896 | 0.816 | 0.964 |
| 0.7 | 20 | 0.954 | 0.930 | 0.986 | 4 | 0.952 | 0.926 | 0.983 |
| 0.7 | 16 | 0.960 | 0.938 | 0.985 | 8 | 0.960 | 0.938 | 0.985 |

| $m_r$ | $\alpha$ | ACC | F1 | AUC | $g_{\text{dim}}$ | ACC | F1 | AUC |
|---|---|---|---|---|---|---|---|---|
| 0.8 | $1e^{-7}$ | 0.928 | 0.887 | 0.978 | 600 | 0.896 | 0.805 | 0.966 |
| 0.8 | $1e^{-6}$ | 0.942 | 0.910 | 0.974 | 200 | 0.934 | 0.886 | 0.981 |
| 0.8 | $1e^{-5}$ | 0.944 | 0.916 | 0.983 | 400 | 0.944 | 0.916 | 0.983 |
| 0.7 | $1e^{-7}$ | 0.954 | 0.930 | 0.981 | 600 | 0.938 | 0.896 | 0.984 |
| 0.7 | $1e^{-6}$ | 0.944 | 0.914 | 0.979 | 200 | 0.942 | 0.905 | 0.987 |
| 0.7 | $1e^{-5}$ | 0.960 | 0.938 | 0.985 | 400 | 0.960 | 0.938 | 0.985 |

| $m_r$ | $n_{\text{out}}$ | ACC | F1 | AUC |
|---|---|---|---|---|
| 0.8 | 1024 | 0.926 | 0.871 | 0.969 |
| 0.8 | 3072 | 0.924 | 0.876 | 0.939 |
| 0.8 | 2048 | 0.944 | 0.916 | 0.983 |
| 0.7 | 1024 | 0.944 | 0.908 | 0.984 |
| 0.7 | 3072 | 0.926 | 0.880 | 0.968 |
| 0.7 | 2048 | 0.960 | 0.938 | 0.985 |

To achieve optimal performance and robustness, we conducted a comprehensive ablation study to investigate the impact of various hyperparameters on the proposed GRACE method. This analysis provides insights into the design choices and trade-offs involved in developing an effective DeepFake video detection system for real-world scenarios with noisy face sequences. Table IV presents the performance comparison of GRACE under different hyperparameter settings, evaluated on the challenging FF++ dataset [1].

3.3.1 Number of Extracted Frames ($N$)

The number of frames employed during the training and testing phases is a crucial aspect of GRACE. While using a larger number of frames might intuitively improve performance, it also significantly increases the computational complexity. To strike an optimal balance, we investigated the impact of varying the number of extracted frames. As shown in Table IV, using $N=8$ frames results in the lowest computational complexity but slightly compromises performance in terms of Accuracy, Macro F1-Score, and AUC. Conversely, increasing the number of frames to $N=20$ achieves state-of-the-art performance for most masking ratios during testing. Considering the trade-off between effectiveness and efficiency, we recommend using $N=16$ frames as the optimal setting for GRACE.
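One simple way to obtain a fixed number of frames from videos of different lengths is to pick evenly spaced indices, as sketched below. This section does not state how frames are selected, so the uniform spacing and the helper name `sample_frame_indices` are assumptions made purely for illustration.

```python
def sample_frame_indices(total_frames, n=16):
    """Pick n roughly evenly spaced frame indices from a video with total_frames frames."""
    if total_frames <= 0 or n <= 0:
        raise ValueError("total_frames and n must be positive")
    if n >= total_frames:
        # Short video: use every available frame.
        return list(range(total_frames))
    return [(i * total_frames) // n for i in range(n)]

indices = sample_frame_indices(total_frames=300, n=16)
print(indices)
```

Because the cost of the per-frame backbone grows with the number of sampled frames, changing `n` is the direct handle on the accuracy versus complexity trade-off discussed above.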

3.3.2 Number of GCN Layers ($g_n$)

The depth of the Graph Convolutional Network (GCN) plays a vital role in learning robust feature representations. However, stacking too many layers with the Graph Laplacian smooth prior may lead to over-smoothing of nodes and reduce the discriminative power. We explored the impact of varying the number of GCN layers ($g_n$) in GRACE. As presented in Table IV, setting $g_n=12$ results in suboptimal performance compared to $g_n=8$ and $g_n=4$, likely due to convergence difficulties within the given 200 epochs. While $g_n=4$ achieves outstanding performance overall, it slightly underperforms in highly noisy conditions (i.e., $m_r=0.8$) compared to $g_n=8$. Therefore, we suggest using $g_n=8$ as a balanced choice for stable and robust performance across various noise levels.
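The over-smoothing effect mentioned above can be illustrated with a generic toy example: repeatedly propagating node features over a normalized adjacency matrix pulls all node values toward each other. The sketch below uses plain linear propagation $x \leftarrow D^{-1}(A + I)\,x$ on a hypothetical 4-node path graph, without learned weights or nonlinearities. It is a simplification for intuition and not the GRACE layer.

```python
def row_normalize(adj):
    """Add self-loops and divide each row by its sum: P = D^-1 (A + I)."""
    n = len(adj)
    with_loops = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
                  for i in range(n)]
    return [[v / sum(row) for v in row] for row in with_loops]

def propagate(p, x, layers):
    """Apply `layers` rounds of linear propagation x <- P x."""
    for _ in range(layers):
        x = [sum(p[i][j] * x[j] for j in range(len(x))) for i in range(len(x))]
    return x

def spread(x):
    """Gap between the largest and smallest node value."""
    return max(x) - min(x)

# Hypothetical 4-node path graph and scalar node features.
adj = [[0, 1, 0, 0],
       [1, 0, 1, 0],
       [0, 1, 0, 1],
       [0, 0, 1, 0]]
p = row_normalize(adj)
x0 = [1.0, 0.0, 0.0, -1.0]
spreads = {k: spread(propagate(p, x0, k)) for k in (0, 4, 8, 12)}
for k, s in spreads.items():
    print(k, round(s, 4))
```

The spread shrinks as the number of propagation rounds grows, which is the sense in which very deep stacks can wash out the differences that a classifier needs.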

3.3.3 Sparsity Penalty Term ($\alpha$)

The sparsity penalty term $\alpha$ in the proposed loss function controls the balance between the sparsity constraint and the classification objective. A higher value of $\alpha$ encourages GRACE to learn a sparser feature representation, which is particularly beneficial for DeepFake video detection in the presence of invalid facial images. We investigated the impact of $\alpha$ by varying its value from $1e^{-7}$ to $1e^{-5}$. As shown in Table IV, a higher sparsity penalty enhances the network's ability to learn essential and discriminative features, thereby reducing the influence of invalid faces and improving overall performance. However, setting $\alpha$ higher than $1e^{-5}$ leads to convergence difficulties. Based on our analysis, we recommend using $\alpha=1e^{-5}$ to achieve a balanced trade-off between sparsity and convergence stability.
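The role of $\alpha$ as a weight between the two objectives can be sketched as a weighted sum of a classification loss and a sparsity term. The exact form of the sparsity constraint is not given in this section, so the L1 norm of the affinity matrix used below is an assumption, as are the helper names `cross_entropy`, `l1_norm`, and `total_loss`.

```python
import math

def cross_entropy(probs, label):
    """Negative log-likelihood of the true class under predicted probabilities."""
    return -math.log(probs[label])

def l1_norm(matrix):
    """Sum of absolute entries; an assumed stand-in for the sparsity constraint."""
    return sum(abs(v) for row in matrix for v in row)

def total_loss(probs, label, affinity, alpha=1e-5):
    """Classification loss plus alpha-weighted sparsity penalty (illustrative form)."""
    return cross_entropy(probs, label) + alpha * l1_norm(affinity)

# Toy values: a 2x2 affinity matrix and a two-class prediction.
affinity = [[1.0, -2.0], [0.5, 0.0]]
probs = [0.25, 0.75]
loss = total_loss(probs, label=1, affinity=affinity, alpha=1e-5)
print(loss)
```

Under this assumed form, the penalty sums over every entry of the affinity matrix, so even a small $\alpha$ can contribute noticeably for a large matrix. This is one plausible reading of why larger values of $\alpha$ hinder convergence, not a claim made in the text.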

3.3.4 GCN Embedding Dimension ($g_{\text{dim}}$)

The embedding dimension of the GCN ($g_{\text{dim}}$) determines the richness of the learned feature representations for DeepFake video detection. We investigated the impact of $g_{\text{dim}}$ by comparing the performance of GRACE with $g_{\text{dim}} \in \{200, 400, 600\}$, as shown in Table IV. Since the dimension of the graph representation ${\bm{A}}$ is $400\times 400$, intuitively, the best performance is achieved when $g_{\text{dim}}=400$. Reducing $g_{\text{dim}}$ below this value limits the expressive power of the GCN, while increasing it beyond this value introduces redundancy and harms performance. Therefore, we suggest setting $g_{\text{dim}}=400$ for optimal results.

3.3.5 Number of Fully Connected Layer Neurons ($n_{\text{out}}$)

To aggregate the output of the GCN and feed it into the softmax classifier, a simple fully connected (FC) layer is employed, projecting the graph representation to an $n_{\text{out}}$-dimensional feature vector. We investigated the impact of $n_{\text{out}}$ by comparing the performance of GRACE with $n_{\text{out}} \in \{1024, 2048, 3072\}$, as shown in Table IV. While $n_{\text{out}}=2048$ achieves excellent performance under highly noisy face sequences, the performance gap between $n_{\text{out}}=2048$ and $n_{\text{out}}=1024$ is insignificant, suggesting that the choice of $n_{\text{out}}$ is not highly sensitive. Based on our analysis, we recommend setting $n_{\text{out}}=2048$ for a good balance between performance and computational complexity.

The comprehensive analysis of the hyperparameters presented in this section highlights the robustness and effectiveness of the proposed GRACE method under various hyperparameter settings. By carefully selecting these hyperparameters, GRACE achieves state-of-the-art performance in DeepFake video detection, even in challenging real-world scenarios with noisy face sequences. The insights gained from this analysis provide valuable guidance for practitioners and researchers aiming to develop robust and efficient DeepFake detection systems.

3.4 Ablation Study

Table V presents an ablation study of the proposed modules in GRACE, i.e., the GCN, the Graph Laplacian smooth prior, and the sparsity regularizer, where the performance is evaluated on noisy face sequences (i.e., $m_r=0.8$ and $m_r=0.7$). Note that when none of the proposed modules is adopted, we use a Transformer [46] as the classification head, with four-head multi-head self-attention (MHSA) and an embedding size of 512, to keep the number of parameters similar to that of GRACE; this baseline can be treated as a variant of the Convolutional Transformer. When we enable the GCN for the proposed feature entanglement and its affinity matrix, the performance of DeepFake video detection under noisy face sequences improves markedly, implying that the feature entanglement and its graph representation judiciously embed the different spatiotemporal features into every node, thereby reducing the impact of invalid faces under noisy face sequences. Furthermore, the Graph Laplacian smooth prior improves robustness since it can filter noisy nodes that might contain many invalid faces without significantly increasing computational complexity. As shown in Fig. 4, the convergence of the proposed GRACE with the Graph Laplacian remains stable and shows outstanding performance on the FF++ [1] validation set with $m_r=0$. Finally, the sparsity regularizer significantly benefits the robustness of DeepFake detection since it encourages GRACE to learn essential and sparse features for the valid faces.
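The intuition behind a graph Laplacian smoothness prior can be shown with the standard quadratic form $x^{\top} L x$, where $L = D - A$. For a symmetric affinity matrix it equals $\tfrac{1}{2}\sum_{i,j} A_{ij}(x_i - x_j)^2$, so it is small when connected nodes carry similar values and large when one node deviates. The sketch below uses a hypothetical fully connected 4-node graph with scalar node features. How GRACE applies the prior inside the network is not specified in this section, so this is a generic illustration only.

```python
def laplacian(adj):
    """Combinatorial graph Laplacian L = D - A for a symmetric adjacency matrix."""
    n = len(adj)
    return [[(sum(adj[i]) if i == j else 0.0) - adj[i][j] for j in range(n)]
            for i in range(n)]

def smoothness(adj, x):
    """Quadratic form x^T L x; larger values mean a less smooth graph signal."""
    lap = laplacian(adj)
    n = len(x)
    return sum(x[i] * lap[i][j] * x[j] for i in range(n) for j in range(n))

# Hypothetical fully connected graph over four nodes (no self-loops).
adj = [[0.0 if i == j else 1.0 for j in range(4)] for i in range(4)]
clean = [1.0, 1.0, 1.0, 1.0]   # all nodes agree
noisy = [1.0, 1.0, 1.0, 5.0]   # one node deviates, e.g. from an invalid face
print(smoothness(adj, clean), smoothness(adj, noisy))
```

Penalizing or filtering with respect to this quantity discourages isolated outlier nodes, which matches the role attributed to the prior above.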

TABLE V: Ablation study of the proposed GRACE using different classification heads and components. GCN, GL, and Spa. denote the Graph Convolutional Network, Graph Laplacian smooth prior, and Sparsity regularizer, respectively.

| $m_r$ | GCN | GL | Spa. | ACC | F1 | AUC | #Param / MACs |
|---|---|---|---|---|---|---|---|
| 0.8 |  |  |  | 0.844 | 0.655 | 0.705 | 40.79M / 38.82 |
| 0.8 |  |  |  | 0.926 | 0.873 | 0.931 | 29.66M / 35.25 |
| 0.8 |  |  |  | 0.924 | 0.876 | 0.977 |  |
| 0.8 |  |  |  | 0.946 | 0.912 | 0.975 |  |
| 0.8 |  |  |  | 0.944 | 0.916 | 0.983 |  |
| 0.7 |  |  |  | 0.858 | 0.700 | 0.794 | 40.79M / 38.82 |
| 0.7 |  |  |  | 0.938 | 0.900 | 0.965 | 29.66M / 35.25 |
| 0.7 |  |  |  | 0.950 | 0.920 | 0.982 |  |
| 0.7 |  |  |  | 0.952 | 0.925 | 0.974 |  |
| 0.7 |  |  |  | 0.960 | 0.938 | 0.985 |  |
Figure 4: The validation accuracy curve evaluated on FF++ [1] of the proposed GRACE with and without Graph Laplacian and Sparsity regularizer.
Figure 5: (a) An example of the noisy face sequence caused by a PGD-like adversarial attack with $\epsilon=0.04$, $\alpha_{\text{adv}}=0.01$, and $s=10$, resulting in an approximate $m_r=0.2$ condition, where mis-detected faces were replaced with black ones. (b) The corresponding face sequence without adversarial perturbation.

3.5 Adversarial Attack on Face Detector

As previously discussed, DeepFake videos can be intentionally perturbed to evade detection by face detectors, rendering DeepFake detection ineffective. To emulate this real-world challenge, we employ an open-source adversarial attack on the MTCNN face detector [54], leveraging a PGD-like algorithm [39] with a maximum perturbation value of $\epsilon=0.04$, step size $\alpha_{\text{adv}}=0.01$, and number of iterations $s=10$ to perturb the test set of FF++ [1]. The step size $\alpha_{\text{adv}}$ determines the magnitude of each perturbation step applied to the input image during the iterative adversarial attack process. Assuming that each frame must contain at least one face detectable by MTCNN to assess the rate of missed detections, a black image replaces the frame when no face is detected. The results reveal that the average number of missed face detections is 3.58 for the FF++ [1] test set, closely mirroring an $m_r=0.2$ scenario. However, the perturbed face sequences not only make it more difficult to recognize whether the video is fake due to the adversarial noise affecting the image quality, as exemplified in Fig. 5, but also introduce another challenge: the adversarial attack could cause the face detector to extract non-facial regions (e.g., background). This implies that the actual $m_r$ in this case could be even higher than the estimated $0.2$, as some of the detected faces might not be genuine facial regions. Despite these challenging conditions, our GRACE method maintains strong performance compared to other state-of-the-art methods, as illustrated in Table VI. Notably, the performance of all peer methods dropped significantly, partly due to the missed detections and partly because the adversarial noise introduces spatial distortions in the facial images. This real-world simulation further supports GRACE as a generalized, robust, and effective DeepFake detection model capable of handling noisy face sequences.
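The mechanics of a PGD-style attack with the stated settings ($\epsilon=0.04$, step size $0.01$, $s=10$ iterations) can be sketched on a toy problem. The code below is not the open-source attack used in the experiments and does not touch MTCNN: the linear "detector score" with fixed weights, the three-pixel input, and all function names are hypothetical stand-ins chosen so that the example runs without external libraries.

```python
def sign(v):
    """Sign of a number as -1, 0, or 1."""
    return (v > 0) - (v < 0)

def pgd_attack(x, grad_fn, eps=0.04, step=0.01, iters=10):
    """L-infinity PGD that lowers a score: signed-gradient steps, then projection."""
    x_adv = list(x)
    for _ in range(iters):
        grad = grad_fn(x_adv)
        for i, g in enumerate(grad):
            v = x_adv[i] - step * sign(g)            # step against the score gradient
            v = min(max(v, x[i] - eps), x[i] + eps)  # project into the eps-ball around x
            x_adv[i] = min(max(v, 0.0), 1.0)         # keep pixels in the valid [0, 1] range
    return x_adv

# Hypothetical linear "face confidence" score; a real detector would be a network.
WEIGHTS = [0.5, -1.0, 2.0]

def toy_score(x):
    return sum(w * xi for w, xi in zip(WEIGHTS, x))

def toy_grad(x):
    return list(WEIGHTS)  # gradient of a linear score is constant

clean = [0.5, 0.5, 0.5]
adversarial = pgd_attack(clean, toy_grad)
print(toy_score(clean), toy_score(adversarial))
```

Since ten steps of size 0.01 would move a pixel by up to 0.1, the projection onto the $\epsilon$-ball is what keeps the final perturbation within 0.04 of the clean input. As a separate back-of-the-envelope check, and assuming the reported average refers to sequences of $N=16$ frames, $3.58/16 \approx 0.22$, which is in line with the $m_r=0.2$ comparison made above.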

TABLE VI: The performance comparison of the proposed GRACE and other methods trained on FF++ [1] under simulated real-world scenarios (i.e., adversarial attack on face detector).

| Method | ACC | F1 | AUC |
|---|---|---|---|
| Xception [6] | 0.614 | 0.565 | 0.723 |
| $F^{3}$-net [32] | 0.698 | 0.601 | 0.754 |
| RECCE [27] | 0.760 | 0.633 | 0.824 |
| UCF [30] | 0.745 | 0.613 | 0.811 |
| CORE [26] | 0.712 | 0.590 | 0.754 |
| TALL-Swin [53] | 0.796 | 0.715 | 0.848 |
| DFIL [28] | 0.755 | 0.691 | 0.805 |
| GRACE (Ours) | 0.910 | 0.883 | 0.937 |

3.6 Limitations and Discussion

This study introduces a novel approach, GRACE, to address the challenge of DeepFake video detection in the presence of noisy face sequences. GRACE leverages feature entanglement with sparse constraints and a graph convolutional network with graph Laplacian regularization to effectively exploit the spatial-temporal correlations in face sequences while suppressing the impact of noise and distortions. The experimental results demonstrate the efficacy of GRACE in handling noisy face sequences and achieving state-of-the-art performance on several benchmark datasets.

However, it is essential to acknowledge the limitations of the current study and discuss potential future directions. One limitation is that while GRACE has shown strong performance on the evaluated datasets, its effectiveness on cross-dataset scenarios, where the training and testing data come from different sources, has not been extensively explored. The robustness of GRACE to domain shifts and variations in noise characteristics across different datasets requires further investigation. Nonetheless, it is worth emphasizing that GRACE represents the first dedicated effort to tackle the problem of noisy face sequences in DeepFake video detection, which has been largely overlooked in previous research. The proposed methodology and insights from this study lay a solid foundation for future work in this important direction.

Another aspect to consider is that GRACE currently does not incorporate masked learning strategies, which have shown promise in handling occlusions and missing data in various computer vision tasks. Integrating masked learning techniques into the GRACE framework could potentially further enhance its robustness to partial occlusions and incomplete face sequences. Moreover, the use of graph convolutional networks in GRACE allows for flexible processing of video frames, as the input frames are not required to be strictly sequential. This property could be leveraged to develop more efficient and adaptive sampling strategies for processing long video sequences.

It is also worth noting that while GRACE has demonstrated significant improvements over existing methods, there is still room for further enhancements. One direction could be to explore more advanced graph neural network architectures, such as graph attention networks or graph transformers, to better capture the complex dependencies and interactions among the spatial-temporal features. Additionally, incorporating prior knowledge or constraints specific to the DeepFake detection domain, such as the consistency of facial landmarks or the coherence of audio-visual signals, could potentially boost the performance and generalizability of the proposed approach.

In conclusion, GRACE represents a significant step forward in addressing the challenge of DeepFake video detection in the presence of noisy face sequences. While acknowledging the limitations and potential areas for improvement, we believe that the proposed methodology opens up new avenues for research in this critical domain. Future work could focus on extending GRACE to handle cross-dataset scenarios, integrating masked learning techniques, exploring more advanced graph neural network architectures, and incorporating domain-specific prior knowledge. As DeepFake techniques continue to evolve and become more sophisticated, developing robust and reliable detection methods that can operate effectively in real-world scenarios with noisy and challenging data remains an ongoing research endeavor of paramount importance.

4 Conclusions

In this work, we proposed a robust and generalized Graph-Regularized Attentive Convolutional Entanglement (GRACE) approach for DeepFake video detection, specifically designed to address the challenges posed by noisy and unreliable face sequences. The proposed GRACE framework leverages spatiotemporal feature entanglement, graph convolutional networks, and graph Laplacian regularization to effectively capture discriminative features while mitigating the impact of invalid facial images. Extensive experiments on benchmark datasets, including FF++ [1], Celeb-DF [3], and DFDC [2], demonstrate the superior performance of GRACE compared to state-of-the-art methods, especially under noisy face sequences. The robustness of GRACE is further validated through real-world simulations involving adversarial attacks on face detectors. The proposed GRACE represents a significant step forward in robust and generalized DeepFake video detection under challenging conditions, contributing to the development of more reliable multimedia forensics techniques in the era of deepfakes.

References

  • [1] A. Rossler, D. Cozzolino, L. Verdoliva, C. Riess, J. Thies, and M. Nießner, “Faceforensics++: Learning to detect manipulated facial images,” in Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 1–11.
  • [2] B. Dolhansky, J. Bitton, B. Pflaum, J. Lu, R. Howes, M. Wang, and C. C. Ferrer, “The deepfake detection challenge (dfdc) dataset,” arXiv preprint arXiv:2006.07397, 2020.
  • [3] Y. Li, X. Yang, P. Sun, H. Qi, and S. Lyu, “Celeb-df: A large-scale challenging dataset for deepfake forensics,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 3207–3216.
  • [4] B. Zi, M. Chang, J. Chen, X. Ma, and Y.-G. Jiang, “Wilddeepfake: A challenging real-world dataset for deepfake detection,” in Proceedings of the 28th ACM international conference on multimedia, 2020, pp. 2382–2390.
  • [5] D. Afchar, V. Nozick, J. Yamagishi, and I. Echizen, “Mesonet: A compact facial video forgery detection network,” in 2018 IEEE international workshop on information forensics and security (WIFS).   IEEE, 2018, pp. 1–7.
  • [6] F. Chollet, “Xception: Deep learning with depthwise separable convolutions,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 1251–1258.
  • [7] H. Mo, B. Chen, and W. Luo, “Fake faces identification via convolutional neural network,” in Proceedings of the 6th ACM workshop on information hiding and multimedia security, 2018, pp. 43–47.
  • [8] F. Marra, D. Gragnaniello, D. Cozzolino, and L. Verdoliva, “Detection of gan-generated fake images over social networks,” in 2018 IEEE conference on multimedia information processing and retrieval (MIPR).   IEEE, 2018, pp. 384–389.
  • [9] S.-Y. Wang, O. Wang, R. Zhang, A. Owens, and A. A. Efros, “Cnn-generated images are surprisingly easy to spot… for now,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 8695–8704.
  • [10] H. H. Nguyen, F. Fang, J. Yamagishi, and I. Echizen, “Multi-task learning for detecting and segmenting manipulated facial images and videos,” in 2019 IEEE 10th international conference on biometrics theory, applications and systems (BTAS).   IEEE, 2019, pp. 1–8.
  • [11] M. Kim, S. Tariq, and S. S. Woo, “Fretal: Generalizing deepfake detection using knowledge distillation and representation learning,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 1001–1012.
  • [12] H. H. Nguyen, J. Yamagishi, and I. Echizen, “Use of a capsule network to detect fake images and videos,” arXiv preprint arXiv:1910.12467, 2019.
  • [13] C.-C. Hsu, Y.-X. Zhuang, and C.-Y. Lee, “Deep fake image detection based on pairwise learning,” Applied Sciences, vol. 10, no. 1, p. 370, 2020.
  • [14] Y.-X. Zhuang and C.-C. Hsu, “Detecting generated image based on a coupled network with two-step pairwise learning,” in 2019 IEEE international conference on image processing (ICIP).   IEEE, 2019, pp. 3212–3216.
  • [15] C.-C. Hsu, C.-Y. Lee, and Y.-X. Zhuang, “Learning to detect fake face images in the wild,” in 2018 international symposium on computer, consumer and control (IS3C).   IEEE, 2018, pp. 388–391.
  • [16] I. Masi, A. Killekar, R. M. Mascarenhas, S. P. Gurudatt, and W. AbdAlmageed, “Two-branch recurrent network for isolating deepfakes in videos,” in European conference on computer vision.   Springer, 2020, pp. 667–684.
  • [17] D. Güera and E. J. Delp, “Deepfake video detection using recurrent neural networks,” in 2018 15th IEEE international conference on advanced video and signal based surveillance (AVSS).   IEEE, 2018, pp. 1–6.
  • [18] L. Li, J. Bao, T. Zhang, H. Yang, D. Chen, F. Wen, and B. Guo, “Face x-ray for more general face forgery detection,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 5001–5010.
  • [19] P. Wang, K. Liu, W. Zhou, H. Zhou, H. Liu, W. Zhang, and N. Yu, “Adt: Anti-deepfake transformer,” in ICASSP 2022-2022 IEEE international conference on acoustics, speech and signal processing (ICASSP).   IEEE, 2022, pp. 2899–2903.
  • [20] Z. Sun, Y. Han, Z. Hua, N. Ruan, and W. Jia, “Improving the efficiency and robustness of deepfakes detection through precise geometric features,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 3609–3618.
  • [21] E. Sabir, J. Cheng, A. Jaiswal, W. AbdAlmageed, I. Masi, and P. Natarajan, “Recurrent convolutional strategies for face manipulation detection in videos,” Interfaces (GUI), vol. 3, no. 1, pp. 80–87, 2019.
  • [22] X. Yang, Y. Li, and S. Lyu, “Exposing deep fakes using inconsistent head poses,” in ICASSP 2019-2019 IEEE international conference on acoustics, speech and signal processing (ICASSP).   IEEE, 2019, pp. 8261–8265.
  • [23] Y. Li, M.-C. Chang, and S. Lyu, “In ictu oculi: Exposing ai created fake videos by detecting eye blinking,” in 2018 IEEE International workshop on information forensics and security (WIFS).   IEEE, 2018, pp. 1–7.
  • [24] U. A. Ciftci, I. Demir, and L. Yin, “Fakecatcher: Detection of synthetic portrait videos using biological signals,” IEEE transactions on pattern analysis and machine intelligence, 2020.
  • [25] Y. Li and S. Lyu, “Exposing deepfake videos by detecting face warping artifacts,” arXiv preprint arXiv:1811.00656, 2018.
  • [26] Y. Ni, D. Meng, C. Yu, C. Quan, D. Ren, and Y. Zhao, “Core: Consistent representation learning for face forgery detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 12–21.
  • [27] J. Cao, C. Ma, T. Yao, S. Chen, S. Ding, and X. Yang, “End-to-end reconstruction-classification learning for face forgery detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 4113–4122.
  • [28] K. Pan, Y. Yin, Y. Wei, F. Lin, Z. Ba, Z. Liu, Z. Wang, L. Cavallaro, and K. Ren, “Dfil: Deepfake incremental learning by exploiting domain-invariant forgery clues,” in Proceedings of the 31st ACM International Conference on Multimedia, 2023, pp. 8035–8046.
  • [29] R. Creager, “hideface: Exploring a Non-Traditional Adversarial Attack,” https://github.com/rccreager/hideface, 2022.
  • [30] Z. Yan, Y. Zhang, Y. Fan, and B. Wu, “Ucf: Uncovering common features for generalizable deepfake detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 22412–22423.
  • [31] J. Frank, T. Eisenhofer, L. Schönherr, A. Fischer, D. Kolossa, and T. Holz, “Leveraging frequency analysis for deep fake image recognition,” in International conference on machine learning.   PMLR, 2020, pp. 3247–3258.
  • [32] Y. Qian, G. Yin, L. Sheng, Z. Chen, and J. Shao, “Thinking in frequency: Face forgery detection by mining frequency-aware clues,” in European conference on computer vision.   Springer, 2020, pp. 86–103.
  • [33] N. Carlini and H. Farid, “Evading deepfake-image detectors with white-and black-box attacks,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, 2020, pp. 658–659.
  • [34] S. Hussain, P. Neekhara, M. Jere, F. Koushanfar, and J. McAuley, “Adversarial deepfakes: Evaluating vulnerability of deepfake detectors to adversarial examples,” in Proceedings of the IEEE/CVF winter conference on applications of computer vision, 2021, pp. 3348–3357.
  • [35] P. Neekhara, B. Dolhansky, J. Bitton, and C. C. Ferrer, “Adversarial threats to deepfake detection: A practical perspective,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 923–932.
  • [36] G.-L. Chen and C.-C. Hsu, “Jointly defending deepfake manipulation and adversarial attack using decoy mechanism,” IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–11, 2023.
  • [37] A. J. Bose and P. Aarabi, “Adversarial attacks on face detectors using neural net based constrained optimization,” in 2018 IEEE 20th international workshop on multimedia signal processing (MMSP).   IEEE, 2018, pp. 1–6.
  • [38] I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” arXiv preprint arXiv:1412.6572, 2014.
  • [39] N. Carlini and D. Wagner, “Towards evaluating the robustness of neural networks,” in 2017 IEEE symposium on security and privacy (SP).   IEEE, 2017, pp. 39–57.
  • [40] S. Baluja and I. Fischer, “Adversarial transformation networks: Learning to generate adversarial examples,” arXiv preprint arXiv:1703.09387, 2017.
  • [41] C. Xie, J. Wang, Z. Zhang, Y. Zhou, L. Xie, and A. Yuille, “Adversarial examples for semantic segmentation and object detection,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 1369–1378.
  • [42] J. Deng, J. Guo, E. Ververas, I. Kotsia, and S. Zafeiriou, “Retinaface: Single-shot multi-level face localisation in the wild,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 5203–5212.
  • [43] D. E. King, “Dlib-ml: A machine learning toolkit,” The Journal of Machine Learning Research, vol. 10, pp. 1755–1758, 2009.
  • [44] D. Neimark, O. Bar, M. Zohar, and D. Asselmann, “Video transformer network,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 3163–3172.
  • [45] D. Wodajo and S. Atnafu, “Deepfake video detection using convolutional vision transformer,” arXiv preprint arXiv:2102.11126, 2021.
  • [46] A. Dosovitskiy and L. Beyer, “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.
  • [47] S. Zhang, H. Tong, J. Xu, and R. Maciejewski, “Graph convolutional networks: a comprehensive review,” Computational Social Networks, vol. 6, no. 1, pp. 1–23, 2019.
  • [48] F. Wu, A. Souza, T. Zhang, C. Fifty, T. Yu, and K. Weinberger, “Simplifying graph convolutional networks,” in International conference on machine learning.   PMLR, 2019, pp. 6861–6871.
  • [49] A. Ortega, P. Frossard, J. Kovačević, J. M. Moura, and P. Vandergheynst, “Graph signal processing: Overview, challenges, and applications,” Proceedings of the IEEE, vol. 106, no. 5, pp. 808–828, 2018.
  • [50] X. Dong, D. Thanou, P. Frossard, and P. Vandergheynst, “Laplacian matrix learning for smooth graph signal representation,” in 2015 IEEE international conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2015, pp. 3736–3740.
  • [51] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” 2017.
  • [52] A. Bochkovskiy, C.-Y. Wang, and H.-Y. M. Liao, “Yolov4: Optimal speed and accuracy of object detection,” arXiv preprint arXiv:2004.10934, 2020.
  • [53] Y. Xu, J. Liang, G. Jia, Z. Yang, Y. Zhang, and R. He, “Tall: Thumbnail layout for deepfake video detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 22658–22668.
  • [54] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao, “Joint face detection and alignment using multitask cascaded convolutional networks,” IEEE signal processing letters, vol. 23, no. 10, pp. 1499–1503, 2016.