GRACE: Graph-Regularized Attentive Convolutional Entanglement with Laplacian Smoothing for Robust DeepFake Video Detection

This study was supported in part by the Ministry of Science and Technology (MOST), Taiwan, under grants MOST XXX; and partly by the Higher Education Sprout Project of Ministry of Education (MOE) to the Headquarters of University Advancement at National Cheng Kung University (NCKU). (Corresponding author: Chih-Chung Hsu.) C.-C. Hsu, S.-N. Chen, M.-H. Wu, Y.-F. Wang, C.-M. Lee and Y.-S. Chou are with Institute of Data Science and Department of Statistics, National Cheng Kung University, Tainan, Taiwan (R.O.C.).

Chih-Chung Hsu, Shao-Ning Chen, Mei-Hsuan Wu,
Yi-Fang Wang, Chia-Ming Lee, Yi-Shiuan Chou

Abstract

As DeepFake video manipulation techniques escalate and pose profound threats, efficient detection strategies are urgently needed. One particular issue, however, is that facial images may be mis-detected, often because of degraded videos or adversarial attacks, which leads to unexpected temporal artifacts that can undermine the efficacy of DeepFake video detection techniques. This paper introduces a novel method for robust DeepFake video detection, the proposed Graph-Regularized Attentive Convolutional Entanglement (GRACE), which builds on a graph convolutional network with a graph Laplacian prior to address these challenges. First, conventional Convolutional Neural Networks are deployed to extract spatiotemporal features from the entire video. Then, the spatial and temporal features are mutually entangled by constructing a graph with a sparse constraint, so that the essential features of valid face images in the noisy face sequence are retained, thus improving the stability and performance of DeepFake video detection. Furthermore, a Graph Laplacian prior is introduced into the graph convolutional network to remove noise patterns in the feature space and further improve performance. Comprehensive experiments illustrate that the proposed method delivers state-of-the-art performance in DeepFake video detection under noisy face sequences. The source code is available at https://github.com/ming053l/GRACE.

Index Terms:

DeepFake Detection, Feature Entanglement, Graph Convolution Network, Adversarial Attack, Forgery Detection.

1 Introduction

With the widespread use of fake images and videos on various social network platforms for creating fake news and defrauding personal information, identifying synthesized content generated by generative adversarial networks (GANs) and variational autoencoders (VAEs) has become a critical challenge. As generative models advance and improve rapidly, current DeepFake detection techniques struggle to maintain effectiveness. To address this, several large-scale fake image datasets, such as FaceForensics++ (FF++) [1], DeepFake Challenge Dataset (DFCD) [2], Celeb-DF [3], and WildDeepFake [4] have been established to promote the development of effective DeepFake detection techniques.

DeepFake image and video manipulation techniques have emerged as the most well-known forgery generation applications, with far-reaching impacts on numerous individuals. Generally, facial manipulation schemes can be classified into four categories [1]: 1) entire face synthesis, 2) attribute manipulation, 3) identity swap, and 4) expression swap. Identity swap schemes have the most significant impact as they can be used to fabricate fake news targeting specific politicians. Many DeepFake detection techniques focus on identifying such fake videos using supervised learning methods with a pre-collected large-scale training set [5, 6, 7, 8, 9, 10, 11, 12].

Several advanced learning strategies have been proposed to enhance the performance of DeepFake image detection. For instance, methods in [7, 8] treat DeepFake image detection as binary classification tasks. The method in [10] introduces a novel multi-task learning approach to improve robustness and effectiveness. The authors in [9] also assert that traditional convolutional neural networks (CNNs) can be used to easily extract fake traces. However, the generalizability of such supervised learning strategies may be limited, as it is challenging to recognize DeepFake images generated by unknown GANs [13] due to difficulties in discerning out-of-distribution feature representations. To address the generalizability issue, semi-supervised learning is considered in [13, 14, 15] to capture common fake features from selected representative GANs, assuming that most GANs might share similar identifiable clues. Pairwise learning is then employed to learn these common features from the training set, improving generalizability for DeepFake image detection [14, 13]. Additionally, knowledge distillation is proposed for DeepFake detection by effectively transferring the weights of complex models to smaller models for enhanced generalizability [11].

In the domain of DeepFake video detection, numerous sophisticated approaches have recently emerged [16][17][18][19][20][21][22][23][24][25]. On the one hand, these methods extend DeepFake image detection techniques by averaging the predictions of individual frames to assess a video’s authenticity [18][23][24][25]. On the other hand, the temporal inconsistency feature is exploited for DeepFake video classification using supervised learning approaches, as demonstrated in [20][21][16][17]. Specifically, state-of-the-art DeepFake video detection primarily focuses on exploiting various priors, such as bio-informatics clues [24], facial warping artifacts [25], and noise patterns [18]. Recently, several advanced techniques have been proposed to enhance DeepFake video detection performance. CORE [26] introduces a novel approach for learning consistent representations across different frames, while RECCE [27] employs a reconstruction-classification learning scheme to capture more discriminative features. DFIL [28] proposes an incremental learning framework that exploits domain-invariant forgery clues to improve generalization ability. TALL-Swin [29] utilizes a thumbnail layout and Swin Transformer to learn robust spatiotemporal features for DeepFake detection. UCF [30] focuses on uncovering common features shared by different manipulation techniques to enhance generalizability.

Another strategy for effectively detecting DeepFake images and videos is the extra-clue-inspired approach. In [24], a novel bio-feature, the Photoplethysmography (PPG) response, is utilized to differentiate DeepFake videos, as real and fake videos exhibit distinct PPG features. A critical limitation is the necessity for high-resolution videos and images to effectively capture PPG cues. Moreover, [25] investigates warping artifacts at the boundaries of DeepFake videos, which arise due to the limited resolution of synthesized facial components. However, contemporary GANs can generate high-resolution, realistic faces, rendering the resolution-inconsistency clues in [25] potentially less significant. Similarly, Face X-ray [18] leverages the boundary between real and fake facial regions as features, positing that the noise patterns of these parts differ and enabling traditional deep neural networks to identify DeepFake videos. As Face X-ray [18] demonstrates superior performance and robustness, recent DeepFake video detection techniques concentrate on uncovering more reliable signatures produced by GANs to enhance detection performance. Concurrently, [16] introduces deep Laplacian of Gaussian and the loss of isolated manipulated faces to bolster the generalizability of DeepFake video tasks.

Recent research has concentrated on developing robust DeepFake detection models for compressed videos, as seen in [20][31][32]. The frequency component analysis method is employed to uncover intrinsic features and enhance performance under compression settings [31]. However, frequency-aware features may prove ineffective under high compression (with high-frequency reduction) or noisy conditions (with high-frequency amplification). The $F^{3}$-Net [32] selects two complementary frequency bands as clues, devising a novel network to learn frequency-aware features that reveal subtle forgery artifacts. Specifically, Frequency-aware Image Decomposition (FAD) is designed to learn subtle forgery patterns, while Local Frequency Statistics (LFS) primarily extracts high-level semantic features. This approach improves performance for low-quality inputs. A recent development in DeepFake video detection, the Anti-DeepFake Transformer (ADT), is proposed in [19], with robustness confirmed through cross-dataset evaluation. Recent studies [33][34][35] have highlighted the vulnerability of DeepFake detectors to adversarial perturbations. Therefore, adversarial defenses for DeepFake detection, such as [36], have been attracting attention recently. The work in [36] presents an effective solution that leverages statistical inference on CNNs to achieve better robustness to adversarial examples.

All of these DeepFake video detection models, however, assume that the input facial sequence is reliable and well-detected, since current state-of-the-art face detectors show promising performance. An attacker could therefore keep manipulated faces from being caught by DeepFake video detectors by crafting adversarial examples against face detectors, which form the first stage of the pipeline in all DeepFake detection techniques. Numerous studies have demonstrated the effectiveness of adversarial attacks on face detectors [37]. Several recent adversarial perturbation strategies [38][39][40][41] have been proposed, potentially rendering face detectors ineffective. For instance, the methods introduced in [37] and [29] indicate that the detection rate can decay to less than $10\%$, implying that $90\%$ of the facial images in a face sequence could be invalid. These perturbed DeepFake videos can yield noisy face sequences with many invalid facial images, leading to unintended temporal feature jittering in temporal-clue-aware methods [20][19][31]. These temporal artifacts can significantly degrade their performance, while invalid facial images may also diminish the effectiveness of frame-level DeepFake video detection schemes [6][18] because the final decision for a video is based on majority voting.

Figure 1: Example of the detected faces from two videos using RetinaFace (top) and Dlib (bottom).

TABLE I: The confidence range of the detected faces using RetinaFace and Dlib for 200 videos sampled from FF++ [1]. Det. and Conf. stand for the face detector used and the confidence range of the detected faces, respectively.

Det./Conf.	[0,0.33]	(0.33,0.66]	(0.66,1]	Total
RetinaFace (raw)	5	19	176	200
RetinaFace (c23)	6	28	166	200
RetinaFace (c40)	8	43	149	200
Dlib (raw)	0	13	187	200
Dlib (c23)	0	18	182	200
Dlib (c40)	2	33	165	200

Even video compression can reduce the detection rate of face detectors. We randomly select 200 videos from the FF++ dataset [1], consider three compression settings (raw, c23, c40), and extract 16 frames from each video for face detection analysis. We employ state-of-the-art face detection tools, namely RetinaFace [42] and Dlib [43], to substantiate our observations. Table I presents the face detection outcomes for the sampled videos. Notably, under c40 compression, the detection confidence of 8 and 2 videos using RetinaFace [42] and Dlib [43], respectively, falls within the $[0,0.33]$ range, implying that the faces in these videos are largely mis-detected. Additionally, even for uncompressed videos, 19 videos processed with RetinaFace fall within the $(0.33,0.66]$ confidence range, highlighting the imperfections of face detectors. This raises a question: are current DeepFake detection methods robust to such noisy face sequences? The answer is negative. We simply replace $40\%$ of the facial images with background images in the testing set of FF++ [1] under the raw setting and evaluate the performance using Xception [6]. Unsurprisingly, the accuracy dropped significantly after replacement. An effective solution to deal with the issues raised by noisy face sequences for DeepFake video detection is highly desired.

In light of the escalating threat posed by various malicious attacks on face detectors that aim to undermine their reliability, this paper presents a pioneering Graph-Regularized Attentive Convolutional Entanglement (GRACE) with Laplacian Smoothing learning approach. GRACE leverages contextual features in both temporal and spatial domains to effectively detect DeepFake videos under noisy face sequences. We meticulously incorporate sparsity regularization into our model to prioritize the features of valid face images within the noisy face sequence. By employing the proposed Feature Entanglement (FE) technique, an affinity matrix is constructed to amalgamate the spatiotemporal features, ensuring that each node possesses at least one feature descriptor originating from valid face images. Ultimately, Graph Laplacian (GL) smoothing regularization is ingeniously integrated into the Graph Convolutional Network (GCN) to further suppress noisy nodes, thereby significantly enhancing the performance of DeepFake video detection. The main contributions of this paper are three-fold:

• We propose a novel GRACE with a Laplacian Smoothing learning framework that exploits contextual features in both temporal and spatial domains for robust DeepFake video detection under noisy face sequences. To the best of our knowledge, this is the first work to address the issue of unreliable face sequences for DeepFake video detection.

• We introduce a Feature Entanglement (FE) mechanism to construct an affinity matrix that mixes the spatiotemporal features together, ensuring each node contains at least one feature from valid face images. This approach effectively mitigates the impact of invalid facial images in the noisy face sequence.

• We propose a GL smoothing regularizer in the GCN to filter the noisy nodes further and improve the performance of DeepFake video detection. Comprehensive experiments demonstrate that our method achieves state-of-the-art performance, especially under challenging scenarios with unreliable and noisy face sequences.

The rest of this paper is organized as follows. Section 2 presents the proposed GRACE architecture design. In Section 3, the superiority of GRACE over benchmark methods is experimentally demonstrated. Finally, conclusions are drawn in Section 4.

2 Proposed Graph-Regularized Attentive Convolutional Entanglement

2.1 Overview of the Proposed Method

The flowchart of the proposed GRACE is illustrated in Fig. 2. First, a face detector extracts facial images from each video frame. A CNN-based backbone network then extracts high-level semantic features from the spatial domain of the acquired facial parts, as displayed in the center of Fig. 2. Using the extracted spatial features, the spatial and temporal representations at frame $n$ and location $(i,j)$ across all feature maps, ${\bm{X}}\in\mathbb{R}^{d\times c}$ with $d=N\times w\times h$, can be obtained for the face sequence, which potentially includes some invalid faces. This feature representation captures feature responses at frame $n$ and location $(i,j)$ across all frames, thereby integrating temporal information, as shown in Fig. 2.

To augment the correlation between the spatial and temporal feature representation ${\bm{X}}$ acquired in the previous step, we introduce a novel Feature Entanglement (FE) with sparse constraint, denoted as ${\bm{X}}_{\textrm{FE}}=G_{\textrm{FE}}({\bm{X}})\in\mathbb{R}^{d\times d}$ , which carefully embeds both temporal and spatial features into its graph representation by affinity matrix ${\bm{X}}_{\textrm{FE}}$ from original feature ${\bm{X}}$ . In highly noisy face sequences, the number of invalid faces could be more than that of valid ones. Therefore, the essential features could be relatively fewer, motivating us to introduce the sparsity constraint into our GRACE to focus on those essential features. Then, to efficiently discern the importance of the graph representation ${\bm{X}}_{\textrm{FE}}$ , we introduce the GCN to capture the contextual features between nodes (spatiotemporal features) in ${\bm{X}}$ . To further remove the noisy nodes from the original ${\bm{X}}_{\text{FE}}$ , Graph Laplacian is judiciously adapted to each layer of the GCN for better performance under noisy face sequences. Finally, a softmax classifier is connected to the outcome of GCN to evaluate the authenticity of the supplied facial parts.

2.2 Feature Entanglement with Sparse Constraint

We develop a method inspired by spatiotemporal feature extraction [44][45]. Traditional CNNs serve as the backbone network to obtain the spatial feature representation ${\bm{X}}^{n}\in\mathbb{R}^{c\times h\times w}$ for each frame ${\bm{y}}_{d}^{n}$ of the video. Assuming that the size of the extracted feature map is $c\times h\times w$, the spatial feature representation of a specific video at location $(i,j)$ via the backbone network can be vectorized into ${\bm{x}}_{n,i,j}=[x_{(n,i,j)}^{1},x_{(n,i,j)}^{2},...,x_{(n,i,j)}^{c}]^{T}\in\mathbb{R}^{c\times 1}$, where $c$ is the number of channels in the extracted spatial feature map and $i=1,...,w$, $j=1,...,h$. Then, letting $d=Nwh$, we create the feature context ${\bm{X}}\in\mathbb{R}^{d\times c}$ based on location-wise feature concatenation, as follows:

{\bm{X}}=\begin{bmatrix}{\bm{x}}^{1}_{1,1,1}&{\bm{x}}^{2}_{1,1,1}&\cdots&{\bm{x}}^{c}_{1,1,1}\\ {\bm{x}}^{1}_{1,1,2}&{\bm{x}}^{2}_{1,1,2}&\cdots&{\bm{x}}^{c}_{1,1,2}\\ \vdots&\vdots&&\vdots\\ {\bm{x}}^{1}_{1,h,w}&{\bm{x}}^{2}_{1,h,w}&\cdots&{\bm{x}}^{c}_{1,h,w}\\ \vdots&\vdots&&\vdots\\ {\bm{x}}^{1}_{N,h,w}&{\bm{x}}^{2}_{N,h,w}&\cdots&{\bm{x}}^{c}_{N,h,w}\end{bmatrix},

(1)

where ${\bm{X}}$ represents the spatiotemporal feature.
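As a minimal sketch of Eq. (1), suppose the backbone output for one video is stored as a NumPy array of shape $(N, c, h, w)$. This array layout, the helper name `build_feature_context`, and the toy sizes are illustrative assumptions rather than details of the released implementation. Under these assumptions, the feature context can be formed by moving the channel axis last and flattening the frame and location axes:

```python
import numpy as np

def build_feature_context(feature_maps: np.ndarray) -> np.ndarray:
    """Flatten per-frame feature maps of shape (N, c, h, w) into X of shape (N*h*w, c)."""
    n_frames, channels, height, width = feature_maps.shape
    # Move channels last so that each (frame, row, column) location owns one c-dim vector.
    per_location = np.transpose(feature_maps, (0, 2, 3, 1))
    # Row order is frame-major, then row, then column, as in the ordering of Eq. (1).
    return per_location.reshape(n_frames * height * width, channels)

rng = np.random.default_rng(0)
feature_maps = rng.standard_normal((4, 8, 3, 3))  # hypothetical N=4, c=8, h=w=3
X = build_feature_context(feature_maps)
print(X.shape)  # (36, 8)
```

Each row of `X` is the $c$-dimensional descriptor of one spatiotemporal location, so rows from invalid faces and rows from valid faces sit side by side in the same matrix.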

Efficiently extracting joint spatial and temporal feature representations from ${\bm{X}}$ requires handling the long distances between the $n$-th and $m$-th feature vectors, ${\bm{x}}^{n}$ and ${\bm{x}}^{m}$, especially when $m$ and $n$ are far apart. Directly adopting the Transformer [44][46] could incur high computational complexity, since a large number of tokens is needed to capture such long-range correlations.

To address these issues and efficiently exploit joint spatial and temporal feature representations, we propose a novel concept, i.e., FE, which is exactly an affinity matrix, creating a graph representation that encapsulates the relationships between nodes—each node representing entangled spatial and temporal features based on edges. Every element (node) in ${\bm{X}}_{\textrm{FE}}$ intertwines spatial and temporal information, thereby facilitating graph neural networks to learn the relationships among spatiotemporal features more effectively, in which the features of the first and last frames might still have a link. The feature entanglement ${\bm{X}}_{\textrm{FE}}$ is defined as follows:

\mathbf{X}_{\text{FE}}=G_{\textrm{FE}}({\bm{X}})=\mathbf{X}\mathbf{X}^{T},

(2)

where $\mathbf{X}_{\text{FE}}\in\mathbb{R}^{d\times d}$ is the affinity matrix obtained through feature entanglement. Consider, for instance, the element $\mathbf{X}_{\text{FE}}(1,wh)$, which is computed by taking the inner product of the first and the $wh$-th rows of $\mathbf{X}$. Following the row ordering of Eq. (1), this element encapsulates the correlation, across all $c$ channels, between the spatial features at location $(1,1)$ and those at location $(h,w)$ in the first frame; elements whose two rows belong to different frames link features across time in the same way. By constructing the affinity matrix $\mathbf{X}_{\text{FE}}$ in this manner, we effectively capture the spatiotemporal dependencies within the feature representation. However, learning discriminative features from $\mathbf{X}_{\text{FE}}$ can be challenging, particularly in the presence of highly noisy face sequences, where a significant portion of the facial images may be invalid. To address this issue, we introduce a sparsity constraint as a regularization term in the learning objective, formulated as the $\ell_{1}$ norm of $\mathbf{X}$, denoted by $\|\mathbf{X}\|_{1}$. By enforcing sparsity on $\mathbf{X}$, we encourage the model to focus on the most informative features, thereby enhancing its robustness to noisy face sequences and improving its effectiveness in DeepFake video detection.
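The entanglement of Eq. (2) and the sparsity term can be illustrated with a small NumPy sketch. The function names and the toy matrix below are hypothetical, and the $\ell_{1}$ term is computed entry-wise, which is our reading of the regularizer:

```python
import numpy as np

def feature_entanglement(X: np.ndarray) -> np.ndarray:
    """Affinity matrix of Eq. (2): entry (p, q) is the inner product of rows p and q of X."""
    return X @ X.T

def l1_penalty(X: np.ndarray) -> float:
    """Entry-wise l1 norm of X, used here as the sparsity regularizer."""
    return float(np.abs(X).sum())

# Hypothetical toy input: d=4 nodes (rows) with c=3 channels each.
X = np.array([[1.0, 0.0, 2.0],
              [0.0, -1.0, 1.0],
              [3.0, 0.0, 0.0],
              [0.0, 0.0, 0.0]])
X_fe = feature_entanglement(X)
print(X_fe.shape)     # (4, 4)
print(X_fe[0, 1])     # 2.0
print(l1_penalty(X))  # 8.0
```

Note that an all-zero row of `X`, such as the last one, produces an all-zero row and column in `X_fe`, so a node with no active features has no edges in the resulting graph.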

2.3 Proposed GCN with Graph Laplacian

In the previous subsection, we introduced the FE with sparse constraint, a crucial component of GRACE that aims to efficiently exploit joint spatial and temporal feature representations. In this subsection, we describe how the GCN with Graph Laplacian processes the resulting graph representation and discuss how it addresses the challenges associated with unreliable face sequences in DeepFake detection.

Graph Convolutional Networks (GCNs) have emerged as a powerful tool for processing graph-structured data [47, 48], making them a suitable choice for handling the graph embedding obtained through feature entanglement. However, the noise level in each node can vary depending on the degree of distortion. For highly distorted or noisy nodes (i.e., those with entangled features primarily contributed by invalid facial images), it is beneficial to eliminate them to ensure stable and excellent performance in DeepFake video detection. To mitigate the impact of features from invalid facial images, we judiciously integrate the Graph Laplacian Smoothing Prior (GLSP), a well-established concept in Graph Signal Processing (GSP) [49], as a regularizer into the GCN to filter out highly noisy nodes [50]. It is important to note that GSP and GCN are distinct domains, with GSP focusing on the analysis and processing of signals defined on graphs, while GCN aims to learn representations by exploiting the graph structure. In this study, we ingeniously leverage the properties of GLSP from GSP and seamlessly incorporate it into the GCN framework, enabling end-to-end training without explicitly computing the eigendecomposition of the Graph Laplacian matrix.

Let $\mathbf{X}\in\mathbb{R}^{d\times c}$ denote the spatiotemporal feature matrix, where $d=Nwh$ is the number of spatiotemporal locations, with $N$ being the number of frames and $w$ and $h$ being the width and height of the feature maps, and $c$ is the feature dimension. This matrix encapsulates the spatiotemporal features extracted from the entire video sequence.

We construct affinity matrix ${\bm{A}}=\mathbf{X}_{\text{FE}}\in\mathbb{R}^{d\times d}$ through feature entanglement, where $\mathbf{A}_{ij}$ represents the edge weight between nodes $i$ and $j$ . This matrix captures the similarity between different nodes. The degree matrix $\mathbf{D}\in\mathbb{R}^{d\times d}$ is a diagonal matrix where $\mathbf{D}_{ii}=\sum_{j}\mathbf{A}_{ij}$ represents the sum of edge weights connected to each node. The Graph Laplacian matrix $\mathbf{L}=\mathbf{D}-\mathbf{A}$ captures the topological structure of the graph and the differences between nodes.
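The degree matrix and the Laplacian defined above take only a few lines to compute. The following sketch uses a hypothetical three-node affinity matrix and a helper name of our own choosing:

```python
import numpy as np

def degree_and_laplacian(A: np.ndarray):
    """Return the diagonal degree matrix D (row sums of A) and the Laplacian L = D - A."""
    D = np.diag(A.sum(axis=1))
    return D, D - A

# Hypothetical symmetric affinity matrix for three nodes.
A = np.array([[0.0, 2.0, 1.0],
              [2.0, 0.0, 0.0],
              [1.0, 0.0, 0.0]])
D, L = degree_and_laplacian(A)
print(np.diag(D))     # [3. 2. 1.]
print(L.sum(axis=1))  # [0. 0. 0.]
```

Every row of $\mathbf{L}$ sums to zero by construction, which is why a constant signal over the graph is mapped to zero and is treated as the smoothest possible signal.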

To further remove the redundancy among nodes, we apply adaptive thresholding to $\mathbf{A}$ to filter out weak or irrelevant connections. For each sample $i$, we compute the mean value of its feature entanglement matrix and keep only the elements that are greater than half of the mean value. The indices and values of these elements are then extracted to form the edge indices and edge weights of a sparse affinity matrix $\mathbf{A}^{(i)}$, as follows:

\mathbf{A}^{(i)}_{jk}=\begin{cases}\mathbf{X}_{\text{FE},jk}^{(i)},&\text{if }\mathbf{X}_{\text{FE},jk}^{(i)}>q\times\overline{\mathbf{X}_{\text{FE}}^{(i)}}\\ 0,&\text{otherwise}\end{cases},

(3)

where $\mathbf{A}^{(i)}_{jk}$ denotes the edge weight between nodes $j$ and $k$ in the adjacency matrix of sample $i$, $\overline{\mathbf{X}_{\text{FE}}^{(i)}}$ is the mean value of the feature entanglement matrix of sample $i$, and $q$ is the factor controlling how strictly the entries are filtered. In this study, $q=0.5$ for all experiments. This thresholding operation helps to focus on the most important connections and reduces the computational burden of the GCN.
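The thresholding of Eq. (3) can be sketched as follows. The function name, the returned edge-list layout (a $2\times E$ index array plus a weight vector), and the toy matrix are assumptions made for illustration:

```python
import numpy as np

def sparsify_affinity(X_fe: np.ndarray, q: float = 0.5):
    """Keep entries of X_fe strictly greater than q times its mean, as in Eq. (3)."""
    threshold = q * X_fe.mean()
    mask = X_fe > threshold
    A = np.where(mask, X_fe, 0.0)
    edge_index = np.argwhere(mask).T  # shape (2, E): first row sources, second row targets
    edge_weight = X_fe[mask]          # weights in the same row-major order as edge_index
    return A, edge_index, edge_weight

# Hypothetical affinity matrix for one sample; its mean is 12/9, so the threshold is 2/3.
X_fe = np.array([[4.0, 1.0, 0.0],
                 [1.0, 2.0, 0.5],
                 [0.0, 0.5, 3.0]])
A, edge_index, edge_weight = sparsify_affinity(X_fe, q=0.5)
print(edge_index.shape)  # (2, 5)
print(A[1, 2])           # 0.0
```

In this toy case the two weak links of weight 0.5 fall below the threshold and are dropped, while the five stronger entries survive as edges.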

The resulting sparse adjacency matrix $\mathbf{A}^{(i)}$, along with the node features $\mathbf{X}^{(i)}$, serves as the input to the GCN for learning the graph structure and node representations. By combining the FE matrix ${\bm{X}}_{\text{FE}}$ and the sparse adjacency matrix ${\bm{A}}$, our method leverages the advantages of both representations, capturing rich spatiotemporal correlations while focusing on the most informative connections for DeepFake detection.

Graph Convolutional Networks (GCNs) have shown remarkable performance in various tasks by leveraging the power of graph-structured data. The core operation of GCNs in the $l$ -th layer can be described as follows:

\mathbf{Z}^{(l+1)}=\sigma(\hat{\mathbf{D}}^{-\frac{1}{2}}\hat{\mathbf{A}}\hat{\mathbf{D}}^{-\frac{1}{2}}\mathbf{Z}^{(l)}\mathbf{W}^{(l)}),

(4)

where $\hat{\mathbf{A}}=\mathbf{A}+\mathbf{I}_{d}$ is the adjacency matrix with self-loops, $\hat{\mathbf{D}}_{ii}=\sum_{j=1}^{d}\hat{\mathbf{A}}_{ij}$ is the corresponding degree matrix, $\mathbf{W}^{(l)}$ is the weight matrix of the $l$ -th layer, and $\sigma$ is the activation function. This equation describes the aggregation and transformation of node features based on the graph structure.
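A dense NumPy sketch of one propagation step of Eq. (4) is given below. It assumes nonnegative edge weights (so that all degrees are positive) and uses ReLU as a stand-in for $\sigma$; the helper names, the toy graph, and the random weights are illustrative only, and a practical implementation would use sparse operations:

```python
import numpy as np

def normalized_adjacency(A: np.ndarray) -> np.ndarray:
    """Return D_hat^{-1/2} (A + I) D_hat^{-1/2}; assumes nonnegative edge weights."""
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    # Scaling rows and columns is equivalent to multiplying by the diagonal matrices.
    return d_inv_sqrt[:, None] * A_hat * d_inv_sqrt[None, :]

def gcn_layer(A: np.ndarray, Z: np.ndarray, W: np.ndarray) -> np.ndarray:
    """One propagation step of Eq. (4), with ReLU standing in for the activation sigma."""
    return np.maximum(normalized_adjacency(A) @ Z @ W, 0.0)

A = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])   # hypothetical 3-node path graph
rng = np.random.default_rng(0)
Z0 = rng.standard_normal((3, 4))  # 3 nodes with 4 input features each
W0 = rng.standard_normal((4, 2))  # layer weights mapping 4 features to 2
Z1 = gcn_layer(A, Z0, W0)
print(Z1.shape)  # (3, 2)
```

Each output row is a degree-weighted average of the node's own features and those of its neighbours, followed by the linear map and the nonlinearity.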

However, the performance of GCNs may be compromised when dealing with highly noisy scenarios, limiting their applicability in real-world situations. To address this challenge, we propose the incorporation of Graph Laplacian regularization into GCNs to enhance their robustness and improve their performance in the presence of significant noise.

Given an undirected graph $G=(V,E)$ , where $V$ is the set of nodes and $E$ is the set of edges, the Graph Laplacian matrix $\mathbf{L}$ is defined as:

\mathbf{L}=\mathbf{D}-\mathbf{A},

(5)

where $\mathbf{D}$ is the degree matrix and $\mathbf{A}$ is the adjacency matrix of the graph $G$ . The degree matrix $\mathbf{D}$ is a diagonal matrix, where $\mathbf{D}_{ii}$ equals the degree of node $i$ in $G$ . It is worth noting that the Graph Laplacian matrix $\mathbf{L}$ is different from the matrix $\hat{\mathbf{L}}$ used in the GCN propagation rule, which will be discussed later.

As a real symmetric matrix, the Graph Laplacian matrix $\mathbf{L}$ possesses an eigendecomposition:

\mathbf{L}=\mathbf{U}\Lambda\mathbf{U}^{T},

(6)

where $\mathbf{U}$ is the matrix of eigenvectors and $\Lambda$ is the diagonal matrix of eigenvalues. The eigenvalues of $\mathbf{L}$ represent the frequencies of the graph signals, with smaller eigenvalues corresponding to lower frequencies and larger eigenvalues corresponding to higher frequencies. In GSP, the Graph Laplacian matrix is often used as a low-pass filter to smooth signals defined on graphs, effectively suppressing high-frequency noise while preserving low-frequency information. Although we do not explicitly compute the eigendecomposition of $\mathbf{L}$ in our implementation, it is essential to understand that the Graph Laplacian matrix inherently encodes the spectral properties of the graph, which enables effective feature smoothing and noise suppression.
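The link between eigenvalues and graph frequencies can be made concrete through the quadratic form of $\mathbf{L}$. The identity below is a standard result from graph signal processing, stated here as supporting material under the assumption that $\mathbf{A}$ is symmetric with nonnegative weights and $\mathbf{D}_{ii}=\sum_{j}\mathbf{A}_{ij}$; the symbol $\mathbf{z}\in\mathbb{R}^{d}$ denotes an arbitrary graph signal:

```latex
\begin{aligned}
\mathbf{z}^{T}\mathbf{L}\mathbf{z}
&=\sum_{i}\mathbf{D}_{ii}z_{i}^{2}-\sum_{i,j}\mathbf{A}_{ij}z_{i}z_{j}\\
&=\frac{1}{2}\sum_{i,j}\mathbf{A}_{ij}\left(z_{i}-z_{j}\right)^{2}\;\geq\;0 .
\end{aligned}
```

For a unit-norm eigenvector $\mathbf{u}_{k}$ with eigenvalue $\lambda_{k}$, this gives $\lambda_{k}=\mathbf{u}_{k}^{T}\mathbf{L}\mathbf{u}_{k}$, so a small eigenvalue means that the eigenvector changes little across strongly connected nodes, whereas a large eigenvalue means that it oscillates between them.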

In practice, we calculate the Graph Laplacian matrix $\mathbf{L}$ using the degree matrix $\mathbf{D}$ and the adjacency matrix $\mathbf{A}$ , which are obtained from the graph structure. This calculation does not involve the explicit computation of eigenvalues and eigenvectors. However, the resulting matrix $\mathbf{L}$ still possesses the spectral properties that enable effective feature smoothing and noise suppression.

To integrate the Graph Laplacian smoothing prior into the GCN propagation rule, we propose a modified version of the Graph Laplacian matrix that incorporates the graph’s structural information and enables effective feature smoothing and noise suppression. Our proposed Graph Laplacian matrix, denoted as $\hat{\mathbf{L}}$ , is defined as follows:

\hat{\mathbf{L}}=\hat{\mathbf{D}}^{-\frac{1}{2}}\hat{\mathbf{A}}\hat{\mathbf{D}}^{-\frac{1}{2}},

(7)

where $\hat{\mathbf{A}}=\mathbf{A}+\mathbf{I}_{d}$ is the adjacency matrix with self-loops, $\hat{\mathbf{D}}$ is the corresponding degree matrix with $\hat{\mathbf{D}}_{ii}=\sum_{j=1}^{d}\hat{\mathbf{A}}_{ij}$ , and $\mathbf{I}_{d}$ is the identity matrix.

The matrix $\hat{\mathbf{L}}$ is a normalized version of the Graph Laplacian matrix, which captures the graph’s structure and enables the smoothing of node features. By incorporating self-loops into the adjacency matrix, we ensure that each node’s feature is considered during the smoothing process, enhancing the stability and expressiveness of the learned representations.

To integrate the Graph Laplacian smoothing prior into the GCN propagation rule, we modify the propagation equation as follows:

\mathbf{Z}^{(l+1)}=\sigma(\hat{\mathbf{L}}\mathbf{Z}^{(l)}\mathbf{W}^{(l)}),

(8)

where $\mathbf{Z}^{(l)}$ represents the node features at layer $l$ , $\mathbf{W}^{(l)}$ is the learnable weight matrix, and $\sigma$ is the activation function.

The term $\hat{\mathbf{L}}\mathbf{Z}^{(l)}$ effectively applies the Graph Laplacian smoothing prior to the node features. By multiplying the node features with the normalized Graph Laplacian matrix $\hat{\mathbf{L}}$ , we achieve a smoothing effect that takes into account the graph’s structure. This operation allows the model to leverage the connectivity information encoded in the graph to refine the node features and suppress high-frequency noise.

It is important to note that although we do not explicitly compute the eigendecomposition of $\hat{\mathbf{L}}$ in our implementation, the matrix $\hat{\mathbf{L}}$ inherently possesses the spectral properties of the Graph Laplacian. By using $\hat{\mathbf{L}}$ in the GCN propagation rule, we implicitly leverage these spectral properties to achieve effective feature smoothing and noise suppression.

The effectiveness of the Graph Laplacian matrix $\hat{\mathbf{L}}$ as a low-pass filter can be understood by examining its eigendecomposition:

\hat{\mathbf{L}}=\mathbf{U}\Lambda\mathbf{U}^{T},

(9)

where $\mathbf{U}$ is the matrix of eigenvectors and $\Lambda$ is the diagonal matrix of eigenvalues. Because $\hat{\mathbf{L}}=\mathbf{I}_{d}-\tilde{\mathbf{L}}_{\text{sym}}$, where $\tilde{\mathbf{L}}_{\text{sym}}=\hat{\mathbf{D}}^{-\frac{1}{2}}(\hat{\mathbf{D}}-\hat{\mathbf{A}})\hat{\mathbf{D}}^{-\frac{1}{2}}$ is the symmetric normalized Laplacian of the graph with self-loops, a signal component with graph frequency $\tilde{\lambda}$ (an eigenvalue of $\tilde{\mathbf{L}}_{\text{sym}}$) is scaled by $1-\tilde{\lambda}$ when multiplied by $\hat{\mathbf{L}}$. The largest eigenvalues of $\hat{\mathbf{L}}$ therefore correspond to the lowest graph frequencies. For nonnegative edge weights, $\tilde{\lambda}$ lies in $[0,2)$, so by multiplying the node features with $\hat{\mathbf{L}}$ we essentially apply a low-pass filter to the graph signals: the lowest-frequency component ($\tilde{\lambda}=0$) is preserved, while all other components are shrunk in magnitude. This operation effectively suppresses noise and promotes the smoothness of the node features across the graph. In practice, we compute $\hat{\mathbf{L}}$ using the adjacency matrix with self-loops $\hat{\mathbf{A}}$ and the corresponding degree matrix $\hat{\mathbf{D}}$, as shown in Equation (7). This computation can be efficiently performed using sparse matrix operations, without the need for explicit eigendecomposition.
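The filtering behaviour of $\hat{\mathbf{L}}$ can be checked numerically on a small example. The sketch below is only an illustration on a hypothetical four-node path graph; it computes the eigendecomposition explicitly for inspection, which the method itself does not require. It expresses $\hat{\mathbf{L}}$ through the symmetric normalized Laplacian of the self-looped graph (called `L_sym` here, our notation) and verifies that a component with Laplacian eigenvalue $\lambda$ is scaled by $1-\lambda$:

```python
import numpy as np

# Hypothetical 4-node path graph with unit edge weights.
A = np.array([[0.0, 1.0, 0.0, 0.0],
              [1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 0.0, 1.0],
              [0.0, 0.0, 1.0, 0.0]])
A_hat = A + np.eye(4)                      # adjacency with self-loops
deg = A_hat.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))

L_hat = D_inv_sqrt @ A_hat @ D_inv_sqrt                   # propagation matrix of Eq. (7)
L_sym = D_inv_sqrt @ (np.diag(deg) - A_hat) @ D_inv_sqrt  # normalized Laplacian, self-looped graph

# eigh returns eigenvalues in ascending order, so lam[0] is the lowest graph frequency.
lam, U = np.linalg.eigh(L_sym)

z = np.array([1.0, -1.0, 1.0, -1.0])       # rapidly oscillating (noise-like) signal
coeff_before = U.T @ z                     # spectral coefficients before filtering
coeff_after = U.T @ (L_hat @ z)            # spectral coefficients after one multiplication

print(np.allclose(L_hat, np.eye(4) - L_sym))                 # True
print(np.allclose(coeff_after, (1.0 - lam) * coeff_before))  # True
print(np.linalg.norm(L_hat @ z) < np.linalg.norm(z))         # True
```

The oscillating signal loses most of its energy after a single multiplication, which is the smoothing effect exploited in each GCN layer.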

To provide a rigorous theoretical analysis of the effectiveness of our proposed Graph Laplacian regularization, we can examine the convergence and generalization properties of the method. Let $\mathbf{Z}^{*}$ denote the optimal node features that minimize the loss function $\mathcal{L}(\mathbf{Z})$ . We can show that by incorporating the Graph Laplacian regularization term $\hat{\mathbf{L}}\mathbf{Z}^{(l)}$ into the GCN propagation rule, the learned node features $\mathbf{Z}^{(l)}$ converge to $\mathbf{Z}^{*}$ under mild assumptions on the graph structure and the loss function. Specifically, if the graph is connected and the loss function is convex and smooth, the iterative updates of the node features using Equation (8) will converge to the optimal solution $\mathbf{Z}^{*}$ (see Theorem 1). Furthermore, the generalization error of the learned node features can be bounded by the graph Laplacian regularization term, indicating that the proposed method effectively controls the model complexity and prevents overfitting to noisy or irrelevant features.

By incorporating the Graph Laplacian smoothing prior into the GCN, our method effectively addresses the challenges posed by noisy scenarios. The smoothing operation helps to mitigate the impact of noisy or irrelevant features, enhancing the robustness and generalization ability of the learned representations.

Finally, the output features can be obtained by passing the final layer’s features $\mathbf{Z}^{(L)}$ through a fully connected layer (FC):

\mathbf{Z}=\sigma({\bm{W}}_{\text{out}}\mathbf{Z}^{(L)}\mathbf{W}^{(L)})

(10)

where ${\bm{W}}_{\text{out}}\in\mathbb{R}^{g_{\text{dim}}\times n_{\text{out}}}$ denotes the weight of the FC, and $n_{\text{out}}$ and $g_{\text{dim}}$ denote the number of neurons of the FC and the embedding dimension of the GCN, respectively. The prediction is then obtained via

\mathbf{\hat{Y}}=\text{Softmax}({\bm{W}}_{\text{cls}}{\bm{Z}})

(11)

where ${\bm{W}}_{\text{cls}}\in\mathbb{R}^{n_{\text{out}}\times n_{\text{cls}}}$ denotes the weight of the classification layer and $n_{\text{cls}}$ is the number of classes.
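The classification step in Equation (11) can be sketched in a few lines of plain Python. This is a minimal sketch, not the authors' implementation: the function names, the row-major weight layout (one row of weights per class), and the toy numbers are assumptions chosen for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    shift = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(features, w_cls):
    """Linear classifier followed by softmax.

    features: list of n_out floats (the FC output for one video).
    w_cls: n_cls rows, each holding n_out weights (assumed layout).
    """
    logits = [sum(w * f for w, f in zip(row, features)) for row in w_cls]
    return softmax(logits)


# Toy example with n_out = 3 and n_cls = 2 (hypothetical values).
probs = classify([0.5, -1.0, 2.0], [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
```

Here the two logits are 0.5 and 2.0, so the second class receives the larger probability.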

During the training phase, the cross-entropy loss is employed to optimize the model parameters:

\mathcal{L}=-\sum_{c=1}^{n_{\text{cls}}}\mathbf{Y}_{c}\log(\mathbf{\hat{Y}}_{c})+\alpha\|{\bm{X}}\|_{1}

(12)

where $\mathbf{Y}$ is the one-hot encoded ground-truth label and $\alpha$ is the weight of the sparsity constraint; we set $\alpha=10^{-5}$ for all experiments in this study.
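The training objective, a cross-entropy term plus an $\ell_{1}$ sparsity penalty weighted by $\alpha$, can be sketched as follows. This is an illustrative sketch in the standard cross-entropy form $-\sum_{c}\mathbf{Y}_{c}\log\mathbf{\hat{Y}}_{c}$; the function name, the small `eps` guard, and the toy matrix standing in for ${\bm{X}}$ are assumptions.

```python
import math


def sparse_ce_loss(y_onehot, y_prob, x, alpha=1e-5, eps=1e-12):
    """Cross-entropy plus alpha times the entrywise L1 norm of x.

    y_onehot: one-hot ground-truth label (list of 0/1).
    y_prob: predicted class probabilities (softmax output).
    x: matrix (list of rows) receiving the sparsity penalty.
    eps: guard against log(0); an implementation assumption.
    """
    ce = -sum(y * math.log(p + eps) for y, p in zip(y_onehot, y_prob))
    l1 = sum(abs(v) for row in x for v in row)
    return ce + alpha * l1


# Toy values (hypothetical): true class is index 1, predicted with p = 0.75.
loss = sparse_ce_loss([0, 1], [0.25, 0.75], [[1.0, -2.0], [0.0, 0.0]])
```

With these toy values the penalty contributes $3\times 10^{-5}$ on top of $-\log 0.75$, which shows how small the sparsity term is relative to the classification term at $\alpha=10^{-5}$.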

By incorporating Graph Laplacian regularization into the GCN, our proposed method effectively addresses the challenges posed by noisy face sequences in DeepFake video detection. The Graph Laplacian smoothing helps to filter out noisy nodes and enhances the robustness and stability of the model. The combination of feature entanglement, sparse regularization, and Graph Laplacian regularization enables our method to make the most of the available valid information, suppress the influence of noisy features, and improve the correlation between relevant features. As a result, our approach achieves high-performance and robust DeepFake video detection, even in the presence of low-quality and noisy face sequences commonly encountered in real-world scenarios.

In summary, our proposed GCN with Graph Laplacian regularization effectively leverages the spectral properties of the Graph Laplacian matrix to achieve feature smoothing and noise suppression, without explicitly computing the eigendecomposition. By integrating the Graph Laplacian smoothing prior into the GCN propagation rule, our method enhances the robustness and generalization ability of the learned representations, making it highly suitable for detecting DeepFake videos in challenging real-world scenarios with noisy and unreliable face sequences. The theoretical analysis of the convergence and generalization properties of our method further validates its effectiveness and provides a solid foundation for its application.

3 Experimental Results

3.1 Experimental Configuration

The robustness validation of the proposed method is the core of our investigation, particularly when applied to noisy face sequences containing many invalid faces. To achieve this, the use of representative benchmark datasets is essential. Therefore, we selected three well-established benchmark datasets for performance evaluation: FF++ [1], Celeb-DFv2 [3], and the large-scale DFDC dataset [2]. The FF++ dataset [1] comprises four distinct classes of manipulation methods: 1) DeepFakes (DF), 2) Face2Face (F2F), 3) FaceSwap (FS), and 4) NeuralTextures (NT). For each class, a set of 1,000 original videos was used to generate 1,000 manipulated versions, resulting in a total of 1,000 authentic and 4,000 doctored videos. The Celeb-DF dataset [3] contains 590 original videos and 5,639 manipulated counterparts, generated using improved generative adversarial networks at a resolution of $256\times 256$. To enhance the quality of the manipulated videos, a Kalman filter is employed in Celeb-DF [3] to mitigate temporal inconsistencies between successive frames. In addition to FF++ and Celeb-DF, we also utilize the DeepFake Detection Challenge (DFDC) dataset [2] to further validate the effectiveness of our proposed method. The DFDC dataset, created by Facebook in collaboration with other organizations, is a large-scale dataset designed to facilitate the development of DeepFake detection algorithms. It consists of over 100,000 videos, containing a mix of authentic and manipulated content generated using various state-of-the-art face swapping and facial reenactment techniques, ensuring a diverse and challenging set of DeepFakes for evaluation.

Training Hyperparameters of Our GRACE. To ensure a balanced performance appraisal, the Celeb-DF [3], FF++ [1], and DFDC [2] datasets were divided into training, validation, and testing sets following an $8:1:1$ ratio. In line with our objective to ascertain the efficacy of GRACE in the presence of unstable face detectors, we trained separate GRACE models independently on each dataset. During the training phase, the Adam optimizer [51] was utilized with an initial learning rate of $10^{-4}$ and a step learning-rate decay schedule. We employed the 53-layer Cross Stage Partial Network (CSPNet) [52] as the backbone network; note that any CNN could serve as the backbone of GRACE. The standard GCN with our Graph Laplacian was stacked for $g_{n}$ layers, with $g_{n}=8$ and embedding size $g_{\text{dim}}=400$ as the default setting in this study. The number of neurons of the last fully connected layer, $n_{\text{out}}$, is $2048$ in our experiments. All facial images were resized to $144\times 144$ during both the training and inference stages. Standard data augmentation techniques, such as random noise, cropping, and flipping, were adopted during the training phase. Training ran for 200 epochs, with the learning rate decayed by a factor of 0.1 every 100 epochs. We randomly sampled $N=16$ successive facial images to form the input tensor for our experiments. All comparison methods, including the proposed GRACE, were trained on the training set and evaluated on the testing set.
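The step learning-rate schedule described above can be written out explicitly. The following is a minimal sketch assuming the decay multiplies the rate by 0.1 at each 100-epoch boundary; the function name and this reading of the schedule are assumptions rather than the authors' training code.

```python
def step_lr(epoch, base_lr=1e-4, gamma=0.1, step_size=100):
    """Learning rate at a given epoch under step decay.

    base_lr, gamma, and step_size follow the values stated in the text;
    the exact decay boundary convention is an assumption.
    """
    return base_lr * gamma ** (epoch // step_size)


# Learning rate for each of the 200 training epochs.
schedule = [step_lr(e) for e in range(200)]
```

Under this reading, epochs 0 to 99 use $10^{-4}$ and epochs 100 to 199 use $10^{-5}$.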

Training Hyperparameters of Peer Methods. For performance evaluation, we compared our proposed method with several state-of-the-art DeepFake detection techniques, including MesoNet [5], Xception [6], $F^{3}$-net [32], RECCE [27], DFIL [28], UCF [30], CORE [26], and TALL-Swin [53]. The frame-level approaches, namely MesoNet, Xception, $F^{3}$-net, RECCE, DFIL, UCF, and CORE, were trained using the same strategy as described previously, with their default settings. However, the learning rates of Xception [6] and UCF [30] were adjusted to $2\times 10^{-4}$ for better performance. The video-based approach, TALL-Swin [53], was trained using its default settings. During the training phase, we randomly selected $N$ facial images from the training set. The final authenticity verdict for the input video was determined by averaging the $N$ prediction outcomes corresponding to the $N$ facial images extracted from the input video, using a temporally centered cropping strategy. For all other methods, the number of frames $N$ was set to $16$. The image size for Xception [6], $F^{3}$-net [32], RECCE [27], UCF [30], and CORE [26] is $256\times 256$, as suggested by their default settings, while those for DFIL [28] and TALL-Swin [53] are $299\times 299$ and $224\times 224$, respectively.
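The video-level verdict for frame-level detectors, obtained by averaging $N$ per-frame predictions, can be sketched as below. The function name, the decision threshold of 0.5, and the toy probabilities are assumptions for illustration.

```python
def video_verdict(frame_fake_probs, threshold=0.5):
    """Average per-frame fake probabilities into one video-level decision.

    Returns the mean probability and whether the video is flagged as fake.
    The 0.5 threshold is an assumed default, not taken from the paper.
    """
    mean_prob = sum(frame_fake_probs) / len(frame_fake_probs)
    return mean_prob, mean_prob >= threshold


# Toy sequence of N = 16 frame predictions (hypothetical values).
frame_probs = [0.9] * 10 + [0.2] * 6
mean_prob, is_fake = video_verdict(frame_probs)
```

Because every frame contributes equally to the mean, predictions made on invalid frames (for example background crops) enter the verdict with the same weight as predictions on valid faces.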

Settings in Inference Phase. To evaluate the model’s performance under the influence of an unstable face detector, we randomly replaced certain facial images with background segments, as determined by the masking ratio $m_{r}$ . We experimented with masking ratios ranging from 0.1 to 0.8 to assess the effectiveness of GRACE under varying levels of noise in the face sequences. For instance, with $N=16$ and $m_{r}=0.5$ , up to eight facial images could be replaced with background images in the corresponding frames, simulating real-world scenarios where face detection may be challenging or unreliable. In our experimental setup, we sampled $N=16$ frames from the middle portion of each video, following the same approach used during the training process. When $m_{r}=0.5$ , half of the 16 frames (i.e., 8) were randomly replaced with either background or completely black images. By varying the masking ratio, we evaluated the robustness and stability of each method under different levels of noise in the face sequences.
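The masking protocol can be sketched as a small helper that replaces a fraction $m_{r}$ of the $N$ frames with a blank placeholder. This is a minimal sketch under stated assumptions: the function name is hypothetical, exactly `round(N * m_r)` frames are replaced, and strings stand in for image tensors.

```python
import random


def mask_sequence(frames, m_r, blank, seed=None):
    """Replace round(len(frames) * m_r) randomly chosen frames with `blank`.

    frames: list of face crops (any objects).
    m_r: masking ratio in [0, 1].
    blank: placeholder for an invalid frame (e.g. a black image).
    Returns the noisy sequence and the sorted masked indices.
    """
    rng = random.Random(seed)
    n_mask = int(round(len(frames) * m_r))
    masked_idx = set(rng.sample(range(len(frames)), n_mask))
    noisy = [blank if i in masked_idx else f for i, f in enumerate(frames)]
    return noisy, sorted(masked_idx)


# N = 16 frames with m_r = 0.5, so 8 frames are replaced.
frames = [f"face_{i}" for i in range(16)]
noisy, idx = mask_sequence(frames, 0.5, blank="black", seed=0)
```

Fixing the seed makes the masked positions reproducible, which is convenient when comparing several detectors on the same noisy sequences.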

Furthermore, we assumed that each frame should contain at least one face to simulate adversarial attacks on face detectors in real-world scenarios. In cases where no face was detected in a frame, we replaced that frame with a black image, generating a noisy face sequence that allowed us to assess the robustness of GRACE under challenging conditions. Our experimental analysis employed three performance metrics: accuracy, Macro F1-Score, and Area Under the Receiver Operating Characteristic Curve (AUC). The Macro F1-Score accurately reflects the model’s performance under label imbalance situations. For simplicity, these metrics are referred to as Accuracy (Acc.), F1-Score, and AUC throughout the experimental sections.
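To make the role of the Macro F1-Score concrete, the sketch below computes it from scratch on toy labels (hypothetical data, not taken from the experiments). With a 4:1 fake-to-real ratio, as in FF++, a detector that labels every video as fake still reaches 0.8 accuracy, whereas the Macro F1-Score exposes the collapse.

```python
def macro_f1(y_true, y_pred, classes=(0, 1)):
    """Unweighted mean of per-class F1 scores, F1 = 2TP / (2TP + FP + FN)."""
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels: 1 = fake, 0 = real, with a 4:1 imbalance.
y_true = [1, 1, 1, 1, 0]
all_fake = [1, 1, 1, 1, 1]  # degenerate detector that always says "fake"
accuracy = sum(t == p for t, p in zip(y_true, all_fake)) / len(y_true)
score = macro_f1(y_true, all_fake)
```

In this toy case the fake class obtains an F1 of 8/9 and the real class an F1 of 0, giving a Macro F1-Score of 4/9 despite the 0.8 accuracy.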

3.2 Quantitative Results

TABLE II: Quantitative comparison on noisy face sequences under different masking ratios $m_{r}$ between the proposed GRACE and other state-of-the-art methods on the FF++, Celeb-DF, and DFDC datasets. The best performance is highlighted in red and the second-best in blue.

Method	$m_{r}$	FF++ [1]			Celeb-DF [3]			DFDC [2]
		ACC	F1	AUC	ACC	F1	AUC	ACC	F1	AUC
Xception [6]	0.0	0.925	0.894	0.972	0.861	0.806	0.910	0.953	0.910	0.981
	0.4	0.869	0.780	0.871	0.631	0.614	0.782	0.908	0.788	0.866
	0.8	0.814	0.594	0.654	0.398	0.389	0.604	0.864	0.598	0.647
$F^{3}$ -net [32]	0.0	0.950	0.928	0.986	0.965	0.957	0.993	0.957	0.921	0.986
	0.4	0.883	0.798	0.888	0.691	0.684	0.895	0.864	0.655	0.755
	0.8	0.818	0.599	0.662	0.418	0.407	0.664	0.850	0.539	0.595
RECCE [27]	0.0	0.938	0.911	0.979	0.941	0.925	0.985	0.940	0.872	0.973
	0.4	0.878	0.790	0.874	0.678	0.669	0.869	0.900	0.752	0.863
	0.8	0.817	0.599	0.655	0.414	0.404	0.648	0.861	0.579	0.648
UCF [30]	0.0	0.937	0.911	0.982	0.856	0.792	0.891	0.890	0.815	0.939
	0.4	0.875	0.790	0.882	0.626	0.607	0.642	0.871	0.733	0.812
	0.8	0.815	0.598	0.660	0.397	0.389	0.516	0.851	0.586	0.620
CORE [26]	0.0	0.948	0.925	0.984	0.953	0.940	0.989	0.950	0.903	0.977
	0.4	0.883	0.799	0.888	0.858	0.790	0.890	0.907	0.781	0.870
	0.8	0.818	0.601	0.663	0.764	0.572	0.661	0.863	0.595	0.651
TALL-Swin [53]	0.0	0.913	0.868	0.881	0.913	0.933	0.924	0.911	0.812	0.984
	0.4	0.867	0.767	0.740	0.847	0.789	0.825	0.872	0.758	0.786
	0.8	0.827	0.605	0.589	0.745	0.680	0.645	0.845	0.688	0.650
DFIL [28]	0.0	0.954	0.939	0.987	0.957	0.954	0.964	0.940	0.881	0.955
	0.4	0.876	0.808	0.893	0.695	0.684	0.825	0.886	0.720	0.813
	0.8	0.759	0.603	0.665	0.518	0.350	0.644	0.855	0.565	0.621
GRACE (Ours)	0.0	0.962	0.942	0.989	0.989	0.968	0.998	0.969	0.942	0.988
	0.4	0.958	0.936	0.987	0.970	0.920	0.998	0.969	0.940	0.988
	0.8	0.944	0.916	0.983	0.857	0.738	0.980	0.962	0.925	0.979

TABLE III: Comparison of different methods in terms of FLOPs (floating-point operations), MACs (multiply-accumulate operations), and number of parameters ($\#$ Params.).

Method	FLOPs (T)	MACs (T)	$\#$ Params (M)
Xception [6]	60.796	30.356	21.861
$F^{3}$ -net [31]	192.604	95.880	22.125
RECCE [27]	81.655	40.667	47.693
UCF [30]	180.738	90.087	46.838
CORE [26]	60.978	30.356	21.861
TALL-Swin [29]	30.318	15.125	86.920
DFIL [28]	60.976	30.356	20.811
GRACE (Ours)	70.751	35.246	29.661

The primary performance assessment comparing the handling of invalid facial images between our proposed model, GRACE, and various state-of-the-art schemes is provided in Table II. Under optimal conditions, where most facial images are valid, GRACE exhibits competitive results, holding its own against other cutting-edge DeepFake video detection methods such as Xception [6], MesoNet [5], $F^{3}$ -Net [32], RECCE [27], CORE [26], TALL-Swin [53], and DFIL [28]. It is worth noting that TALL-Swin [53] is a video-based approach.

Specifically, the F1-Score of GRACE for DeepFake video detection, when evaluated on FF++ [1], Celeb-DF [3], and DFDC [2], slightly surpasses those of its contemporaries under clean cases (i.e., $m_{r}=0$ ). This outcome implies that the proposed GRACE with Graph Laplacian is effective and reliable for DeepFake video detection. However, in scenarios where partial face images are invalid due to purposeful attacks on face detectors, the performance of traditional frame-level methods, including Xception [6], MesoNet [5], $F^{3}$ -Net [32], RECCE [27], CORE [26], UCF [30], and DFIL [28], may substantially deteriorate since they fail to consider noisy face sequences in real-world scenarios.

Similarly, the video-level DeepFake detection method TALL-Swin [53], which relies heavily on temporal cues, suffers further performance degradation as the masking ratio increases. Invalid faces can cause landmark detection failures and incorrect temporal trajectories. Consequently, the F1-Score of TALL-Swin under a masking ratio of 0.8 in the testing phase is lower than 0.7 on all three datasets, suggesting that its predictions collapse toward a single class (entirely fake or entirely real). In stark contrast, all quality indices of our proposed GRACE, evaluated on different datasets, display promising results, suggesting that GRACE is robust and reliable even under highly noisy face sequences (e.g., when $m_{r}=0.8$). Since most DeepFake detection methods do not account for unreliable face sequences, their degraded performance is to be expected.

To further demonstrate the efficiency and practicality of the proposed GRACE method, we conduct a complexity analysis and compare it with other state-of-the-art DeepFake detection methods. Table III presents the comparison in terms of floating-point operations (FLOPs), multiply-accumulate operations (MACs), and the number of parameters for each method, computed on a $16\times 3\times 144\times 144$ input tensor for a fair comparison. With 70.751 trillion FLOPs, 35.246 trillion MACs, and 29.661 million parameters, GRACE exhibits a moderate computational overhead. It requires fewer FLOPs and MACs than UCF [30] and RECCE [27], and fewer parameters than TALL-Swin [29], UCF [30], and RECCE [27]; TALL-Swin [29] has the lowest FLOPs and MACs but by far the largest number of parameters. Moreover, GRACE demonstrates superior performance in handling noisy face sequences, as shown in the experimental results, while having a complexity similar to that of methods like CORE [26], Xception [6], and DFIL [28]. This highlights the effectiveness of the proposed feature entanglement, graph convolutional network, and graph Laplacian regularization techniques in learning discriminative and robust representations for DeepFake detection. The complexity analysis further supports GRACE as a practical and efficient solution for real-world DeepFake detection, offering a compelling trade-off between computational resources and detection accuracy.
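The complexity comparison can be checked directly against the values reported in Table III. The short sketch below copies those values into a dictionary (the variable names are assumptions) and lists which methods need more FLOPs or more parameters than GRACE.

```python
# Values copied from Table III: (FLOPs in T, MACs in T, parameters in M).
TABLE_III = {
    "Xception": (60.796, 30.356, 21.861),
    "F3-Net": (192.604, 95.880, 22.125),
    "RECCE": (81.655, 40.667, 47.693),
    "UCF": (180.738, 90.087, 46.838),
    "CORE": (60.978, 30.356, 21.861),
    "TALL-Swin": (30.318, 15.125, 86.920),
    "DFIL": (60.976, 30.356, 20.811),
    "GRACE": (70.751, 35.246, 29.661),
}

grace_flops, grace_macs, grace_params = TABLE_III["GRACE"]

# Methods that cost more FLOPs, or hold more parameters, than GRACE.
heavier_flops = sorted(m for m, (f, _, _) in TABLE_III.items() if f > grace_flops)
more_params = sorted(m for m, (_, _, p) in TABLE_III.items() if p > grace_params)

# FLOPs per MAC; a value close to 2 reflects one multiply plus one add.
flops_per_mac = {m: f / mac for m, (f, mac, _) in TABLE_III.items()}
```

For every method in the table the ratio of FLOPs to MACs is close to 2, so either column ranks the methods in the same order.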

The detailed quantitative results, evaluated on the FF++ [1], Celeb-DF [3], and DFDC [2] datasets, are illustrated in Figures 3(a) to 3(c), respectively. In the clean case, i.e., when $m_{r}=0$, the performance of the proposed method is comparable to that of other state-of-the-art methods. The performance degradation of the peer methods becomes increasingly pronounced as the masking ratio rises during the testing phase, particularly when the masking ratio ($m_{r}$) exceeds 0.5. The performance of TALL-Swin [53] also declines once the masking ratio surpasses 0.2. A similar trend is discernible in Fig. 3(c), which evaluates the DFDC testing set: the performance of contemporary methods diminishes at higher masking ratios, whereas the proposed GRACE maintains relatively high performance even at a masking ratio of 0.8. We also plot the AUC comparison between the proposed GRACE and the peer methods in Fig. 3. The results show that GRACE significantly outperforms other state-of-the-art DeepFake detectors, especially under noisy face sequences.

More specifically, most existing DeepFake video/image detection algorithms do not address the impact of noisy face sequences. Although state-of-the-art face detectors perform exceptionally well under pristine conditions, their performance can be severely undermined when subjected to well-engineered post-processing techniques, particularly adversarial perturbations targeting the face detector. Our GRACE method successfully overcomes this shortcoming and introduces a novel and robust DeepFake video detection approach for real-world challenges.

3.3 Hyperparameters Selection

TABLE IV: Performance evaluation of the proposed GRACE with different hyperparameter settings using FF++ [1]. $g_{\text{dim}}$ and $g_{n}$ are the embedding dimension and number of layers of the GCN, respectively; $N$ is the number of frames extracted from the video; $n_{\text{out}}$ is the number of neurons of the FC; $\alpha$ is the weight of the sparsity constraint.

$m_{r}$	$N$	ACC	F1	AUC	$g_{n}$	ACC	F1	AUC
0.8	12	0.922	0.875	0.971	12	0.856	0.711	0.950
0.8	20	0.948	0.918	0.978	4	0.948	0.918	0.982
0.8	16	0.944	0.916	0.983	8	0.944	0.916	0.983
0.7	12	0.958	0.936	0.983	12	0.896	0.816	0.964
0.7	20	0.954	0.930	0.986	4	0.952	0.926	0.983
0.7	16	0.960	0.938	0.985	8	0.960	0.938	0.985
$m_{r}$	$\alpha$	ACC	F1	AUC	$g_{\text{dim}}$	ACC	F1	AUC
0.8	$10^{-7}$	0.928	0.887	0.978	600	0.896	0.805	0.966
0.8	$10^{-6}$	0.942	0.910	0.974	200	0.934	0.886	0.981
0.8	$10^{-5}$	0.944	0.916	0.983	400	0.944	0.916	0.983
0.7	$10^{-7}$	0.954	0.930	0.981	600	0.938	0.896	0.984
0.7	$10^{-6}$	0.944	0.914	0.979	200	0.942	0.905	0.987
0.7	$10^{-5}$	0.960	0.938	0.985	400	0.960	0.938	0.985
$m_{r}$	$n_{\text{out}}$	ACC	F1	AUC
0.8	1024	0.926	0.871	0.969
0.8	3072	0.924	0.876	0.939
0.8	2048	0.944	0.916	0.983
0.7	1024	0.944	0.908	0.984
0.7	3072	0.926	0.880	0.968
0.7	2048	0.960	0.938	0.985

To achieve optimal performance and robustness, we conducted a comprehensive ablation study to investigate the impact of various hyperparameters on the proposed GRACE method. This analysis provides valuable insights into the design choices and trade-offs involved in developing an effective DeepFake video detection system for real-world scenarios with noisy face sequences. Table IV presents the performance comparison of GRACE under different hyperparameter settings, evaluated on the challenging FF++ dataset [1].

3.3.1 Number of Extracted Frames ( $N$ )

The number of frames employed during the training and testing phases is a crucial aspect of GRACE. While using a larger number of frames might intuitively improve performance, it also significantly increases the computational complexity. To strike a balance, we investigated the impact of varying the number of extracted frames. As shown in Table IV, using the smallest number of frames results in the lowest computational complexity but slightly compromises performance in terms of Accuracy, Macro F1-Score, and AUC. Conversely, increasing the number of frames to $N=20$ yields performance close to that of $N=16$ at a higher computational cost. Considering the trade-off between effectiveness and efficiency, we recommend using $N=16$ frames as the default setting for GRACE.

3.3.2 Number of GCN Layers ( $g_{n}$ )

The depth of the Graph Convolutional Network (GCN) plays a vital role in learning robust feature representations. However, stacking too many layers with the Graph Laplacian smooth prior may lead to over-smoothing of nodes and reduce the discriminative power. We explored the impact of varying the number of GCN layers ( $g_{n}$ ) in GRACE. As presented in Table IV, setting $g_{n}=12$ results in suboptimal performance compared to $g_{n}=8$ and $g_{n}=4$ , likely due to convergence difficulties within the given 200 epochs. While $g_{n}=4$ achieves outstanding performance overall, it slightly underperforms in highly noisy conditions (i.e., $m_{r}=0.8$ ) compared to $g_{n}=8$ . Therefore, we suggest using $g_{n}=8$ as a balanced choice for stable and robust performance across various noise levels.
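The over-smoothing effect mentioned above can be illustrated on a toy graph. The sketch below is an assumption-laden example and not the GRACE propagation rule: it uses a 6-node cycle graph, scalar node features, and the averaging operator $\mathbf{I}-\tfrac{1}{2}\mathbf{L}$ with $\mathbf{L}=\mathbf{I}-\mathbf{A}/2$, then measures how the spread between node features shrinks as more smoothing layers are stacked.

```python
def smooth_once(x):
    """One smoothing step on a cycle graph: x_i <- 0.5*x_i + 0.25*(left + right)."""
    n = len(x)
    return [0.5 * x[i] + 0.25 * (x[i - 1] + x[(i + 1) % n]) for i in range(n)]


def spread_after(x, layers):
    """Max-minus-min of the node features after `layers` smoothing steps."""
    for _ in range(layers):
        x = smooth_once(x)
    return max(x) - min(x)


# Toy scalar features on 6 nodes (hypothetical values).
features = [1.0, 0.0, 0.0, 0.0, 1.0, 0.0]
spreads = {k: spread_after(features, k) for k in (0, 4, 8, 12)}
```

As the depth grows, the node features approach a common value and become harder to tell apart, which is the qualitative behavior referred to as over-smoothing.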

3.3.3 Sparsity Penalty Term ( $\alpha$ )

The sparsity penalty term $\alpha$ in the proposed loss function controls the balance between the sparsity constraint and the classification objective. A higher value of $\alpha$ encourages GRACE to learn a sparser feature representation, which is particularly beneficial for DeepFake video detection in the presence of invalid facial images. We investigated the impact of $\alpha$ by varying its value from $10^{-7}$ to $10^{-5}$. As shown in Table IV, a higher sparsity penalty enhances the network’s ability to learn essential and discriminative features, thereby reducing the influence of invalid faces and improving overall performance. However, setting $\alpha$ higher than $10^{-5}$ leads to convergence difficulties. Based on our analysis, we recommend using $\alpha=10^{-5}$ to achieve a balanced trade-off between sparsity and convergence stability.

3.3.4 GCN Embedding Dimension ( $g_{\text{dim}}$ )

The embedding dimension of the GCN ($g_{\text{dim}}$) determines the richness of the learned feature representations for DeepFake video detection. We investigated the impact of $g_{\text{dim}}$ by comparing the performance of GRACE with $g_{\text{dim}}\in\{200,400,600\}$, as shown in Table IV. Since the dimension of the graph representation ${\bm{A}}$ is $400\times 400$, the best performance is, intuitively, achieved when $g_{\text{dim}}=400$. Reducing $g_{\text{dim}}$ below this value limits the expressive power of the GCN, while increasing it beyond this value introduces redundancy and harms performance. Therefore, we suggest setting $g_{\text{dim}}=400$ for optimal results.

3.3.5 Number of Fully Connected Layer Neurons ( $n_{\text{out}}$ )

To aggregate the output of the GCN and feed it into the softmax classifier, a simple fully connected (FC) layer is employed, projecting the graph representation to an $n_{\text{out}}$-dimensional feature vector. We investigated the impact of $n_{\text{out}}$ by comparing the performance of GRACE with $n_{\text{out}}\in\{1024,2048,3072\}$, as shown in Table IV. $n_{\text{out}}=2048$ achieves the best performance under highly noisy face sequences, while the gap to $n_{\text{out}}=1024$ is modest, suggesting that GRACE is not highly sensitive to the choice of $n_{\text{out}}$. Based on our analysis, we recommend setting $n_{\text{out}}=2048$ for a good balance between performance and computational complexity.

The comprehensive analysis of the hyperparameters presented in this section highlights the robustness and effectiveness of the proposed GRACE method under various hyperparameter settings. By carefully selecting these hyperparameters, GRACE achieves state-of-the-art performance in DeepFake video detection, even in challenging real-world scenarios with noisy face sequences. The insights gained from this analysis provide valuable guidance for practitioners and researchers aiming to develop robust and efficient DeepFake detection systems.

3.4 Ablation Study

Table V presents an ablation study of the proposed modules in GRACE, i.e., the GCN, the Graph Laplacian smooth prior, and the sparsity regularizer, where the performance is evaluated on noisy face sequences ($m_{r}=0.8$ and $m_{r}=0.7$). Note that when none of the proposed modules is adopted, we use a Transformer [46] as the classification head, with four-head multi-head self-attention (MHSA) and an embedding size of 512 to keep the number of parameters comparable to that of GRACE; this baseline can be treated as a variant of the Convolutional Transformer. When we enable the GCN with the proposed feature entanglement and its affinity matrix, the performance of DeepFake video detection under noisy face sequences improves substantially, implying that the feature entanglement and its graph representation judiciously embed the different spatiotemporal features into every node, thereby reducing the impact of invalid faces. Furthermore, the Graph Laplacian smooth prior improves robustness, since it filters noisy nodes that might contain many invalid faces without significantly increasing computational complexity. As shown in Fig. 4, the convergence of the proposed GRACE with Graph Laplacian remains stable and shows outstanding performance on the FF++ [1] validation set with $m_{r}=0$. Finally, the sparsity regularizer significantly benefits the robustness of DeepFake detection, since it encourages GRACE to learn essential and sparse features from the valid faces.

TABLE V: Ablation study of the proposed GRACE using different classification heads and components. GCN, GL, and Spa. denote the Graph Convolutional Network, Graph Laplacian smooth prior, and Sparsity regularizer, respectively.

$m_{r}$	GCN	GL	Spa.	ACC	F1	AUC	#Params / MACs (T)
0.8				0.844	0.655	0.705	40.79M / 38.82
0.8	✓			0.926	0.873	0.931	29.66M / 35.25
0.8	✓	✓		0.924	0.876	0.977
0.8	✓		✓	0.946	0.912	0.975
0.8	✓	✓	✓	0.944	0.916	0.983
0.7				0.858	0.700	0.794	40.79M / 38.82
0.7	✓			0.938	0.900	0.965	29.66M / 35.25
0.7	✓	✓		0.950	0.920	0.982
0.7	✓		✓	0.952	0.925	0.974
0.7	✓	✓	✓	0.960	0.938	0.985

3.5 Adversarial Attack on Face Detector

As previously discussed, DeepFake videos can be intentionally perturbed to evade face detectors, rendering DeepFake detection ineffective. To emulate this real-world challenge, we employ an open-source adversarial attack on the MTCNN face detector [54], leveraging a PGD-like algorithm [39] with a maximum perturbation of $\epsilon=0.04$, step size $\alpha_{\text{adv}}=0.01$, and $s=10$ iterations to perturb the test set of FF++ [1]. The step size $\alpha_{\text{adv}}$ determines the magnitude of each perturbation step applied to the input image during the iterative attack. To assess the rate of missed detections, we assume that each frame must contain at least one face detectable by MTCNN; a black image replaces the frame when no face is detected. The results reveal that the average number of missed face detections is 3.58 for the FF++ [1] test set, closely mirroring an $m_{r}=0.2$ scenario. However, the perturbed face sequences not only make it more difficult to recognize whether the video is fake, because the adversarial noise affects the image quality, as exemplified in Fig. 5, but also introduce another challenge: the adversarial attack can cause the face detector to extract non-facial regions (e.g., background). This implies that the actual $m_{r}$ in this case could be even higher than the estimated $0.2$, as some of the detected faces might not be genuine facial regions. Despite these challenging conditions, our GRACE method maintains strong performance compared to other state-of-the-art methods, as illustrated in Table VI. Notably, the performance of all peer methods dropped significantly, partly due to the missed detections and partly because the adversarial noise introduces spatial distortions in the facial images. This real-world simulation further supports GRACE as a generalized, robust, and effective DeepFake detection model capable of handling noisy face sequences.
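The iterative attack described above follows the general pattern of projected gradient descent under an $\ell_{\infty}$ budget. The sketch below is a generic illustration of that pattern, not the open-source attack used in the experiments: the function names, the toy gradient, the $[0,1]$ pixel range, and the use of plain Python lists in place of image tensors are all assumptions.

```python
def pgd_linf(x0, grad_fn, eps=0.04, step=0.01, iters=10):
    """Generic L-infinity PGD: signed gradient steps projected into an eps-ball.

    x0: clean input as a flat list of pixel values in [0, 1] (assumed range).
    grad_fn: callable returning the gradient of the attack loss w.r.t. x;
             in a real attack this would come from the face detector.
    """
    x = list(x0)
    for _ in range(iters):
        g = grad_fn(x)
        for i, gi in enumerate(g):
            sign = (gi > 0) - (gi < 0)
            xi = x[i] + step * sign
            xi = min(max(xi, x0[i] - eps), x0[i] + eps)  # project into the eps-ball
            x[i] = min(max(xi, 0.0), 1.0)  # keep a valid pixel value
    return x


def toy_grad(x):
    """Hypothetical stand-in for a detector gradient (constant direction)."""
    return [1.0, -1.0, 0.0]


adv = pgd_linf([0.50, 0.50, 0.50], toy_grad)
```

With $s=10$ steps of size 0.01 the unconstrained shift would be 0.10, so the projection onto the $\epsilon=0.04$ ball is what bounds the final perturbation. For reference, 3.58 missed detections in a 16-frame sequence corresponds to a ratio of about 0.22.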

TABLE VI: The performance comparison of the proposed GRACE and other methods trained on FF++ under simulated real-world scenarios (i.e., adversarial attack on face detector).

	FF++ [1]
Method	ACC	F1	AUC
Xception [6]	0.614	0.565	0.723
$F^{3}$ -net [32]	0.698	0.601	0.754
RECCE [27]	0.760	0.633	0.824
UCF [30]	0.745	0.613	0.811
CORE [26]	0.712	0.590	0.754
TALL-Swin [53]	0.796	0.715	0.848
DFIL [28]	0.755	0.691	0.805
GRACE (Ours)	0.910	0.883	0.937

3.6 Limitations and Discussion

This study introduces a novel approach, GRACE, to address the challenge of DeepFake video detection in the presence of noisy face sequences. GRACE leverages feature entanglement with sparse constraints and a graph convolutional network with graph Laplacian regularization to effectively exploit the spatial-temporal correlations in face sequences while suppressing the impact of noise and distortions. The experimental results demonstrate the efficacy of GRACE in handling noisy face sequences and achieving state-of-the-art performance on several benchmark datasets.

However, it is essential to acknowledge the limitations of the current study and discuss potential future directions. One limitation is that while GRACE has shown strong performance on the evaluated datasets, its effectiveness on cross-dataset scenarios, where the training and testing data come from different sources, has not been extensively explored. The robustness of GRACE to domain shifts and variations in noise characteristics across different datasets requires further investigation. Nonetheless, it is worth emphasizing that GRACE represents the first dedicated effort to tackle the problem of noisy face sequences in DeepFake video detection, which has been largely overlooked in previous research. The proposed methodology and insights from this study lay a solid foundation for future work in this important direction.

Another aspect to consider is that GRACE currently does not incorporate masked learning strategies, which have shown promise in handling occlusions and missing data in various computer vision tasks. Integrating masked learning techniques into the GRACE framework could potentially further enhance its robustness to partial occlusions and incomplete face sequences. Moreover, the use of graph convolutional networks in GRACE allows for flexible processing of video frames, as the input frames are not required to be strictly sequential. This property could be leveraged to develop more efficient and adaptive sampling strategies for processing long video sequences.

It is also worth noting that while GRACE has demonstrated significant improvements over existing methods, there is still room for further enhancements. One direction could be to explore more advanced graph neural network architectures, such as graph attention networks or graph transformers, to better capture the complex dependencies and interactions among the spatial-temporal features. Additionally, incorporating prior knowledge or constraints specific to the DeepFake detection domain, such as the consistency of facial landmarks or the coherence of audio-visual signals, could potentially boost the performance and generalizability of the proposed approach.

In conclusion, GRACE represents a significant step forward in addressing the challenge of DeepFake video detection in the presence of noisy face sequences. While acknowledging the limitations and potential areas for improvement, we believe that the proposed methodology opens up new avenues for research in this critical domain. Future work could focus on extending GRACE to handle cross-dataset scenarios, integrating masked learning techniques, exploring more advanced graph neural network architectures, and incorporating domain-specific prior knowledge. As DeepFake techniques continue to evolve and become more sophisticated, developing robust and reliable detection methods that can operate effectively in real-world scenarios with noisy and challenging data remains an ongoing research endeavor of paramount importance.

4 Conclusions

In this work, we proposed a robust and generalized Graph-Regularized Attentive Convolutional Entanglement (GRACE) approach for DeepFake video detection, specifically designed to address the challenges posed by noisy and unreliable face sequences. The proposed GRACE framework leverages spatiotemporal feature entanglement, graph convolutional networks, and graph Laplacian regularization to effectively capture discriminative features while mitigating the impact of invalid facial images. Extensive experiments on benchmark datasets, including FF++ [1], Celeb-DF [3], and DFDC [2], demonstrate the superior performance of GRACE compared to state-of-the-art methods, especially under noisy face sequences. The robustness of GRACE is further validated through real-world simulations involving adversarial attacks on face detectors. The proposed GRACE represents a significant step forward in robust and generalized DeepFake video detection under challenging conditions, contributing to the development of more reliable multimedia forensics techniques in the era of deepfakes.

References

[1] A. Rossler, D. Cozzolino, L. Verdoliva, C. Riess, J. Thies, and M. Nießner, “FaceForensics++: Learning to detect manipulated facial images,” in Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 1–11.
[2] B. Dolhansky, J. Bitton, B. Pflaum, J. Lu, R. Howes, M. Wang, and C. C. Ferrer, “The deepfake detection challenge (DFDC) dataset,” arXiv preprint arXiv:2006.07397, 2020.
[3] Y. Li, X. Yang, P. Sun, H. Qi, and S. Lyu, “Celeb-DF: A large-scale challenging dataset for deepfake forensics,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 3207–3216.
[4] B. Zi, M. Chang, J. Chen, X. Ma, and Y.-G. Jiang, “WildDeepfake: A challenging real-world dataset for deepfake detection,” in Proceedings of the 28th ACM international conference on multimedia, 2020, pp. 2382–2390.
[5] D. Afchar, V. Nozick, J. Yamagishi, and I. Echizen, “MesoNet: A compact facial video forgery detection network,” in 2018 IEEE international workshop on information forensics and security (WIFS). IEEE, 2018, pp. 1–7.
[6] F. Chollet, “Xception: Deep learning with depthwise separable convolutions,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 1251–1258.
[7] H. Mo, B. Chen, and W. Luo, “Fake faces identification via convolutional neural network,” in Proceedings of the 6th ACM workshop on information hiding and multimedia security, 2018, pp. 43–47.
[8] F. Marra, D. Gragnaniello, D. Cozzolino, and L. Verdoliva, “Detection of GAN-generated fake images over social networks,” in 2018 IEEE conference on multimedia information processing and retrieval (MIPR). IEEE, 2018, pp. 384–389.
[9] S.-Y. Wang, O. Wang, R. Zhang, A. Owens, and A. A. Efros, “CNN-generated images are surprisingly easy to spot… for now,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 8695–8704.
[10] H. H. Nguyen, F. Fang, J. Yamagishi, and I. Echizen, “Multi-task learning for detecting and segmenting manipulated facial images and videos,” in 2019 IEEE 10th international conference on biometrics theory, applications and systems (BTAS). IEEE, 2019, pp. 1–8.
[11] M. Kim, S. Tariq, and S. S. Woo, “Fretal: Generalizing deepfake detection using knowledge distillation and representation learning,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 1001–1012.
[12] H. H. Nguyen, J. Yamagishi, and I. Echizen, “Use of a capsule network to detect fake images and videos,” arXiv preprint arXiv:1910.12467, 2019.
[13] C.-C. Hsu, Y.-X. Zhuang, and C.-Y. Lee, “Deep fake image detection based on pairwise learning,” Applied Sciences, vol. 10, no. 1, p. 370, 2020.
[14] Y.-X. Zhuang and C.-C. Hsu, “Detecting generated image based on a coupled network with two-step pairwise learning,” in 2019 IEEE international conference on image processing (ICIP). IEEE, 2019, pp. 3212–3216.
[15] C.-C. Hsu, C.-Y. Lee, and Y.-X. Zhuang, “Learning to detect fake face images in the wild,” in 2018 international symposium on computer, consumer and control (IS3C). IEEE, 2018, pp. 388–391.
[16] I. Masi, A. Killekar, R. M. Mascarenhas, S. P. Gurudatt, and W. AbdAlmageed, “Two-branch recurrent network for isolating deepfakes in videos,” in European conference on computer vision. Springer, 2020, pp. 667–684.
[17] D. Güera and E. J. Delp, “Deepfake video detection using recurrent neural networks,” in 2018 15th IEEE international conference on advanced video and signal based surveillance (AVSS). IEEE, 2018, pp. 1–6.
[18] L. Li, J. Bao, T. Zhang, H. Yang, D. Chen, F. Wen, and B. Guo, “Face x-ray for more general face forgery detection,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 5001–5010.
[19] P. Wang, K. Liu, W. Zhou, H. Zhou, H. Liu, W. Zhang, and N. Yu, “ADT: Anti-deepfake transformer,” in ICASSP 2022-2022 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, 2022, pp. 2899–2903.
[20] Z. Sun, Y. Han, Z. Hua, N. Ruan, and W. Jia, “Improving the efficiency and robustness of deepfakes detection through precise geometric features,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 3609–3618.
[21] E. Sabir, J. Cheng, A. Jaiswal, W. AbdAlmageed, I. Masi, and P. Natarajan, “Recurrent convolutional strategies for face manipulation detection in videos,” Interfaces (GUI), vol. 3, no. 1, pp. 80–87, 2019.
[22] X. Yang, Y. Li, and S. Lyu, “Exposing deep fakes using inconsistent head poses,” in ICASSP 2019-2019 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, 2019, pp. 8261–8265.
[23] Y. Li, M.-C. Chang, and S. Lyu, “In ictu oculi: Exposing AI created fake videos by detecting eye blinking,” in 2018 IEEE International workshop on information forensics and security (WIFS). IEEE, 2018, pp. 1–7.
[24] U. A. Ciftci, I. Demir, and L. Yin, “Fakecatcher: Detection of synthetic portrait videos using biological signals,” IEEE transactions on pattern analysis and machine intelligence, 2020.
[25] Y. Li and S. Lyu, “Exposing deepfake videos by detecting face warping artifacts,” arXiv preprint arXiv:1811.00656, 2018.
[26] Y. Ni, D. Meng, C. Yu, C. Quan, D. Ren, and Y. Zhao, “Core: Consistent representation learning for face forgery detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 12–21.
[27] J. Cao, C. Ma, T. Yao, S. Chen, S. Ding, and X. Yang, “End-to-end reconstruction-classification learning for face forgery detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 4113–4122.
[28] K. Pan, Y. Yin, Y. Wei, F. Lin, Z. Ba, Z. Liu, Z. Wang, L. Cavallaro, and K. Ren, “Dfil: Deepfake incremental learning by exploiting domain-invariant forgery clues,” in Proceedings of the 31st ACM International Conference on Multimedia, 2023, pp. 8035–8046.
[29] R. Creager, “hideface: Exploring a Non-Traditional Adversarial Attack,” https://github.com/rccreager/hideface, 2022.
[30] Z. Yan, Y. Zhang, Y. Fan, and B. Wu, “Ucf: Uncovering common features for generalizable deepfake detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 22 412–22 423.
[31] J. Frank, T. Eisenhofer, L. Schönherr, A. Fischer, D. Kolossa, and T. Holz, “Leveraging frequency analysis for deep fake image recognition,” in International conference on machine learning. PMLR, 2020, pp. 3247–3258.
[32] Y. Qian, G. Yin, L. Sheng, Z. Chen, and J. Shao, “Thinking in frequency: Face forgery detection by mining frequency-aware clues,” in European conference on computer vision. Springer, 2020, pp. 86–103.
[33] N. Carlini and H. Farid, “Evading deepfake-image detectors with white-and black-box attacks,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, 2020, pp. 658–659.
[34] S. Hussain, P. Neekhara, M. Jere, F. Koushanfar, and J. McAuley, “Adversarial deepfakes: Evaluating vulnerability of deepfake detectors to adversarial examples,” in Proceedings of the IEEE/CVF winter conference on applications of computer vision, 2021, pp. 3348–3357.
[35] P. Neekhara, B. Dolhansky, J. Bitton, and C. C. Ferrer, “Adversarial threats to deepfake detection: A practical perspective,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 923–932.
[36] G.-L. Chen and C.-C. Hsu, “Jointly defending deepfake manipulation and adversarial attack using decoy mechanism,” IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–11, 2023.
[37] A. J. Bose and P. Aarabi, “Adversarial attacks on face detectors using neural net based constrained optimization,” in 2018 IEEE 20th international workshop on multimedia signal processing (MMSP). IEEE, 2018, pp. 1–6.
[38] I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” arXiv preprint arXiv:1412.6572, 2014.
[39] N. Carlini and D. Wagner, “Towards evaluating the robustness of neural networks,” in 2017 IEEE symposium on security and privacy (SP). IEEE, 2017, pp. 39–57.
[40] S. Baluja and I. Fischer, “Adversarial transformation networks: Learning to generate adversarial examples,” arXiv preprint arXiv:1703.09387, 2017.
[41] C. Xie, J. Wang, Z. Zhang, Y. Zhou, L. Xie, and A. Yuille, “Adversarial examples for semantic segmentation and object detection,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 1369–1378.
[42] J. Deng, J. Guo, E. Ververas, I. Kotsia, and S. Zafeiriou, “RetinaFace: Single-shot multi-level face localisation in the wild,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 5203–5212.
[43] D. E. King, “Dlib-ml: A machine learning toolkit,” The Journal of Machine Learning Research, vol. 10, pp. 1755–1758, 2009.
[44] D. Neimark, O. Bar, M. Zohar, and D. Asselmann, “Video transformer network,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 3163–3172.
[45] D. Wodajo and S. Atnafu, “Deepfake video detection using convolutional vision transformer,” arXiv preprint arXiv:2102.11126, 2021.
[46] A. Dosovitskiy and L. Beyer, “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.
[47] S. Zhang, H. Tong, J. Xu, and R. Maciejewski, “Graph convolutional networks: a comprehensive review,” Computational Social Networks, vol. 6, no. 1, pp. 1–23, 2019.
[48] F. Wu, A. Souza, T. Zhang, C. Fifty, T. Yu, and K. Weinberger, “Simplifying graph convolutional networks,” in International conference on machine learning. PMLR, 2019, pp. 6861–6871.
[49] A. Ortega, P. Frossard, J. Kovačević, J. M. Moura, and P. Vandergheynst, “Graph signal processing: Overview, challenges, and applications,” Proceedings of the IEEE, vol. 106, no. 5, pp. 808–828, 2018.
[50] X. Dong, D. Thanou, P. Frossard, and P. Vandergheynst, “Laplacian matrix learning for smooth graph signal representation,” in 2015 IEEE international conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 3736–3740.
[51] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2017.
[52] A. Bochkovskiy, C.-Y. Wang, and H.-Y. M. Liao, “YOLOv4: Optimal speed and accuracy of object detection,” arXiv preprint arXiv:2004.10934, 2020.
[53] Y. Xu, J. Liang, G. Jia, Z. Yang, Y. Zhang, and R. He, “Tall: Thumbnail layout for deepfake video detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 22 658–22 668.
[54] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao, “Joint face detection and alignment using multitask cascaded convolutional networks,” IEEE signal processing letters, vol. 23, no. 10, pp. 1499–1503, 2016.