Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding

Abstract

AI-synthesized text and images have gained significant attention, particularly due to the widespread dissemination of multi-modal manipulations on the internet, which has resulted in numerous negative impacts on society. Existing methods for multi-modal manipulation detection and grounding primarily focus on fusing vision-language features to make predictions, while overlooking the importance of modality-specific features, leading to sub-optimal results. In this paper, we construct a simple and novel transformer-based framework for multi-modal manipulation detection and grounding tasks. Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment. To achieve this, we introduce visual/language pre-trained encoders and dual-branch cross-attention (DCA) to extract and fuse modality-unique features. Furthermore, we design decoupled fine-grained classifiers (DFC) to enhance modality-specific feature mining and mitigate modality competition. Moreover, we propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality using learnable queries, thereby improving the discovery of forged details. Extensive experiments on the $\rm DGM^{4}$ dataset demonstrate the superior performance of our proposed model compared to state-of-the-art approaches.

Index Terms— multi-modal, media manipulation, transformer, modality-specific

1 INTRODUCTION

Fig. 1: The overall architecture of our framework. 1) Image and text features are extracted and fused through uni-modal encoders $E_{i}$ , $E_{t}$ , and modality interaction modules $M_{i}$ , $M_{t}$ . 2) Decoupled fine-grained classifiers $C_{i}$ , $C_{t}$ and binary classifier $C_{b}$ take image embedding $i_{cls}$ , text embedding $t_{cls}$ , and concatenated embeddings $\{i_{cls},t_{cls}\}$ as inputs, respectively. 3) Image embeddings $i_{pat}$ and text embeddings $t_{tok}$ are separately fed into the implicit manipulation query module and grounding heads.

The rapid development of deep generative models and large language models has made it easy to generate massive amounts of fake facial images, videos, and synthetic text. These deepfake products [1, 2] have the potential to spread widely on social media. Consequently, this threat has garnered significant attention in the fields of computer vision and natural language processing, leading to the proposal of various methods for detecting fake faces and AI-generated text [3, 4, 5, 6, 7, 8, 9, 10]. However, these methods mostly focus on a single modality, whereas multi-modal media content in the form of image-text pairs is more prevalent in everyday life. As a result, multi-modal fake content is more likely to spread widely and cause social problems.

Previous research on multi-modal fake news primarily concentrates on the binary classification problem of determining news authenticity. CMC [11] leverages knowledge distillation to capture cross-modal feature correlations during training. CMAC [12] combines adversarial learning and contrastive learning to obtain multi-modal fused representations with modality invariance and clear class distributions. However, these methods [11, 12, 13] are unable to determine specific types of manipulation or localize manipulation positions, which limits their applicability and interpretability. HAMMER [14] constructs the first dataset for the detection and grounding of multi-modal manipulated content. Furthermore, it proposes a contrastive learning-based approach for modality alignment as well as shallow and deep manipulation reasoning. However, this approach overlooks the importance of modality-specific features, potentially leading to underutilization of the rich information present in each modality and resulting in sub-optimal performance.

In order to exploit modality-specific features, we construct a simple and novel transformer-based framework for multi-modal manipulation detection and grounding. We utilize visual/language pre-trained uni-modal encoders to extract modality-unique features. These features are then fused by dual-branch cross-attention (DCA) to summarize multi-modal information while preserving the individual characteristics of each modality. Furthermore, we introduce decoupled fine-grained classifiers (DFC) to mitigate modality competition, in which one modality is learned well while the other is not fully explored. Additionally, we propose an implicit manipulation query (IMQ) that facilitates reasoning between learnable queries and the global context of images or text, enabling the model to discover potential forged clues.

The main contributions of our paper are as follows: (1) We construct a simple and novel transformer-based framework for multi-modal manipulation detection and grounding. (2) We introduce DCA to fuse modality-unique features while maintaining uni-modal characteristics. Additionally, we design DFC and IMQ to promote comprehensive exploration of each modality and enhance intra-modal interactions, respectively. (3) We conduct experiments on the $\rm DGM^{4}$ dataset, and the results demonstrate the superiority and effectiveness of our approach.

2 METHODOLOGY

2.1 Feature Extraction

In order to capture modality-specific features for images and text, we employ two visual/language pre-trained uni-modal feature extractors, denoted as $E_{i}$ for images and $E_{t}$ for text. As illustrated in Figure 1, given an image-text pair $(I,T)$ , we divide the input image $I$ into $N$ patches and insert a [CLS] token. Similarly, we segment the input text $T$ into $M$ tokens and insert a [CLS] token. The image features $f_{i}$ and text features $f_{t}$ are separately extracted by $E_{i}$ and $E_{t}$ .

$$f_{i}=E_{i}(I),\quad f_{t}=E_{t}(T). \tag{1}$$

In a single modality, manipulation cues are often subtle and not easily discernible. However, in multi-modal manipulation content, there can be distinctive information that differs between the modalities. To effectively reason about the correlations between images and text, a deep level of interaction and fusion between their features is crucial. We employ a dual-branch cross-attention (DCA) mechanism, as depicted in Figure 1, to guide the interaction between image features $f_{i}$ and text features $f_{t}$ . Unlike HAMMER [14], which uses a single-stream interaction where text features serve as queries while image features remain unchanged, DCA allows each embedding to capture modality-contextual information while enabling deep interactions with information from the other modality.

The attention function is applied to query ( $Q$ ), key ( $K$ ), and value ( $V$ ) features, each with a hidden size of $D$ , as follows:

$$\text{Attention}(Q,K,V)=\text{Softmax}(QK^{T}/\sqrt{D})V. \tag{2}$$
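As an illustration of Eq. (2), the following is a minimal, dependency-free sketch of scaled dot-product attention. It is not the implementation used in our experiments: it is single-head, omits the learned projections, and the toy vectors are made up for the example. Taking queries from one modality and keys/values from the other gives one branch of the cross-attention described in this section.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention, Eq. (2): Softmax(Q K^T / sqrt(D)) V.

    Q is a list of n query vectors; K and V are lists of m key/value vectors.
    Queries and keys share the hidden size D.
    """
    D = len(Q[0])
    out = []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(D).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(D) for k in K]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out


# Toy cross-attention (hypothetical values): two "image" queries attend
# over three "text" tokens, which act as both keys and values.
image_queries = [[1.0, 0.0], [0.0, 1.0]]
text_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = attention(image_queries, text_tokens, text_tokens)
```

Each output row is a convex combination of the value vectors, so swapping the roles of the two modalities yields the other branch.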

In DCA, given queries from one modality (e.g., image), keys and values are taken only from the other modality (e.g., text). We denote the two modality interaction modules as $M_{i}$ and $M_{t}$ , where the cross-attention in $M_{i}$ uses image features as queries and the cross-attention in $M_{t}$ uses text features as queries. Stacking multiple layers of self-attention and DCA achieves complex inter-modal fusion. This can be expressed as

$$M_{i}(f_{i},f_{t})=\{i_{cls},i_{pat}\},\quad M_{t}(f_{t},f_{i})=\{t_{cls},t_{tok}\}. \tag{3}$$

Here, $i_{cls}$ represents the embedding of the image [CLS] token, $i_{pat}=\{i_{1},\dots,i_{N}\}$ represents the embeddings of $N$ image patches, $t_{cls}$ represents the embedding of the text [CLS] token, and $t_{tok}=\{t_{1},\dots,t_{M}\}$ represents the embeddings of $M$ text tokens.

2.2 Manipulation Detection

Manipulation detection involves two tasks: fine-grained manipulation type detection and binary classification. The $\rm DGM^{4}$ dataset [14] introduces two image manipulation methods: Face Swap (FS) and Face Attribute (FA), as well as two text manipulation methods: Text Swap (TS) and Text Attribute (TA). In HAMMER [14], the embedding of the [CLS] token is used for binary classification of the four manipulation types. However, this approach may suffer from modality competition [15], where one modality is learned well while the other is not fully explored. For instance, text information may dominate the [CLS] token, making it challenging to optimize the image part and leading to more imbalanced modality information in the [CLS] token. To address this issue and promote modal specificity, we introduce decoupled fine-grained classifiers (DFC) to independently guide visual and linguistic features. DFC consists of $C_{i}$ and $C_{t}$ , where $C_{i}$ outputs one of the three categories Real/FS/FA, and $C_{t}$ outputs one of the three categories Real/TS/TA. $i_{cls}$ and $t_{cls}$ are separately fed into $C_{i}$ and $C_{t}$ . The fine-grained classification loss is computed as follows:

$$\mathcal{L}_{mcls}=\mathcal{L}_{ce}(C_{i}(i_{cls}),y_{i})+\mathcal{L}_{ce}(C_{t}(t_{cls}),y_{t}), \tag{4}$$

where $\mathcal{L}_{ce}$ denotes the cross-entropy loss, $y_{i}$ represents the image’s fine-grained classification label, and $y_{t}$ represents the text’s fine-grained classification label. Additionally, we concatenate $i_{cls}$ and $t_{cls}$ and feed them to a binary classifier $C_{b}$ , computing the binary classification loss. Here, $y_{b}$ represents the binary classification label:

$$\mathcal{L}_{bcls}=\mathcal{L}_{ce}(C_{b}(\{i_{cls},t_{cls}\}),y_{b}). \tag{5}$$
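To make Eqs. (4) and (5) concrete, the sketch below evaluates both losses for a single image-text pair. The logits are hypothetical numbers standing in for the outputs of $C_{i}$ , $C_{t}$ , and $C_{b}$ , and the class orderings (Real first, and index 1 meaning "manipulated" for the binary head) are assumptions made for the example.

```python
import math


def cross_entropy(logits, label):
    """Cross-entropy of one sample: -log softmax(logits)[label]."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[label]


# Hypothetical classifier outputs (logits) for one image-text pair.
image_logits = [0.2, 2.5, -1.0]   # C_i(i_cls): Real / FS / FA
text_logits = [1.8, 0.1, -0.3]    # C_t(t_cls): Real / TS / TA
binary_logits = [-0.7, 1.4]       # C_b({i_cls, t_cls}): real / manipulated

# Assumed labels: face swapped, text untouched, so the pair is manipulated.
y_i, y_t, y_b = 1, 0, 1

# Eq. (4): each modality is supervised by its own three-way label.
loss_mcls = cross_entropy(image_logits, y_i) + cross_entropy(text_logits, y_t)
# Eq. (5): the binary head sees the concatenated [CLS] embeddings.
loss_bcls = cross_entropy(binary_logits, y_b)
```

Because the two terms of Eq. (4) depend on separate embeddings and separate labels, a well-fitted text branch cannot lower the loss of the image branch, which is the intent behind decoupling.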

2.3 Manipulation Grounding

Manipulation grounding involves two tasks: manipulated image grounding and manipulated text grounding. In manipulated image grounding, the goal is to output the coordinates of the manipulated region in the image. In manipulated text grounding, the objective is to classify each token in the text and determine if it is manipulated. Drawing inspiration from DETR [16] and MaskFormer [17], we propose an implicit manipulation query (IMQ) module, which consists of two components: I-IMQ and T-IMQ. The IMQ module utilizes learnable queries to adaptively aggregate intra-modal forged clues and emphasize modality-specific features. Taking the text in Fig. 1 as an example, there is a significant inconsistency between "MP", "underpants", and "award". T-IMQ can efficiently model relationships between tokens by leveraging implicit forgery features learned by queries during training. The image manipulation features $f_{im}$ and text manipulation features $f_{tm}$ are aggregated by image manipulation queries $q_{im}$ and text manipulation queries $q_{tm}$ , respectively:

$$\begin{aligned} f_{im} &= \text{Attention}(q_{im},i_{pat},i_{pat}), \\ f_{tm} &= \text{Attention}(q_{tm},t_{tok},t_{tok}). \end{aligned} \tag{6}$$

The aggregated feature $f_{im}$ is fed into a bounding-box detector $D_{i}$ to estimate the coordinates of the manipulated region. The manipulated image grounding loss is calculated by combining the L1 loss and GIoU loss [18], where $y_{mig}$ represents the manipulated image grounding label:

$$\mathcal{L}_{mig}=\mathcal{L}_{L1}(D_{i}(f_{im}),y_{mig})+\mathcal{L}_{GIoU}(D_{i}(f_{im}),y_{mig}). \tag{7}$$
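The two terms of Eq. (7) can be sketched for a single predicted box as follows. This is an illustrative sketch under stated assumptions rather than our training code: boxes are assumed to be in corner format $(x_{1},y_{1},x_{2},y_{2})$ with positive area, the L1 term is averaged over the four coordinates, and the GIoU loss is taken as $1-\text{GIoU}$ .

```python
def l1_loss(pred, target):
    """Mean absolute difference between the coordinates of two boxes."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def area(box):
    """Area of an (x1, y1, x2, y2) box; empty boxes have area 0."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def giou(a, b):
    """Generalized IoU of two positive-area corner-format boxes, in [-1, 1]."""
    inter = area((max(a[0], b[0]), max(a[1], b[1]),
                  min(a[2], b[2]), min(a[3], b[3])))
    union = area(a) + area(b) - inter
    # Smallest axis-aligned box enclosing both a and b.
    hull = area((min(a[0], b[0]), min(a[1], b[1]),
                 max(a[2], b[2]), max(a[3], b[3])))
    return inter / union - (hull - union) / hull


def grounding_loss(pred, target):
    """Eq. (7) for one sample: L1 term plus (1 - GIoU) term."""
    return l1_loss(pred, target) + (1.0 - giou(pred, target))


# Hypothetical prediction and label in normalized image coordinates.
predicted_box = (0.10, 0.20, 0.50, 0.60)
label_box = (0.15, 0.20, 0.55, 0.60)
loss_mig = grounding_loss(predicted_box, label_box)
```

Unlike plain IoU, the GIoU term still provides a signal when the predicted and labeled boxes do not overlap, since the enclosing-box penalty grows with their separation.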

Both $f_{tm}$ and $t_{tok}$ are projected to a lower dimension, and their inner product along the feature dimension predicts whether each token is manipulated. The manipulated text grounding loss is calculated using the cross-entropy loss, where $y_{mtg}=\{y_{i}\}^{M}_{i=1}$ and $y_{i}\in\{0,1\}$ denotes whether the $i$ -th token is manipulated or not:

$$\mathcal{L}_{mtg}=\mathcal{L}_{ce}(t_{tok}\times f_{tm}^{T},y_{mtg}). \tag{8}$$
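A minimal sketch of Eq. (8) is given below. Several details are assumptions made only for illustration: a single text manipulation query (so $f_{tm}$ is one vector), embeddings that have already been reduced in dimension, and a sigmoid-based binary cross-entropy as the per-token form of $\mathcal{L}_{ce}$ . The toy embeddings are made up.

```python
import math


def token_logits(t_tok, f_tm):
    """Inner product of each token embedding with the aggregated feature."""
    return [sum(a * b for a, b in zip(tok, f_tm)) for tok in t_tok]


def binary_cross_entropy(logits, labels):
    """Mean binary cross-entropy computed directly on raw logits."""
    total = 0.0
    for z, y in zip(logits, labels):
        # softplus(z) = log(1 + exp(z)), evaluated in a stable form.
        softplus = max(z, 0.0) + math.log1p(math.exp(-abs(z)))
        total += softplus - y * z
    return total / len(logits)


# Hypothetical reduced embeddings for M = 3 tokens and one query feature.
t_tok = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
f_tm = [2.0, -1.0]
y_mtg = [1, 0, 0]  # only the first token is manipulated

logits = token_logits(t_tok, f_tm)
loss_mtg = binary_cross_entropy(logits, y_mtg)
is_manipulated = [z > 0 for z in logits]
```

Tokens whose embeddings align with the aggregated manipulation feature receive positive logits and are flagged as manipulated.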

2.4 Loss Function

To obtain the final loss function, we combine the above components, where $\alpha$ , $\beta$ , and $\gamma$ are hyperparameters that control the relative importance of each loss term:

$$\mathcal{L}=\mathcal{L}_{bcls}+\alpha\mathcal{L}_{mcls}+\beta\mathcal{L}_{mig}+\gamma\mathcal{L}_{mtg}. \tag{9}$$
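As a worked instance of Eq. (9), the helper below combines the four terms using the coefficients reported in Section 3.1 ( $\alpha=1$ , $\beta=0.1$ , $\gamma=1$ ) as defaults. The individual loss values passed in are hypothetical and serve only to show the arithmetic.

```python
def total_loss(l_bcls, l_mcls, l_mig, l_mtg, alpha=1.0, beta=0.1, gamma=1.0):
    """Eq. (9): weighted sum of the detection and grounding losses."""
    return l_bcls + alpha * l_mcls + beta * l_mig + gamma * l_mtg


# Hypothetical per-batch loss values.
loss = total_loss(l_bcls=0.5, l_mcls=1.0, l_mig=2.0, l_mtg=0.25)
```

With these inputs the image grounding term contributes $0.1\times 2.0=0.2$ , so the total is $0.5+1.0+0.2+0.25=1.95$ .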

3 EXPERIMENTS

3.1 Implementation Details

The text content is padded or truncated to 50 tokens, while the images are resized to $256\times 256$ pixels. The uni-modal encoders are implemented by ViT-B/16 and RoBERTa. The modality interaction module is constructed using 6 transformer layers. The pre-training weights are derived from METER [19]. The binary classifier, fine-grained classifiers, and bbox detector are all implemented as multi-layer perceptrons. The coefficients of the loss function are set as $\alpha=1,\beta=0.1,\gamma=1$ . We utilize the AdamW [20] optimizer with a weight decay of 0.02. During the first 1000 steps, the learning rate is warmed up to 1e-4 and then decayed to 1e-6 using a cosine schedule.
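The learning-rate schedule described above can be sketched as a function of the training step. The warm-up length, peak, and final value follow the text; the linear shape of the warm-up, its start near zero, and the total number of steps are assumptions, since they are not specified here.

```python
import math


def learning_rate(step, total_steps, warmup_steps=1000, peak=1e-4, floor=1e-6):
    """Warm-up to `peak`, then cosine decay to `floor`.

    The linear warm-up shape and `total_steps` are assumptions of this sketch.
    """
    if step < warmup_steps:
        # Linear ramp that reaches `peak` on the last warm-up step.
        return peak * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * progress))


# Hypothetical run length of 50,000 steps.
schedule = [learning_rate(s, total_steps=50_000) for s in range(50_001)]
```

The schedule rises during the first 1000 steps, peaks at 1e-4, and then decreases monotonically to 1e-6 at the final step.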

Table 1: Results comparison with state-of-the-art methods and multi-modal learning methods on $\rm DGM^{4}$ .

Categories		Binary Cls			Multi-Label Cls			Image Grounding			Text Grounding
Methods	Params	AUC	EER	ACC	mAP	CF1	OF1	IoUmean	IoU50	IoU75	Precision	Recall	F1
CLIP [21]	-	83.22	24.61	76.40	66.00	59.52	62.31	49.51	50.03	38.79	58.12	22.11	32.03
ViLT [22]	-	85.16	22.88	78.38	72.37	66.14	66.00	59.32	65.18	48.10	66.48	49.88	57.00
HAMMER [14]	441M	93.19	14.10	86.39	86.22	79.37	80.37	76.45	83.75	76.06	75.01	68.02	71.35
Ours	328M	95.11	11.36	88.75	91.42	83.60	84.38	80.83	88.35	80.39	76.51	70.61	73.44

Table 2: Ablation study on DCA, DFC, and IMQ.

Methods	AUC	mAP	IoUmean	F1	Avg
Baseline	94.25	89.29	77.19	69.13	82.47
Ours w/o $\rm M_{t}$	93.73	80.66	81.58	46.58	75.64
Ours w/o DFC	94.94	90.13	80.43	73.51	84.75
Ours w/o I-IMQ	95.05	91.32	79.14	73.39	84.73
Ours w/o T-IMQ	95.20	90.66	80.75	73.09	84.93
Ours	95.11	91.42	80.84	73.44	85.20

3.2 Datasets and Evaluation Metrics

We conduct experiments on the $\rm DGM^{4}$ [14] dataset. $\rm DGM^{4}$ comprises a total of 230k news samples, including 77,426 original image-text pairs and 152,574 manipulated pairs.

We evaluate each method using a total of twelve metrics across four tasks. For binary classification, we evaluate accuracy (ACC), area under the receiver operating characteristic curve (AUC), and equal error rate (EER). For fine-grained classification, we evaluate mean average precision (mAP), average per-class F1 (CF1), and overall F1 (OF1). For manipulated image grounding, we evaluate the mean intersection over union (IoUmean) as well as the IoU at thresholds of 0.5 (IoU50) and 0.75 (IoU75). For manipulated text grounding, we evaluate precision, recall, and F1 score.
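The grounding metrics can be sketched as follows. This is an illustrative reading of the metric names, not the official evaluation script: IoU50 and IoU75 are assumed to be the percentage of samples whose IoU reaches the threshold (inclusive), and token-level precision, recall, and F1 are assumed to treat "manipulated" as the positive class. The example inputs are made up.

```python
def iou_metrics(ious):
    """IoUmean, IoU50, and IoU75 (in percent) from per-sample IoU values."""
    n = len(ious)
    return {
        "IoUmean": 100.0 * sum(ious) / n,
        "IoU50": 100.0 * sum(v >= 0.5 for v in ious) / n,
        "IoU75": 100.0 * sum(v >= 0.75 for v in ious) / n,
    }


def precision_recall_f1(pred, gold):
    """Token-level precision, recall, and F1 with label 1 as positive."""
    tp = sum(p == 1 and g == 1 for p, g in zip(pred, gold))
    fp = sum(p == 1 and g == 0 for p, g in zip(pred, gold))
    fn = sum(p == 0 and g == 1 for p, g in zip(pred, gold))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical per-sample IoUs and per-token predictions.
image_scores = iou_metrics([0.2, 0.6, 0.8, 1.0])
text_scores = precision_recall_f1(pred=[1, 1, 0, 0], gold=[1, 0, 1, 0])
```

As a consistency check on the harmonic-mean definition of F1, the precision and recall reported for our method in Table 1 (76.51 and 70.61) give $2\cdot 76.51\cdot 70.61/(76.51+70.61)\approx 73.44$ , which matches the reported F1.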

3.3 Comparison with State-of-the-Art Methods

In this section, we compare our method with state-of-the-art methods on the $\rm DGM^{4}$ dataset. The comparison results are shown in Table 1. Our method outperforms the state-of-the-art methods on the $\rm DGM^{4}$ dataset in all metrics. In particular, compared to the SOTA method, our approach achieves improvements of over 2 percentage points in important metrics such as ACC, mAP, IoUmean, and F1. These results indicate that our approach can effectively leverage modality-specific features and accurately model the correlations between images and text, leading to enhanced manipulation detection and grounding.

3.4 Ablation Study

To validate the importance of DCA, DFC, and IMQ in our model, we conduct a series of ablation studies. We create the baseline model by removing the DCA, DFC, and IMQ components. Specifically, we delete the $M_{i}$ branch, replace DFC with a multi-label classifier that makes predictions on concatenated features $\{i_{cls},t_{cls}\}$ , and remove IMQ, using MLPs instead to locate the manipulated regions of the image and text. By comparing the results to our full model, we can observe the effectiveness of our overall design. To verify the impact of each component, we perform ablation experiments by removing the corresponding component.

Removing $M_{t}$ (Ours w/o $\rm M_{t}$ ) degrades the average performance. This indicates that without DCA, image features dominate and text features are weakened. In addition, as mentioned in HAMMER [14], text manipulation detection and grounding are more difficult than their image counterparts. As a result, the text-related results drop significantly (e.g., the text grounding F1 falls from 73.44 to 46.58 in Table 2). Furthermore, replacing DFC with a multi-label classifier (Ours w/o DFC) results in performance degradation on the classification task. This finding highlights the importance of DFC in enhancing modality-specific feature mining, enabling more accurate identification of specific types of manipulation. Moreover, the full model with IMQ outperforms its ablated counterparts (Ours w/o I-IMQ, Ours w/o T-IMQ) on the corresponding tasks. This demonstrates the effectiveness of IMQ in aggregating forged features within each modality.

3.5 Visualization

We provide visualizations of the manipulation grounding in Fig. 2. In the second and third columns, we observe that HAMMER [14] may encounter interference from the text modality, leading to errors in image grounding. In contrast, our method successfully distinguishes fake faces from real faces while accurately identifying the manipulated token. This illustrates the effectiveness of our approach in leveraging modality-specific features to uncover forged clues.

4 CONCLUSION

In this paper, we construct a simple and novel transformer-based framework for manipulation detection and grounding. We introduce dual-branch cross-attention and decoupled fine-grained classifiers to effectively model cross-modal correlations and exploit modality-specific features. An implicit manipulation query is proposed to improve the discovery of forged clues. Experimental results on the $\rm DGM^{4}$ dataset show that our proposed approach outperforms existing methods.

5 ACKNOWLEDGEMENT

This work is supported by the National Natural Science Foundation of China (Grant No. 62121002).

References

[1] Brian Dolhansky, Joanna Bitton, Ben Pflaum, Jikuo Lu, Russ Howes, Menglin Wang, and Cristian Canton Ferrer, “The deepfake detection challenge (dfdc) dataset,” arXiv preprint arXiv:2006.07397, 2020.
[2] Rowan Zellers, Ari Holtzman, Hannah Rashkin, Yonatan Bisk, Ali Farhadi, Franziska Roesner, and Yejin Choi, “Defending against neural fake news,” NeurIPS, vol. 32, 2019.
[3] Changtao Miao, Qi Chu, Weihai Li, Tao Gong, Wanyi Zhuang, and Nenghai Yu, “Towards generalizable and robust face manipulation detection via bag-of-feature,” in VCIP. IEEE, 2021, pp. 1–5.
[4] Changtao Miao, Qi Chu, Weihai Li, Suichan Li, Zhentao Tan, Wanyi Zhuang, and Nenghai Yu, “Learning forgery region-aware and id-independent features for face manipulation detection,” T-BIOM, vol. 4, no. 1, pp. 71–84, 2022.
[5] Wanyi Zhuang, Qi Chu, Zhentao Tan, Qiankun Liu, Haojie Yuan, Changtao Miao, Zixiang Luo, and Nenghai Yu, “Uia-vit: Unsupervised inconsistency-aware method based on vision transformer for face forgery detection,” in ECCV. Springer, 2022, pp. 391–407.
[6] Changtao Miao, Zichang Tan, Qi Chu, Nenghai Yu, and Guodong Guo, “Hierarchical frequency-assisted interactive networks for face manipulation detection,” TIFS, vol. 17, pp. 3008–3021, 2022.
[7] Changtao Miao, Zichang Tan, Qi Chu, Huan Liu, Honggang Hu, and Nenghai Yu, “F2Trans: High-frequency fine-grained transformer for face forgery detection,” TIFS, vol. 18, pp. 1039–1051, 2023.
[8] Wanyi Zhuang, Qi Chu, Haojie Yuan, Changtao Miao, Bin Liu, and Nenghai Yu, “Towards intrinsic common discriminative features learning for face forgery detection using adversarial learning,” in ICME. IEEE, 2022, pp. 1–6.
[9] Zichang Tan, Zhichao Yang, Changtao Miao, and Guodong Guo, “Transformer-based feature compensation and aggregation for deepfake detection,” SPL, vol. 29, pp. 2183–2187, 2022.
[10] Sebastian Gehrmann, Hendrik Strobelt, and Alexander M Rush, “Gltr: Statistical detection and visualization of generated text,” arXiv preprint arXiv:1906.04043, 2019.
[11] Zimian Wei, Hengyue Pan, Linbo Qiao, Xin Niu, Peijie Dong, and Dongsheng Li, “Cross-modal knowledge distillation in multi-modal fake news detection,” in ICASSP. IEEE, 2022, pp. 4733–4737.
[12] Ting Zou, Zhong Qian, Peifeng Li, and Qiaoming Zhu, “Cross-modal adversarial contrastive learning for multi-modal rumor detection,” in ICASSP. IEEE, 2023, pp. 1–5.
[13] Qichao Ying, Xiaoxiao Hu, Yangming Zhou, Zhenxing Qian, Dan Zeng, and Shiming Ge, “Bootstrapping multi-view representations for fake news detection,” in AAAI, 2023.
[14] Rui Shao, Tianxing Wu, and Ziwei Liu, “Detecting and grounding multi-modal media manipulation,” in CVPR, 2023, pp. 6904–6913.
[15] Yu Huang, Junyang Lin, Chang Zhou, Hongxia Yang, and Longbo Huang, “Modality competition: What makes joint training of multi-modal network fail in deep learning?(provably),” in ICML. PMLR, 2022, pp. 9226–9259.
[16] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko, “End-to-end object detection with transformers,” in ECCV. Springer, 2020, pp. 213–229.
[17] Bowen Cheng, Alex Schwing, and Alexander Kirillov, “Per-pixel classification is not all you need for semantic segmentation,” NeurIPS, vol. 34, pp. 17864–17875, 2021.
[18] Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, and Silvio Savarese, “Generalized intersection over union: A metric and a loss for bounding box regression,” in CVPR, 2019, pp. 658–666.
[19] Zi-Yi Dou, Yichong Xu, Zhe Gan, Jianfeng Wang, Shuohang Wang, Lijuan Wang, Chenguang Zhu, Pengchuan Zhang, Lu Yuan, Nanyun Peng, et al., “An empirical study of training end-to-end vision-and-language transformers,” in CVPR, 2022, pp. 18166–18176.
[20] Ilya Loshchilov and Frank Hutter, “Decoupled weight decay regularization,” arXiv preprint arXiv:1711.05101, 2017.
[21] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al., “Learning transferable visual models from natural language supervision,” in ICML. PMLR, 2021, pp. 8748–8763.
[22] Wonjae Kim, Bokyung Son, and Ildoo Kim, “Vilt: Vision-and-language transformer without convolution or region supervision,” in ICML. PMLR, 2021, pp. 5583–5594.

Supplementary Material

Appendix A Intra-domain and inter-domain comparison.

Table 3: Intra-domain and inter-domain comparison on $\rm DGM^{4}$ and Fakeddit.

Datasets	$\rm DGM^{4}$	Fakeddit	Overall
Methods	AUC	AUC	Avg
HAMMER [14]	93.19	62.81	78.00
Ours	95.11	64.26	79.78

The field of multi-modal media manipulation detection and grounding currently only has the $\rm DGM^{4}$ dataset. To verify the effectiveness of our method on other datasets, we select the multi-modal fake news dataset Fakeddit, which has similar tasks and a large amount of data. We train the model on the $\rm DGM^{4}$ dataset and test it on the Fakeddit dataset. As shown in Table 3, the AUC of our model is higher than that of HAMMER [14], demonstrating the generalization ability of our method.