License: arXiv.org perpetual non-exclusive license
arXiv:2309.12657v2 [cs.CV] 13 Jan 2024

Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding

Abstract

AI-synthesized text and images have gained significant attention, particularly due to the widespread dissemination of multi-modal manipulations on the internet, which has resulted in numerous negative impacts on society. Existing methods for multi-modal manipulation detection and grounding primarily focus on fusing vision-language features to make predictions, while overlooking the importance of modality-specific features, leading to sub-optimal results. In this paper, we construct a simple and novel transformer-based framework for multi-modal manipulation detection and grounding tasks. Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment. To achieve this, we introduce visual/language pre-trained encoders and dual-branch cross-attention (DCA) to extract and fuse modality-unique features. Furthermore, we design decoupled fine-grained classifiers (DFC) to enhance modality-specific feature mining and mitigate modality competition. Moreover, we propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality using learnable queries, thereby improving the discovery of forged details. Extensive experiments on the DGM$^4$ dataset demonstrate the superior performance of our proposed model compared to state-of-the-art approaches.

Index Terms—  multi-modal, media manipulation, transformer, modality-specific

1 Introduction

Fig. 1: The overall architecture of our framework. 1) Image and text features are extracted and fused through the uni-modal encoders $E_i$, $E_t$ and the modality interaction modules $M_i$, $M_t$. 2) The decoupled fine-grained classifiers $C_i$, $C_t$ and the binary classifier $C_b$ take the image embedding $i_{cls}$, the text embedding $t_{cls}$, and the concatenated embeddings $\{i_{cls}, t_{cls}\}$ as inputs, respectively. 3) The image embeddings $i_{pat}$ and text embeddings $t_{tok}$ are separately fed into the implicit manipulation query module and grounding heads.

The rapid development of deep generative models and large language models has facilitated the easy generation of massive amounts of fake facial images, videos, and synthetic text. These deepfake products [1, 2] have the potential to spread widely on social media. Consequently, this threat has garnered significant attention in the fields of computer vision and natural language processing, leading to the proposal of various methods for detecting fake faces and AI-generated text [3, 4, 5, 6, 7, 8, 9, 10]. However, these methods often focus on a single modality. Yet, in everyday life, multi-modal media content in the form of image-text pairs is more prevalent. As a result, multi-modal fake content is more likely to spread widely and cause social problems.

Previous research on multi-modal fake news primarily concentrates on the binary classification problem of determining news authenticity. CMC [11] leverages knowledge distillation to capture cross-modal feature correlations during training. CMAC [12] combines adversarial learning and contrastive learning to obtain multi-modal fused representations with modality invariance and clear class distributions. However, these methods [11, 12, 13] cannot determine the specific type of manipulation or localize manipulated positions, which limits their applicability and interpretability. HAMMER [14] constructs the first dataset for the detection and grounding of multi-modal manipulated content. Furthermore, it proposes a contrastive learning-based approach for modality alignment as well as shallow and deep manipulation reasoning. However, this approach overlooks the importance of modality-specific features, potentially leading to underutilization of the rich information present in each modality and resulting in sub-optimal performance.

In order to exploit modality-specific features, we construct a simple and novel transformer-based framework for multi-modal manipulation detection and grounding. We utilize visual/language pre-trained uni-modal encoders to extract modality-unique features. These features are then fused by dual-branch cross-attention (DCA) to summarize multi-modal information while preserving the individual characteristics of each modality. Furthermore, we introduce decoupled fine-grained classifiers (DFC) to mitigate modality competition, in which one modality is learned well while the other is not fully explored. Additionally, we propose an implicit manipulation query (IMQ) that facilitates reasoning between learnable queries and the global context of images or text, enabling the model to discover potential forged clues.

The main contributions of our paper are as follows: (1) We construct a simple and novel transformer-based framework for multi-modal manipulation detection and grounding. (2) We introduce DCA to fuse modality-unique features while maintaining uni-modal characteristics. Additionally, we design DFC and IMQ to promote comprehensive exploration of each modality and enhance intra-modal interactions, respectively. (3) We conduct experiments on the DGM$^4$ dataset, and the results demonstrate the superiority and effectiveness of our approach.

2 METHODOLOGY

2.1 Feature Extraction

In order to capture modality-specific features for images and text, we employ two visual/language pre-trained uni-modal feature extractors, denoted as $E_i$ for images and $E_t$ for text. As illustrated in Figure 1, given an image-text pair $(I, T)$, we divide the input image $I$ into $N$ patches and insert a [CLS] token. Similarly, we segment the input text $T$ into $M$ tokens and insert a [CLS] token. The image features $f_i$ and text features $f_t$ are separately extracted by $E_i$ and $E_t$:

$$f_i = E_i(I), \quad f_t = E_t(T). \tag{1}$$

In a single modality, manipulation cues are often subtle and not easily discernible. However, in multi-modal manipulation content, there can be distinctive information that differs between the modalities. To effectively reason about the correlations between images and text, a deep level of interaction and fusion between their features is crucial. We employ a dual-branch cross-attention (DCA) mechanism, as depicted in Figure 1, to guide the interaction between image features $f_i$ and text features $f_t$. Unlike HAMMER [14], which uses a single-stream interaction in which text features serve as queries while image features remain unchanged, DCA allows each embedding to capture modality-contextual information while enabling deep interactions with information from the other modality.

The attention function is applied to query ($Q$), key ($K$), and value ($V$) features, each with a hidden size of $D$, as follows:

$$\text{Attention}(Q, K, V) = \text{Softmax}(QK^{T}/\sqrt{D})\,V. \tag{2}$$
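As a concrete reading of Eq. (2), the following is a minimal single-head sketch in plain Python. It omits the learned projections, multiple heads, and masking that a real transformer layer would have, and the function names are illustrative rather than taken from the authors' code.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention.

    Q: n_q x D, K: n_k x D, V: n_k x D_v (lists of lists).
    Returns an n_q x D_v list of lists.
    """
    D = len(Q[0])
    out = []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(D).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(D) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value rows.
        out.append([sum(w * v[c] for w, v in zip(weights, V))
                    for c in range(len(V[0]))])
    return out

# A zero query scores every key equally, so the output is the mean of V.
result = attention([[0.0, 0.0]],
                   [[1.0, 0.0], [0.0, 1.0]],
                   [[1.0, 2.0], [3.0, 4.0]])
```

With a zero query the softmax weights are uniform, so `result` is the average of the two value rows.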

In DCA, given queries from one modality (e.g., image), keys and values are taken only from the other modality (e.g., text). We denote the two modality interaction modules as $M_i$ and $M_t$, where the cross-attention in $M_i$ uses image features as queries and the cross-attention in $M_t$ uses text features as queries. Multiple layers of self-attention and DCA are combined to achieve complex inter-modal fusion. This can be expressed as

$$M_i(f_i, f_t) = \{i_{cls}, i_{pat}\}, \quad M_t(f_t, f_i) = \{t_{cls}, t_{tok}\}. \tag{3}$$

Here, $i_{cls}$ represents the embedding of the image [CLS] token, $i_{pat} = \{i_1, \dots, i_N\}$ represents the embeddings of the $N$ image patches, $t_{cls}$ represents the embedding of the text [CLS] token, and $t_{tok} = \{t_1, \dots, t_M\}$ represents the embeddings of the $M$ text tokens.
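The dual-branch idea can be illustrated with a heavily simplified sketch of one DCA step. The assumptions here are not from the paper: there are no learned projections, self-attention, normalisation, or feed-forward sublayers, the residual connection is assumed, and row 0 of each sequence is assumed to hold the [CLS] embedding. The name `dca_step` is hypothetical.

```python
import math

def attention(Q, K, V):
    """Scaled dot-product attention on lists of lists (Eq. 2)."""
    D = len(Q[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(D) for k in K]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        out.append([sum(e / total * v[c] for e, v in zip(exps, V))
                    for c in range(len(V[0]))])
    return out

def add(X, Y):
    """Element-wise sum of two equally shaped matrices (assumed residual)."""
    return [[a + b for a, b in zip(x, y)] for x, y in zip(X, Y)]

def dca_step(f_i, f_t):
    """One simplified dual-branch cross-attention step."""
    # Image branch (M_i): image queries, text keys and values.
    i_out = add(f_i, attention(f_i, f_t, f_t))
    # Text branch (M_t): text queries, image keys and values.
    t_out = add(f_t, attention(f_t, f_i, f_i))
    return i_out, t_out

# Toy inputs: row 0 is assumed to be the [CLS] position.
f_i = [[0.1, 0.2], [0.3, 0.1], [0.0, 0.5]]   # [CLS] + N = 2 patches
f_t = [[0.2, 0.0], [0.4, 0.4]]               # [CLS] + M = 1 token
i_out, t_out = dca_step(f_i, f_t)
i_cls, i_pat = i_out[0], i_out[1:]
t_cls, t_tok = t_out[0], t_out[1:]
```

Both branches read the same inputs, so each modality keeps its own sequence length and its own stream of embeddings while still attending to the other modality.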

2.2 Manipulation Detection

Manipulation detection involves two tasks: fine-grained manipulation type detection and binary classification. The DGM$^4$ dataset [14] introduces two image manipulation methods, Face Swap (FS) and Face Attribute (FA), as well as two text manipulation methods, Text Swap (TS) and Text Attribute (TA). In HAMMER [14], the embedding of the [CLS] token is used for binary classification of the four manipulation types. However, this approach may suffer from modality competition [15], where one modality is learned well while the other is not fully explored. For instance, text information may dominate the [CLS] token, making it challenging to optimize the image part and leading to more imbalanced modality information in the [CLS] token. To address this issue and promote modality specificity, we introduce decoupled fine-grained classifiers (DFC) to independently guide visual and linguistic features. DFC consists of $C_i$ and $C_t$, where $C_i$ outputs one of the three categories Real/FS/FA, and $C_t$ outputs one of the three categories Real/TS/TA. The embeddings $i_{cls}$ and $t_{cls}$ are separately fed into $C_i$ and $C_t$. The fine-grained classification loss is computed as follows:

$$\mathcal{L}_{mcls} = \mathcal{L}_{ce}(C_i(i_{cls}), y_i) + \mathcal{L}_{ce}(C_t(t_{cls}), y_t), \tag{4}$$

where $\mathcal{L}_{ce}$ denotes the cross-entropy loss, $y_i$ represents the image's fine-grained classification label, and $y_t$ represents the text's fine-grained classification label. Additionally, we concatenate $i_{cls}$ and $t_{cls}$ and feed them to a binary classifier $C_b$, computing the binary classification loss. Here, $y_b$ represents the binary classification label:

$$\mathcal{L}_{bcls} = \mathcal{L}_{ce}(C_b(\{i_{cls}, t_{cls}\}), y_b). \tag{5}$$
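The two detection losses in Eqs. (4) and (5) can be sketched for a single sample as follows. For brevity each classifier is a single linear layer, which is an assumption standing in for the multi-layer perceptrons used in the paper, and the helper names are hypothetical.

```python
import math

def linear(x, W, b):
    """Single linear layer: one logit per row of W."""
    return [sum(w * v for w, v in zip(row, x)) + bias
            for row, bias in zip(W, b)]

def cross_entropy(logits, label):
    """Cross-entropy of one sample from raw logits and an integer label."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[label]

def detection_losses(i_cls, t_cls, y_i, y_t, y_b, C_i, C_t, C_b):
    """Return (L_mcls, L_bcls) following Eqs. (4) and (5).

    C_i, C_t, C_b are (W, b) pairs; linear classifiers are an assumption.
    """
    # Decoupled fine-grained classifiers: each sees only its own modality.
    l_mcls = (cross_entropy(linear(i_cls, *C_i), y_i)
              + cross_entropy(linear(t_cls, *C_t), y_t))
    # Binary classifier: sees the concatenation of both [CLS] embeddings.
    l_bcls = cross_entropy(linear(i_cls + t_cls, *C_b), y_b)
    return l_mcls, l_bcls

d = 2
zeros3 = ([[0.0] * d for _ in range(3)], [0.0] * 3)        # Real/FS/FA or Real/TS/TA
zeros2 = ([[0.0] * (2 * d) for _ in range(2)], [0.0] * 2)  # real vs. manipulated
l_mcls, l_bcls = detection_losses([0.5, -0.5], [1.0, 2.0], 1, 0, 1,
                                  zeros3, zeros3, zeros2)
```

With all-zero weights every class is equally likely, so each three-way term equals ln 3 and the binary term equals ln 2, which gives a quick sanity check on the implementation.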

2.3 Manipulation Grounding

Manipulation grounding involves two tasks: manipulated image grounding and manipulated text grounding. In manipulated image grounding, the goal is to output the coordinates of the manipulated region in the image. In manipulated text grounding, the objective is to classify each token in the text and determine whether it is manipulated. Drawing inspiration from DETR [16] and MaskFormer [17], we propose an implicit manipulation query (IMQ) module, which consists of two components: I-IMQ and T-IMQ. The IMQ module utilizes learnable queries to adaptively aggregate intra-modal forged clues and emphasize modality-specific features. Taking the text in Fig. 1 as an example, there is a significant inconsistency between "MP", "underpants", and "award". T-IMQ can efficiently model relationships between tokens by leveraging implicit forgery features learned by the queries during training. The image manipulation features $f_{im}$ and text manipulation features $f_{tm}$ are aggregated by image manipulation queries $q_{im}$ and text manipulation queries $q_{tm}$, respectively:

$$\begin{aligned}
f_{im} &= \text{Attention}(q_{im}, i_{pat}, i_{pat}), \\
f_{tm} &= \text{Attention}(q_{tm}, t_{tok}, t_{tok}).
\end{aligned} \tag{6}$$

The feature $f_{im}$ is fed into a bounding-box detector $D_i$ to estimate the coordinates of the manipulated region. The manipulated image grounding loss is calculated by combining the L1 loss and the GIoU loss [18], where $y_{mig}$ represents the manipulated image grounding label:

$$\mathcal{L}_{mig} = \mathcal{L}_{L1}(D_i(f_{im}) - y_{mig}) + \mathcal{L}_{GIoU}(D_i(f_{im}) - y_{mig}). \tag{7}$$
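To make the two terms of Eq. (7) concrete, the sketch below computes an L1 term and a GIoU term for a single predicted box. It assumes boxes are given as corner coordinates `(x1, y1, x2, y2)` with positive area and that the GIoU loss is taken as `1 - GIoU`, as in [18]. The box parameterisation and any per-term weighting in the authors' implementation are not stated in the text, so those choices are illustrative.

```python
def l1_loss(pred, target):
    """Sum of absolute coordinate differences."""
    return sum(abs(p - t) for p, t in zip(pred, target))

def area(box):
    """Area of an (x1, y1, x2, y2) box; empty boxes have area 0."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)

def giou(a, b):
    """Generalized IoU of two non-degenerate corner-format boxes."""
    inter = area((max(a[0], b[0]), max(a[1], b[1]),
                  min(a[2], b[2]), min(a[3], b[3])))
    union = area(a) + area(b) - inter
    # Smallest axis-aligned box enclosing both a and b.
    hull = area((min(a[0], b[0]), min(a[1], b[1]),
                 max(a[2], b[2]), max(a[3], b[3])))
    return inter / union - (hull - union) / hull

def grounding_loss(pred, target):
    """Unweighted sum of the L1 term and the GIoU loss (1 - GIoU)."""
    return l1_loss(pred, target) + (1.0 - giou(pred, target))

same = grounding_loss((0.0, 0.0, 1.0, 1.0), (0.0, 0.0, 1.0, 1.0))
apart = giou((0.0, 0.0, 1.0, 1.0), (2.0, 0.0, 3.0, 1.0))
```

Unlike plain IoU, GIoU stays informative for non-overlapping boxes: the two unit boxes above do not intersect, yet `apart` is negative rather than zero, so the loss still reflects how far apart they are.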

The features $f_{tm}$ and $t_{tok}$ are dimensionally reduced, and an inner product is taken along the feature dimension to predict whether each token is manipulated. The manipulated text grounding loss is calculated using the cross-entropy loss, where $y_{mtg} = \{y_i\}_{i=1}^{M}$ and $y_i \in \{0, 1\}$ denotes whether the $i$-th token is manipulated or not:

$$\mathcal{L}_{mtg} = \mathcal{L}_{ce}(t_{tok} \times f_{tm}^{T}, y_{mtg}). \tag{8}$$
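The text branch of Eqs. (6) and (8) can be sketched end to end as below. Several details are assumptions rather than statements from the paper: two learnable queries are used (one per class, so the inner product yields two logits per token), the dimension-reduction layers are skipped, and the loss is averaged over tokens.

```python
import math

def attention(Q, K, V):
    """Scaled dot-product attention on lists of lists (Eq. 2)."""
    D = len(Q[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(D) for k in K]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        out.append([sum(e / total * v[c] for e, v in zip(exps, V))
                    for c in range(len(V[0]))])
    return out

def cross_entropy(logits, label):
    """Cross-entropy of one sample from raw logits and an integer label."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[label]

def text_grounding_loss(q_tm, t_tok, y_mtg):
    """T-IMQ aggregation (Eq. 6) followed by the token loss (Eq. 8)."""
    # Queries aggregate global context from all token embeddings: 2 x d.
    f_tm = attention(q_tm, t_tok, t_tok)
    # t_tok x f_tm^T: one row of two logits per token (M x 2).
    logits = [[sum(a * b for a, b in zip(tok, f)) for f in f_tm]
              for tok in t_tok]
    losses = [cross_entropy(row, y) for row, y in zip(logits, y_mtg)]
    return sum(losses) / len(losses)

t_tok = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # M = 3 token embeddings
q_tm = [[0.0, 0.0], [0.0, 0.0]]                # untrained (zero) queries
loss = text_grounding_loss(q_tm, t_tok, [0, 1, 0])
```

With identical zero queries both aggregated features coincide, so every token gets two equal logits and the loss is ln 2. Training the queries is what lets the two rows of `f_tm` diverge and separate manipulated tokens from clean ones.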

2.4 Loss function

To obtain the final loss function, we combine the above components, where $\alpha$, $\beta$, and $\gamma$ are hyperparameters that control the relative importance of each loss term:

$$\mathcal{L} = \mathcal{L}_{bcls} + \alpha \mathcal{L}_{mcls} + \beta \mathcal{L}_{mig} + \gamma \mathcal{L}_{mtg}. \tag{9}$$
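Eq. (9) is a plain weighted sum, sketched below. The default weights are the coefficients reported in Section 3.1 ($\alpha = 1$, $\beta = 0.1$, $\gamma = 1$). The function name and the example loss values are made up for illustration.

```python
def total_loss(l_bcls, l_mcls, l_mig, l_mtg, alpha=1.0, beta=0.1, gamma=1.0):
    """Weighted sum of the four task losses (Eq. 9)."""
    return l_bcls + alpha * l_mcls + beta * l_mig + gamma * l_mtg

# Hypothetical per-batch loss values, only to show the weighting.
loss = total_loss(l_bcls=0.5, l_mcls=1.0, l_mig=2.0, l_mtg=0.25)
```

Here the image grounding term contributes only 0.2 to the total despite being the largest raw value, which shows how $\beta = 0.1$ scales that term down relative to the classification terms.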

3 EXPERIMENT

3.1 Implementation details

The length of the text content is padded or truncated to 50 tokens, while the images are resized to 256×256 pixels. The uni-modal encoders are implemented by ViT-B/16 and RoBERTa. The modality interaction module is constructed using 6 transformer layers. The pre-training weights are derived from METER [19]. The binary classifier, fine-grained classifier, and bbox detector are all implemented using multi-layer perceptron layers. The coefficients of the loss function are set as $\alpha = 1$, $\beta = 0.1$, $\gamma = 1$. We utilize the AdamW [20] optimizer with a weight decay of 0.02. During the first 1000 steps, the learning rate is warmed up to 1e-4 and then decayed to 1e-6 using a cosine schedule.
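The learning-rate schedule described above can be sketched as a function of the training step. The linear shape of the warm-up and the value of `total_steps` are assumptions, since the text gives only the warm-up length, the peak rate, and the final rate.

```python
import math

def learning_rate(step, total_steps, warmup_steps=1000,
                  peak=1e-4, floor=1e-6):
    """Warm-up to `peak`, then cosine decay to `floor`."""
    if step < warmup_steps:
        # Assumed linear warm-up from 0 to the peak rate.
        return peak * step / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * progress))

TOTAL = 10_000  # hypothetical number of training steps
schedule = [learning_rate(s, TOTAL) for s in (0, 500, 1000, 5500, TOTAL)]
```

The rate rises to 1e-4 at step 1000, passes through the midpoint between peak and floor halfway through the decay phase, and ends at 1e-6.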

Table 1: Comparison with state-of-the-art methods and multi-modal learning methods on DGM$^4$. Column groups: Binary Cls (AUC, EER, ACC); Multi-Label Cls (mAP, CF1, OF1); Image Grounding (IoUmean, IoU50, IoU75); Text Grounding (Precision, Recall, F1).

| Methods | Params | AUC | EER | ACC | mAP | CF1 | OF1 | IoUmean | IoU50 | IoU75 | Precision | Recall | F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CLIP [21] | - | 83.22 | 24.61 | 76.40 | 66.00 | 59.52 | 62.31 | 49.51 | 50.03 | 38.79 | 58.12 | 22.11 | 32.03 |
| ViLT [22] | - | 85.16 | 22.88 | 78.38 | 72.37 | 66.14 | 66.00 | 59.32 | 65.18 | 48.10 | 66.48 | 49.88 | 57.00 |
| HAMMER [14] | 441M | 93.19 | 14.10 | 86.39 | 86.22 | 79.37 | 80.37 | 76.45 | 83.75 | 76.06 | 75.01 | 68.02 | 71.35 |
| Ours | 328M | 95.11 | 11.36 | 88.75 | 91.42 | 83.60 | 84.38 | 80.83 | 88.35 | 80.39 | 76.51 | 70.61 | 73.44 |

Table 2: Ablation study on DCA, DFC, and IMQ.

| Methods | AUC | mAP | IoUmean | F1 | Avg |
|---|---|---|---|---|---|
| Baseline | 94.25 | 89.29 | 77.19 | 69.13 | 82.47 |
| Ours w/o $M_t$ | 93.73 | 80.66 | 81.58 | 46.58 | 75.64 |
| Ours w/o DFC | 94.94 | 90.13 | 80.43 | 73.51 | 84.75 |
| Ours w/o I-IMQ | 95.05 | 91.32 | 79.14 | 73.39 | 84.73 |
| Ours w/o T-IMQ | 95.20 | 90.66 | 80.75 | 73.09 | 84.93 |
| Ours | 95.11 | 91.42 | 80.84 | 73.44 | 85.20 |

3.2 Datasets and Evaluation Metrics

We conduct experiments on the DGM$^4$ [14] dataset. DGM$^4$ comprises a total of 230k news samples, including 77,426 original image-text pairs and 152,574 manipulated pairs.

We evaluate each method using a total of twelve metrics across four tasks. For binary classification, we evaluate accuracy (ACC), area under the receiver operating characteristic curve (AUC), and equal error rate (EER). For fine-grained classification, we evaluate mean average precision (mAP), average per-class F1 (CF1), and overall F1 (OF1). For manipulated image grounding, we evaluate the mean intersection over union (IoUmean) as well as the IoU at thresholds of 0.5 (IoU50) and 0.75 (IoU75). For manipulated text grounding, we evaluate precision, recall, and F1 score.
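As an illustration of the text grounding metrics, the sketch below computes token-level precision, recall, and F1 from binary predictions. Whether the benchmark pools tokens over the whole test set or averages per sample is not stated here, so pooling a single flat list of tokens is an assumption.

```python
def precision_recall_f1(pred, gold):
    """Token-level precision, recall, and F1 for binary labels (1 = manipulated)."""
    tp = sum(1 for p, g in zip(pred, gold) if p == 1 and g == 1)
    fp = sum(1 for p, g in zip(pred, gold) if p == 1 and g == 0)
    fn = sum(1 for p, g in zip(pred, gold) if p == 0 and g == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# One true positive, one false positive, one false negative.
p, r, f1 = precision_recall_f1([1, 1, 0, 0], [1, 0, 1, 0])
```

F1 is the harmonic mean of precision and recall, so a method with high precision but low recall, such as CLIP in Table 1, still receives a low F1.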

3.3 Comparison with the state-of-the-art methods

In this section, we compare our method with state-of-the-art methods on the DGM$^4$ dataset. The results are shown in Table 1. Our method outperforms the state-of-the-art methods on the DGM$^4$ dataset in all metrics. In particular, compared to the previous state-of-the-art method, HAMMER [14], our approach achieves improvements of over 2 percentage points in important metrics such as ACC, mAP, IoUmean, and F1. These results indicate that our approach can effectively leverage modality-specific features and accurately model the correlations between images and text, leading to enhanced manipulation detection and grounding.

3.4 Ablation Study

To validate the importance of the DCA, DFC, and IMQ in our model, we conduct a series of ablation studies. We create the baseline model by removing the DCA, DFC, and IMQ components. Specifically, we delete the $M_i$ branch, replace DFC with a multi-label classifier that makes predictions on the concatenated features $\{i_{cls}, t_{cls}\}$, and remove IMQ while using an MLP to locate the manipulated regions of the image and text. By comparing the results to our full model, we can observe the effectiveness of our overall design. To verify the impact of each component, we perform ablation experiments by removing the corresponding component.

Removing $M_t$ (Ours w/o $M_t$) lowers the average performance from 85.20 to 75.64. This indicates that, without DCA, image features dominate and text features are weakened. In addition, as noted in HAMMER [14], text manipulation detection and grounding are more difficult than their image counterparts. As a result, the text-related results drop significantly, with the text grounding F1 falling from 73.44 to 46.58. Furthermore, replacing DFC with a multi-label classifier (Ours w/o DFC) results in performance degradation on the classification task. This finding highlights the importance of DFC in enhancing modality-specific feature mining, enabling more accurate identification of specific types of manipulation. Moreover, the full model with IMQ outperforms its ablated counterparts (Ours w/o I-IMQ, Ours w/o T-IMQ) on the corresponding tasks. This demonstrates the effectiveness of IMQ in aggregating forged features within each modality.

Fig. 2: Visualization of manipulation grounding results. Ground truths are in red, and predictions are in blue. The top three examples are from HAMMER [14], and the bottom three are from our model.

3.5 Visualization

We provide visualizations of the manipulation grounding results in Fig. 2. In the second and third columns, we observe that HAMMER [14] may encounter interference from the text modality, leading to errors in image grounding. In contrast, our method successfully distinguishes fake faces from real faces while accurately identifying the manipulated token. This illustrates the effectiveness of our approach in leveraging modality-specific features to uncover forged clues.

4 CONCLUSION

In this paper, we construct a simple and novel transformer-based framework for manipulation detection and grounding. We introduce dual-branch cross-attention and decoupled fine-grained classifiers to effectively model cross-modal correlations and exploit modality-specific features. An implicit manipulation query is proposed to improve the discovery of forged clues. Experimental results on the DGM$^4$ dataset show that our proposed approach outperforms existing methods.

5 ACKNOWLEDGEMENT

This work is supported by the National Natural Science Foundation of China (Grant No. 62121002).

References

  • [1] Brian Dolhansky, Joanna Bitton, Ben Pflaum, Jikuo Lu, Russ Howes, Menglin Wang, and Cristian Canton Ferrer, “The deepfake detection challenge (dfdc) dataset,” arXiv preprint arXiv:2006.07397, 2020.
  • [2] Rowan Zellers, Ari Holtzman, Hannah Rashkin, Yonatan Bisk, Ali Farhadi, Franziska Roesner, and Yejin Choi, “Defending against neural fake news,” NeurIPS, vol. 32, 2019.
  • [3] Changtao Miao, Qi Chu, Weihai Li, Tao Gong, Wanyi Zhuang, and Nenghai Yu, “Towards generalizable and robust face manipulation detection via bag-of-feature,” in VCIP. IEEE, 2021, pp. 1–5.
  • [4] Changtao Miao, Qi Chu, Weihai Li, Suichan Li, Zhentao Tan, Wanyi Zhuang, and Nenghai Yu, “Learning forgery region-aware and id-independent features for face manipulation detection,” T-BIOM, vol. 4, no. 1, pp. 71–84, 2022.
  • [5] Wanyi Zhuang, Qi Chu, Zhentao Tan, Qiankun Liu, Haojie Yuan, Changtao Miao, Zixiang Luo, and Nenghai Yu, “Uia-vit: Unsupervised inconsistency-aware method based on vision transformer for face forgery detection,” in ECCV. Springer, 2022, pp. 391–407.
  • [6] Changtao Miao, Zichang Tan, Qi Chu, Nenghai Yu, and Guodong Guo, “Hierarchical frequency-assisted interactive networks for face manipulation detection,” TIFS, vol. 17, pp. 3008–3021, 2022.
  • [7] Changtao Miao, Zichang Tan, Qi Chu, Huan Liu, Honggang Hu, and Nenghai Yu, “F2Trans: High-frequency fine-grained transformer for face forgery detection,” TIFS, vol. 18, pp. 1039–1051, 2023.
  • [8] Wanyi Zhuang, Qi Chu, Haojie Yuan, Changtao Miao, Bin Liu, and Nenghai Yu, “Towards intrinsic common discriminative features learning for face forgery detection using adversarial learning,” in ICME. IEEE, 2022, pp. 1–6.
  • [9] Zichang Tan, Zhichao Yang, Changtao Miao, and Guodong Guo, “Transformer-based feature compensation and aggregation for deepfake detection,” SPL, vol. 29, pp. 2183–2187, 2022.
  • [10] Sebastian Gehrmann, Hendrik Strobelt, and Alexander M Rush, “Gltr: Statistical detection and visualization of generated text,” arXiv preprint arXiv:1906.04043, 2019.
  • [11] Zimian Wei, Hengyue Pan, Linbo Qiao, Xin Niu, Peijie Dong, and Dongsheng Li, “Cross-modal knowledge distillation in multi-modal fake news detection,” in ICASSP. IEEE, 2022, pp. 4733–4737.
  • [12] Ting Zou, Zhong Qian, Peifeng Li, and Qiaoming Zhu, “Cross-modal adversarial contrastive learning for multi-modal rumor detection,” in ICASSP. IEEE, 2023, pp. 1–5.
  • [13] Qichao Ying, Xiaoxiao Hu, Yangming Zhou, Zhenxing Qian, Dan Zeng, and Shiming Ge, “Bootstrapping multi-view representations for fake news detection,” in AAAI, 2023.
  • [14] Rui Shao, Tianxing Wu, and Ziwei Liu, “Detecting and grounding multi-modal media manipulation,” in CVPR, 2023, pp. 6904–6913.
  • [15] Yu Huang, Junyang Lin, Chang Zhou, Hongxia Yang, and Longbo Huang, “Modality competition: What makes joint training of multi-modal network fail in deep learning?(provably),” in ICML. PMLR, 2022, pp. 9226–9259.
  • [16] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko, “End-to-end object detection with transformers,” in ECCV. Springer, 2020, pp. 213–229.
  • [17] Bowen Cheng, Alex Schwing, and Alexander Kirillov, “Per-pixel classification is not all you need for semantic segmentation,” NeurIPS, vol. 34, pp. 17864–17875, 2021.
  • [18] Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, and Silvio Savarese, “Generalized intersection over union: A metric and a loss for bounding box regression,” in CVPR, 2019, pp. 658–666.
  • [19] Zi-Yi Dou, Yichong Xu, Zhe Gan, Jianfeng Wang, Shuohang Wang, Lijuan Wang, Chenguang Zhu, Pengchuan Zhang, Lu Yuan, Nanyun Peng, et al., “An empirical study of training end-to-end vision-and-language transformers,” in CVPR, 2022, pp. 18166–18176.
  • [20] Ilya Loshchilov and Frank Hutter, “Decoupled weight decay regularization,” arXiv preprint arXiv:1711.05101, 2017.
  • [21] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al., “Learning transferable visual models from natural language supervision,” in ICML. PMLR, 2021, pp. 8748–8763.
  • [22] Wonjae Kim, Bokyung Son, and Ildoo Kim, “Vilt: Vision-and-language transformer without convolution or region supervision,” in ICML. PMLR, 2021, pp. 5583–5594.

Supplementary Material

Appendix A Intra-domain and inter-domain comparison.

Table 3: Intra-domain and inter-domain comparison on DGM$^4$ and Fakeddit.

| Methods | DGM$^4$ (AUC) | Fakeddit (AUC) | Overall (Avg) |
|---|---|---|---|
| HAMMER [14] | 93.19 | 62.81 | 78.00 |
| Ours | 95.11 | 64.26 | 79.78 |

DGM$^4$ is currently the only dataset in the field of multi-modal media manipulation detection and grounding. To verify the effectiveness of our method on other datasets, we select Fakeddit, a large multi-modal fake news dataset with a similar task. We train the model on the DGM$^4$ dataset and test it on the Fakeddit dataset. As shown in Table 3, the AUC of our model is higher than that of HAMMER [14], demonstrating the generalization ability of our method.