1 Shanghai Jiao Tong University    2 Tsinghua Shenzhen International Graduate School    3 Huawei Noah’s Ark Lab
Email: {chenyirui,weiliucv}@sjtu.edu.cn [email protected] {huangxudong9,wei.lee,hujie23}@huawei.com

GIM: A Million-scale Benchmark for Generative Image Manipulation Detection and Localization

Yirui Chen 1,3,*    Xudong Huang 3,*    Quan Zhang 2,3,*    Wei Li 3,‡    Mingjian Zhu 3    Qiangyu Yan 3    Simiao Li 3    Hanting Chen 3    Hailin Hu 3    Jie Yang 1    Wei Liu 1,†    Jie Hu 3,†
Abstract

The extraordinary ability of generative models to edit and generate realistic images poses a serious threat to the trustworthiness of multimedia data and drives research on image manipulation detection and localization (IMDL). However, the lack of a large-scale data foundation hinders progress on the IMDL task. In this paper, we design a local manipulation pipeline that incorporates SAM, ChatGPT, and generative models. On this basis, we propose the GIM dataset, which has the following advantages: 1) Large scale: it includes over one million pairs of AI-manipulated and real images. 2) Rich image content: it encompasses a broad range of image classes. 3) Diverse generative manipulation: images are manipulated with state-of-the-art generators across various manipulation tasks. These advantages allow a more comprehensive evaluation of IMDL methods and extend their applicability to diverse images. We introduce two benchmark settings to evaluate the generalization capability and comprehensive performance of baseline methods. In addition, we propose a novel IMDL framework, termed GIMFormer, which consists of a ShadowTracer, a Frequency-Spatial Block (FSB), and a Multi-window Anomalous Modelling (MWAM) module. Extensive experiments on GIM demonstrate that GIMFormer significantly surpasses previous state-of-the-art methods on both benchmarks. Code page: GIM.

Keywords:
Dataset and Benchmark · Image Manipulation Detection and Localization · Forgery Traces Extraction · Multi-scale Feature Learning
* These authors contributed equally to this research, which was done during Yirui Chen and Quan Zhang’s internship at Huawei Noah’s Ark Lab. ‡ Project lead. † Corresponding authors.

1 Introduction

Figure 1: Example images from the GIM dataset. Our dataset includes images manipulated by three state-of-the-art generators: Stable-Diffusion, GLIDE, and DDNM. The first column shows authentic images, the second column shows forgery masks, and the third column shows forged images.

Images are among the most essential media for information transmission in modern society and spread widely on public platforms such as news sites and social media. With the rapid advancement of generative methods [11, 52, 16, 4], such information can be tampered with more easily for specific purposes, for example by altering an object or a person. Because manipulated images of this kind are visually plausible, they are particularly convincing and pose serious information security risks to society. For instance, an AI-generated image of black smoke billowing from what appeared to be a government building near the Pentagon set off investor fears, sending stocks tumbling [46]. It is therefore urgent to develop effective methods that determine whether an image has been modified by generative models and identify the exact location of the tampering. However, traditional image manipulation detection and localization (IMDL) datasets [55, 13, 60] overlook powerful generative models, and generative IMDL datasets are limited in scale. They are thus insufficient to comprehensively evaluate IMDL methods and to benefit the IMDL community.

To this end, we propose a million-scale generative IMDL dataset, termed GIM, to provide a reliable database for AI-Generated Content (AIGC) security. GIM leverages diffusion models [44, 52, 59] and SAM [32, 51], with ImageNet [10] and VOC [14] as input. SAM locates the tampering region, and diffusion models paint plausible content in that region. The high-quality diffusion models ensure dataset fidelity, and the rich categories guarantee the diversity of images in GIM. In total, GIM contains over one million generatively tampered images. On this basis, IMDL methods are evaluated and benchmarked. To choose an appropriate benchmark scale, we explore the impact of different amounts of manipulated training data. The final benchmark contains about 300k manipulated images with their tampering masks. To simulate real-world conditions, we investigate the effect of degradation and apply random degradation to the images. Two benchmark settings are designed to evaluate comprehensive performance and generalization. In addition, we reproduce existing IMDL methods under fair conditions, providing a basis for future development. Overall, as a generative IMDL benchmark, GIM has the following advantages: 1) GIM has a large and reasonable data scale, with rich image categories and contents. 2) GIM covers various generative manipulation models and tasks. 3) GIM proposes two settings for verifying the generalization capability and comprehensive performance of existing methods.

Existing methods emphasize traditional image manipulations, also known as “cheapfakes”. Generative manipulations, in contrast, substantially alter content without apparent frequency or structural inconsistency. To address this issue, we introduce GIMFormer, a transformer-based framework for generative IMDL. The ShadowTracer is designed to embed the nuanced artifacts inherent in generative tampering and serves as prior information. The Frequency-Spatial Block harvests manipulation clues in the frequency and spatial domains. Furthermore, the Multi Windowed Anomalous Modelling Module captures local inconsistencies at different scales to refine the features. In this way, our model extracts features from both the RGB image and the learned tampering trace map, captures details in the frequency and spatial domains, and models inconsistencies at different scales for precise manipulation detection and localization.

We conduct experiments on our proposed GIM. Both the qualitative and quantitative results demonstrate that GIMFormer can significantly outperform previous state-of-the-art methods in terms of both comprehensive capability and generalization ability.

In summary, our main contributions are as follows:

  • We build a local manipulation pipeline and construct a comprehensive large-scale dataset with various SOTA generative models and manipulation tasks to further facilitate the research on the IMDL task.

  • We investigate the impact of data scales and degradation on IMDL tasks to select an appropriate scale and simulate real-world scenarios. Based on this, we construct a reasonable benchmark and two settings to evaluate the generalization and comprehensive capabilities of IMDL methods.

  • We propose GIMFormer, a generative IMDL framework consisting of the ShadowTracer to embed subtle traces, the Frequency-Spatial Block to extract forgery features in the frequency and spatial domains, and the Multi-Windowed Anomalous Modelling Module to capture pixel-level inconsistencies at multiple scales.

  • Extensive experiments show that GIMFormer achieves state-of-the-art IMDL performance in terms of generalization and comprehensive capabilities.

2 Related Work

2.1 Image Forensic Datasets

The research community has dedicated significant effort to establishing robust datasets for image forensics. Early datasets primarily focus on one type of manipulation. The Columbia dataset [43, 24] is the pioneering dataset focusing on splicing forgeries. CASIA v1.0 and CASIA v2.0 [13] are the first to incorporate multiple types of manipulation within a single dataset, with forged images manually crafted using Adobe Photoshop. The Wild Web dataset [66] collects forged images from the Internet, significantly surpassing the scale of previous datasets. The NIST datasets [17] provide an extensive collection that serves as a crucial standard for assessing media tampering detection methods. DEFACTO [42] is automatically generated from the Microsoft COCO dataset and encompasses four categories of forgeries: Copy and Move, Splicing, Object Removal, and Morphing. IMD2020 [45] provides a range of locally manipulated images generated through manual operations or random slicing, while also collecting online images without apparent manipulation traces. Recently, HiFi-IFDL [20] constructed a hierarchical fine-grained dataset containing several representative forgery methods. AutoSplice [29] leverages large-scale language-image models such as DALL-E2 [48] to facilitate automatic image editing and generation. CocoGlide [19] contains 512 images generated with the GLIDE diffusion model. Other datasets focus exclusively on facial manipulations [54, 18, 9] or entirely synthesized images [71, 56, 3].

However, these datasets have certain limitations, such as small data sizes and limited manipulation techniques or tasks. Recent advances in generative models [44, 23, 52] have demonstrated remarkable abilities in generative-based manipulation. These developments have led to the emergence of open-source projects [51, 65, 15] for manipulation. Leveraging the power of these models, we introduce GIM, a large-scale generative-based manipulation dataset.

Figure 2: An overview of the dataset generation pipeline. Given the original image and a region of interest, specified by a class attribute or mouse input as the user query, the zero-shot segmentation model SAM extracts the manipulation mask. In addition, by interacting with ChatGPT, we compose the tampering prompt from replacement classes. The final images are produced by generative models from the original image, the tampering mask, and the tampering prompt.

2.2 Image Manipulation Detection and Localization

Early studies on natural image manipulation localization mainly focus on detecting a specific type of manipulation [7, 27, 34, 50]. Because the exact manipulation type is unknown in real-world scenarios, most state-of-the-art methods [40, 19, 58, 64, 38, 28, 61] concentrate on general manipulation. RGB-N [70] proposes a dual-stream network for manipulation detection: one stream extracts RGB features, and the other leverages noise features to model the disparities. MVSS-Net [12] uses a two-stream CNN to extract noise features and employs a Dual Attention Module to merge the outputs of the two streams. PSCC-Net [40] extracts hierarchical features with a top-down path and detects whether the input image has been manipulated using a bottom-up path. ObjectFormer [58] uses object prototypes to model object-level consistencies and finds patch-level inconsistencies to detect manipulation. However, these methods focus on “cheapfake” detection and struggle with generative manipulation, because the tampering traces are subtle and the methods lack an understanding of generative manipulation patterns. Certain works [26, 6, 9, 39, 2] focus on localizing manipulations made with generative models, yet they primarily concentrate on human faces.

In this work, we leverage a deep network to embed the subtle artifacts inherent in generative tampering as a prior trace map. We then employ a dual network to fuse the trace map and the RGB image, combining frequency and spatial information and capturing pixel-level inconsistencies at multiple scales to detect and localize the manipulation.

Table 1: Summary of previous image manipulation datasets and GIM. We list the number of images in each dataset, the image content, the image sizes, and the manipulation techniques covered. ⋆ denotes that HiFi-IFDL is composed of multiple datasets, including entirely synthesized images, traditionally manipulated images, and fake face images.
| Dataset | Image Content | Image Size | # Authentic Images | # Forged Images |
| FaceForensics++ [54] | Face | 480p-1080p | 1000 | 4000 |
| DFDC [9] | Face | 240p-2160p | 19,154 | 100,000 |
| DeeperForensics [30] | Face | 1080p | 50,000 | 10,000 |
| Columbia Gray [43] | General | 128 × 128 | 933 | 912 |
| CASIA V2.0 [13] | General | 320 × 240 - 800 × 600 | 7,200 | 5,123 |
| IMD2020 [45] | General | 1062 × 866 | 35,000 | 35,000 |
| Coverage [60] | General | 400 × 486 | 100 | 100 |
| AutoSplice [29] | General | 256 × 256 - 4232 × 4232 | 3,621 | 2,273 |
| HiFi-IFDL [20] | General | - | - | 1,000,000⋆ |
| CocoGlide [19] | General | 256 × 256 | - | 512 |
| GIM | General | 64 × 64 - 6000 × 3904 | 1,140,000 | 1,140,000 |
Manipulation Category columns (Traditional, Gen. Reprint, Gen. Removal): per-dataset marks not recoverable.

3 Dataset and Benchmark Construction

In Section 3.1, we propose an automatic data synthesis pipeline that efficiently produces generatively manipulated images from unlabeled images. Leveraging this pipeline, we construct a comprehensive large-scale dataset. To build a benchmark of reasonable scale for evaluating IMDL methods, we examine the impact of training data scale in Section 3.2. To emulate real-world conditions and establish an objective benchmark, the manipulated images and the original authentic images are subjected to three random degradations (JPEG compression, downsampling, and Gaussian blur), as detailed in Section 3.3. The GIM benchmark comprises over 320k manipulated images paired with authentic images (100 distinct ImageNet labels for each generative model) for algorithm evaluation. Generated images are shown in Figure 1 together with their original images and tampering masks. The specific composition of GIM can be found in the Supplementary Materials. In Section 3.4, we introduce the criteria and settings used to evaluate the performance of IMDL methods.

3.1 Generation and Details of the dataset

Constructing a generative IMDL dataset requires a large-scale natural image dataset. We choose ImageNet [10] and VOC [14] as the starting point because they are 1) large and diverse and 2) broad in category coverage. We therefore use these datasets as the source images for generative manipulation. Beyond covering varied scenes, the proposed dataset employs multiple tampering methods. Specifically, we reprint the target class with generative inpainting methods [52, 44] or remove the target region with the model of [59]. Our data generation pipeline builds on open-source projects [33, 51].

Figure 2 illustrates the overall process. First, given the class attribute or user query, the zero-shot segmentation network [33] extracts the local manipulation mask. For reprint tampering, the image category is embedded in a replacement prompt sent to ChatGPT, which returns an approximate category. The approximate category is then embedded into the inpainting prompt. Given the original image, the manipulation mask, and the inpainting prompt, the generative models produce the reprint-tampered result. For removal tampering, the generative model requires only the original image and the manipulation mask. The GIM dataset uses the entire ImageNet dataset and the Stable-Diffusion generator to construct a comprehensive dataset, providing a reliable database and laying the foundation for the research below.
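The control flow of this pipeline can be sketched as follows. This is a minimal illustration, not the released implementation: the names `manipulate`, `ManipulationSample`, `segment`, `suggest_class`, `inpaint`, and `remove` are hypothetical, the callables are stand-ins for SAM, ChatGPT, and the generative models, and the prompt template is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# All names are hypothetical. The callables stand in for SAM, ChatGPT, and a
# generative inpainting/removal model; none of them is actually invoked here.


@dataclass
class ManipulationSample:
    original: object       # authentic image
    mask: object           # tampering mask from the segmentation model
    forged: object         # generated (tampered) image
    task: str              # "reprint" or "removal"
    prompt: Optional[str]  # inpainting prompt, None for removal


def manipulate(image, class_name: str, task: str,
               segment: Callable, suggest_class: Callable,
               inpaint: Callable, remove: Callable) -> ManipulationSample:
    """Produce one forged sample following the reprint/removal flow."""
    # Step 1: turn the class label (or user query) into a manipulation mask.
    mask = segment(image, class_name)
    if task == "reprint":
        # Step 2: ask the language model for an approximate replacement class.
        replacement = suggest_class(class_name)
        # Step 3: embed the replacement class in the inpainting prompt
        # (this template is an assumption, not taken from the paper).
        prompt = f"a photo of a {replacement}"
        forged = inpaint(image, mask, prompt)
    elif task == "removal":
        # Removal needs only the original image and the mask.
        prompt = None
        forged = remove(image, mask)
    else:
        raise ValueError(f"unknown task: {task}")
    return ManipulationSample(image, mask, forged, task, prompt)


# Toy stand-ins so the sketch runs end to end.
demo = manipulate(
    image="img", class_name="tabby cat", task="reprint",
    segment=lambda img, cls: f"mask({cls})",
    suggest_class=lambda cls: "tiger cat",
    inpaint=lambda img, m, p: f"inpaint({img},{m},{p})",
    remove=lambda img, m: f"remove({img},{m})",
)
```

Passing the models in as callables keeps the sketch independent of any particular SAM or diffusion API.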

3.2 Analysis of Benchmark Scale

With the data generation pipeline described above, we can generate millions, tens of millions, or even billions of samples. However, blindly increasing the amount of data does not necessarily improve algorithm performance and may lead to data redundancy. Therefore, taking data generated by Stable Diffusion as an example, we explore the influence of data volume and number of categories on baseline classification [21] and segmentation [63] algorithms. The training sets are subsets of different scales, and all models are validated on the same test set. As shown in Table 2, the classification and segmentation metrics improve as the dataset scale increases. When the scale reaches 180K, performance is almost saturated: adding image classes or more data leaves the metrics essentially unchanged. These experiments demonstrate that increasing the amount of data or the number of categories brings negligible benefit once the data approaches saturation. Accordingly, for the training data, the GIM benchmark selects 100 different ImageNet labels for each generative model to create tampered images. For the test data, the GIM benchmark uses the entire test dataset for each generative model.

Table 2: Results of classification and segmentation on subsets of different scales. Classification uses ResNet-50 and segmentation uses SegFormer-b0.
| Total Num. of Images | Image Classes | Images per Class | Cls.Acc ↑ | Seg.F1 ↑ | Seg.AUC ↑ |
| 2,800 | 10 | 280 | 0.6306 | 0.2550 | 0.7950 |
| 28,000 | 100 | 280 | 0.7480 | 0.3247 | 0.7995 |
| 180,000 | 100 | 1800 | 0.9129 | 0.5289 | 0.8699 |
| 360,000 | 200 | 1800 | 0.9131 | 0.5291 | 0.8701 |
| 500,000 | 500 | 1000 | 0.9131 | 0.5291 | 0.8703 |

Table 3: Influence of different degradation methods on classification (ResNet-50) and segmentation (SegFormer-b0).
| Degradation | Cls.Acc | Seg.F1 |
| - | 91.29 | 52.89 |
| JPEG | 83.10 | 45.74 |
| Gaussian Blur | 90.94 | 40.14 |
| Downsample | 88.34 | 37.12 |

3.3 Post-processing of Degradation Method

After being spread on the Internet, images undergo various post-processing operations, such as JPEG compression, downsampling, and Gaussian blur. These transformations can pose challenges for image forensics methods. Table 3 reports the impact of degradation. The baseline models are trained on the clean Stable-Diffusion training set and tested on the same test set under a specific degradation method. The results indicate that these degradations make identification more difficult. To better reflect reality and build an objective benchmark, random degradation (JPEG compression, downsampling, Gaussian blur) is applied to the dataset. The parameter details can be found in the Supplementary Materials.
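A minimal sketch of such a random degradation step is given below, using NumPy only. The function names, the blur range, and the downsampling factor are assumptions for illustration (the actual parameters are in the Supplementary Materials). JPEG compression needs an image codec, so it is left as an optional user-supplied callable `jpeg_fn`, which is likewise hypothetical.

```python
import numpy as np


def gaussian_kernel(sigma: float) -> np.ndarray:
    """Normalized 1D Gaussian kernel with radius of about 3 sigma."""
    radius = max(1, int(round(3 * sigma)))
    x = np.arange(-radius, radius + 1, dtype=np.float64)
    k = np.exp(-(x ** 2) / (2 * sigma ** 2))
    return k / k.sum()


def gaussian_blur(img: np.ndarray, sigma: float) -> np.ndarray:
    """Separable Gaussian blur on an H x W x C float image (edge padding)."""
    k = gaussian_kernel(sigma)
    r = len(k) // 2
    h, w = img.shape[:2]
    padded = np.pad(img, ((r, r), (r, r), (0, 0)), mode="edge")
    # Horizontal pass, then vertical pass.
    tmp = sum(wt * padded[:, i:i + w, :] for i, wt in enumerate(k))
    return sum(wt * tmp[i:i + h, :, :] for i, wt in enumerate(k))


def downsample(img: np.ndarray, factor: int) -> np.ndarray:
    """Downsample by average pooling over factor x factor blocks."""
    h, w, c = img.shape
    h2, w2 = h // factor, w // factor
    cropped = img[:h2 * factor, :w2 * factor]
    return cropped.reshape(h2, factor, w2, factor, c).mean(axis=(1, 3))


def random_degrade(img, rng, jpeg_fn=None):
    """Apply one randomly chosen degradation; parameter ranges are assumed."""
    ops = ["blur", "downsample"] + (["jpeg"] if jpeg_fn is not None else [])
    op = str(rng.choice(ops))
    if op == "blur":
        return op, gaussian_blur(img, sigma=float(rng.uniform(0.5, 2.0)))
    if op == "downsample":
        return op, downsample(img, factor=2)
    return op, jpeg_fn(img, quality=int(rng.integers(50, 96)))


rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))
chosen_op, degraded = random_degrade(image, rng)
```

Both the manipulated image and its authentic counterpart would go through the same kind of step; when downsampling is chosen, the tampering mask has to be resized accordingly.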

3.4 Benchmark Settings

Dataset: Datasets for research on generative IMDL are lacking, which makes direct comparison and accurate measurement difficult. We use three cross-generator datasets, GIM-SD (ImageNet data manipulated by Stable Diffusion; the other names follow the same pattern), GIM-GLIDE, and GIM-DDNM, and a cross-distribution dataset, GIM-VOC (VOC data manipulated by Stable Diffusion), to present benchmarking results.

Metrics: We evaluate performance on both the image manipulation detection task and the localization task. For detection, we use accuracy (Cls.Acc) as the evaluation metric; for localization, we adopt the pixel-level AUC and F1 score on manipulation masks. Since computing F1 requires binary masks and binary detection decisions, we binarize the predictions with the Equal Error Rate (EER) threshold.
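As an illustration of this binarization step, the sketch below picks the threshold at which the false positive rate and false negative rate are closest, then computes F1 on the binarized predictions. The function names are hypothetical and the brute-force search over unique scores is chosen for clarity, not efficiency; the paper's exact evaluation code may differ.

```python
import numpy as np


def eer_threshold(scores, labels) -> float:
    """Return the score threshold where FPR and FNR are closest."""
    scores = np.asarray(scores, dtype=np.float64).ravel()
    labels = np.asarray(labels).ravel().astype(bool)
    best_t, best_gap = None, np.inf
    for t in np.unique(scores):
        pred = scores >= t
        # False positive rate over negatives, false negative rate over positives.
        fpr = np.mean(pred[~labels]) if (~labels).any() else 0.0
        fnr = np.mean(~pred[labels]) if labels.any() else 0.0
        gap = abs(fpr - fnr)
        if gap < best_gap:
            best_t, best_gap = t, gap
    return float(best_t)


def f1_score(pred, labels) -> float:
    """F1 = 2TP / (2TP + FP + FN) on flattened binary arrays."""
    pred = np.asarray(pred).ravel().astype(bool)
    labels = np.asarray(labels).ravel().astype(bool)
    tp = np.sum(pred & labels)
    fp = np.sum(pred & ~labels)
    fn = np.sum(~pred & labels)
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom > 0 else 0.0


# Toy pixel scores with ground-truth mask values.
scores = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]
t = eer_threshold(scores, labels)
f1 = f1_score(np.asarray(scores) >= t, labels)
```

In this toy case the threshold 0.4 gives FPR = FNR = 0.5, and the resulting F1 is 0.5.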

Settings: Two settings are proposed to verify the comprehensive and generalization performance of the algorithms. In the cross-generator generalization setting, the models are trained on the GIM-SD training set and tested on the GIM-GLIDE, GIM-DDNM, and GIM-VOC test sets to assess generalization, as shown in Table 5. In the mix-generator comprehensive setting, the models are jointly trained on the GIM-SD, GIM-GLIDE, and GIM-DDNM training sets and tested on each corresponding test set to evaluate comprehensive IMDL performance, as shown in Table 4.

4 Method

We propose the IMDL framework GIMFormer, which uses a dual encoder and decoder architecture. Considering the specifics of generative manipulation, we propose the ShadowTracer in Section 4.1, the Frequency-Spatial Block (FSB) in Section 4.2, and the Multi Windowed Anomalous Modelling module (MWAM) in Section 4.3. Figure 3 gives an overview of the framework. For the input RGB image $x$, we first extract its learned trace map $t$. Then, both $x$ and $t$ are fed into a two-branch network, where a four-stage structure extracts pyramidal features $F_i$ ($i \in [1, 4]$). The RGB branch is composed of the FSB, a Transformer Block [63], and the MWAM. The tracer branch consists of a Transformer Block and the MWAM. In the fusion step, the feature rectification module (FRM) [68] and the feature fusion module (FFM) [68] are used for feature fusion. The four-stage fused features are forwarded to the decoder for the final detection $\hat{y}$ and localization $\hat{M}$. Details are provided in the Supplementary.

Figure 3: GIMFormer architecture. The ShadowTracer takes the RGB image $x$ and extracts the learned manipulation trace map $t$. The encoder uses both the RGB input and the trace map $t$ to extract pyramidal features $F_i$ ($i \in [1, 4]$) across four stages. The four-stage fused features $F_i$ are forwarded to the decoder for generative manipulation detection and localization.

4.1 ShadowTracer

Prior manipulation detection methods mainly focus on “cheapfake” scenes and rely on visible traces. These artifacts include distortions and abrupt changes caused by manipulating the image structure. However, generative tampering significantly alters the content without apparent frequency or structural inconsistency. As shown in Figure 4, its subtle traces appear as inherent patterns rather than as visible traces with inconsistent edges.

ShadowTracer aims to capture the inherent characteristics and subtle traces of generative models. For a doctored image, our objective is to learn a mapping $g_{\phi}$ from the tampered image to its latent disturbed pixel values, where $g_{\phi}$ is a neural network with trainable parameters $\phi$. Our key observation is that the differences introduced by generative models in the data distribution exhibit inherent patterns, and deep neural networks can attempt to reconstruct these variations. At the training stage, we generate pairs of the input image $x_i$ and the generatively tampered image $G(x_i)$; the manipulation trace is then computed as $t_i = G(x_i) - x_i$. The objective function for training $g_{\phi}$ can be formulated as

$$\min_{\phi} \Big\{ \mathcal{L}_r\big(g_{\phi}(G(x_i)),\, t_i\big) \Big\} \tag{1}$$

where $\mathcal{L}_r(\mathbf{x}, \mathbf{y}) = \|\mathbf{x} - \mathbf{y}\|_2$.
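The training target and the loss in Eq. 1 can be illustrated numerically as follows. This is a sketch under stated assumptions: the function names are hypothetical, images are plain float arrays, and the prediction of $g_{\phi}$ is replaced by a fixed array because the network architecture is not specified here.

```python
import numpy as np


def trace_target(x: np.ndarray, g_x: np.ndarray) -> np.ndarray:
    """Manipulation trace t = G(x) - x for an authentic/tampered image pair."""
    return g_x.astype(np.float64) - x.astype(np.float64)


def reconstruction_loss(pred: np.ndarray, target: np.ndarray) -> float:
    """L_r(x, y) = ||x - y||_2, taken over all pixels and channels."""
    return float(np.linalg.norm((pred - target).ravel(), ord=2))


# Toy pair: the "generator" shifts a 2 x 2 region by 0.5 in all 3 channels.
x = np.zeros((4, 4, 3))
g_x = x.copy()
g_x[1:3, 1:3, :] += 0.5

t = trace_target(x, g_x)

# A predictor that outputs all zeros misses the whole trace, while a perfect
# predictor reaches zero loss.
loss_zero_pred = reconstruction_loss(np.zeros_like(t), t)
loss_perfect = reconstruction_loss(t, t)
```

With 12 perturbed entries of magnitude 0.5, the all-zero prediction has loss $\sqrt{12 \cdot 0.25} = \sqrt{3}$, which is what the network is trained to reduce.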

Furthermore, the mapping network should satisfy two properties: it should detect subtle tampering traces and be robust to various real-world image degradations. For this reason, image pairs are generated by mixing original and manipulated images and applying diverse degradation operations at the training stage. Specifically, given an input image $I$, we segment a portion of it and perform generative manipulation to obtain $I_m$. Alpha blending [47] is applied to the original and manipulated images to obscure obvious tampering traces. We then subject the images to degradation operations such as blurring and JPEG compression to obtain the final manipulated image. The network is trained on 64 × 64 pixel patches randomly sampled from the dataset, using the loss in Eq. 1.
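One plausible form of the blending and patch-sampling steps is sketched below. The exact blending formulation of [47] is not reproduced here; the sketch assumes a standard convex combination inside the tampering mask, with a hypothetical strength `alpha`, and the function names are illustrative.

```python
import numpy as np


def alpha_blend(original, manipulated, mask, alpha: float):
    """Blend manipulated content into the original inside the mask.

    Inside the mask: alpha * manipulated + (1 - alpha) * original.
    Outside the mask: the original pixels are kept unchanged.
    """
    m = mask[..., None].astype(np.float64) * alpha
    return m * manipulated + (1.0 - m) * original


def sample_patch(img, rng, size: int = 64):
    """Randomly crop a size x size training patch from an H x W x C image."""
    h, w = img.shape[:2]
    top = int(rng.integers(0, h - size + 1))
    left = int(rng.integers(0, w - size + 1))
    return img[top:top + size, left:left + size]


rng = np.random.default_rng(0)
original = np.zeros((80, 80, 3))
manipulated = np.ones((80, 80, 3))
mask = np.zeros((80, 80))
mask[10:30, 10:30] = 1

blended = alpha_blend(original, manipulated, mask, alpha=0.8)
patch = sample_patch(blended, rng)
```

In training, the same crop coordinates would be applied to the trace target so that each input patch stays aligned with its supervision.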

4.2 Frequency-Spatial Block

When degradation operations are applied, artifacts in manipulated images become difficult to perceive. To improve local expressive ability and effectively harvest discriminative cues in manipulated images, we design a Frequency-Spatial Block (FSB) to extract forgery features in the frequency and spatial domains.

Inspired by recent work [58, 49, 37, 67], as shown in Figure 3, the FSB consists of two branches: a frequency branch and a spatial branch. In the frequency branch, the input $X$ is converted into the frequency domain $\mathcal{F}_T(X)$ using the 2D FFT. A learnable filter $G_i$ is multiplied with the spectrum to modulate it and capture the important frequency information. Subsequently, the inverse FFT converts the feature back to the spatial domain, yielding the frequency-aware features $X_{\text{f}}$. In the spatial branch, the input $X$ is processed through convolutional layers and activated with the LeakyReLU function in a separate network to enhance the expressiveness of the features and obtain refined spatial features $X_{\text{s}}$. Then $X_{\text{f}}$ and $X_{\text{s}}$ are concatenated and passed through convolutional layers and the LeakyReLU activation function to obtain enhanced information, which is combined with the original input $X$ through element-wise summation.

The total process can be formulated as:

$$
\begin{aligned}
X_{\text{f}} &= \hat{\mathcal{F}}_T\big(\mathcal{F}_T(X) \odot G_i\big), \\
X_{\text{s}} &= \mathrm{Conv_L}\big(\mathrm{Conv}(X)\big), \\
X_{out} &= \mathrm{Conv_L}\big([X_{\text{f}}, X_{\text{s}}]\big) + X,
\end{aligned} \tag{2}
$$

where $\odot$ denotes the Hadamard product, $\mathrm{Conv_L}$ denotes convolution followed by LeakyReLU, and $[\cdot]$ denotes the concatenation operation.
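The data flow of Eq. 2 can be sketched in NumPy as follows. Several details are assumptions made only to keep the example short: all convolutions are replaced by 1 × 1 convolutions (channel mixing), the learnable filter is a complex array with the same shape as the full 2D spectrum, the LeakyReLU slope is 0.01, and the weights are random rather than learned. Function and parameter names are hypothetical.

```python
import numpy as np


def leaky_relu(x, slope: float = 0.01):
    return np.where(x > 0, x, slope * x)


def conv1x1(x, weight):
    """1 x 1 convolution: x is (C_in, H, W), weight is (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, x)


def frequency_branch(x, g):
    """X_f = IFFT2(FFT2(X) * G): modulate the spectrum with a filter g."""
    spectrum = np.fft.fft2(x, axes=(-2, -1))
    return np.fft.ifft2(spectrum * g, axes=(-2, -1)).real


def fsb(x, g, w_s1, w_s2, w_out):
    """Frequency-spatial block sketch following the three lines of Eq. 2."""
    x_f = frequency_branch(x, g)
    # Spatial branch: Conv_L(Conv(X)).
    x_s = leaky_relu(conv1x1(conv1x1(x, w_s1), w_s2))
    # Fuse the concatenated branches, then add the residual input.
    fused = leaky_relu(conv1x1(np.concatenate([x_f, x_s], axis=0), w_out))
    return fused + x


rng = np.random.default_rng(0)
c, h, w = 4, 8, 8
x = rng.standard_normal((c, h, w))
g = np.ones((c, h, w), dtype=complex)       # identity filter for the demo
w_s1 = rng.standard_normal((c, c))
w_s2 = rng.standard_normal((c, c))
w_out = rng.standard_normal((c, 2 * c))

out = fsb(x, g, w_s1, w_s2, w_out)
```

With the identity filter, the frequency branch returns the input unchanged, which is a convenient sanity check before training the filter.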

Figure 4: Generative manipulation leaves subtle traces without apparent frequency or structural inconsistency. It substantially modifies the content, and this pixel-level perturbation corrupts the image. ShadowTracer captures the intrinsic pattern of generative tampering and reconstructs the underlying tampering perturbations as prior traces.

4.3 Multi Windowed Anomalous Modelling Module

Image manipulation causes discrepancies at the pixel level. Genuine pixels are expected to be consistent with neighboring pixels, while manipulated pixels may deviate and display anomalies. Former works [62, 35, 28] explore modeling such local inconsistencies. To effectively capture the pixel-level inconsistency between the manipulated and real regions, we introduce the Multi Windowed Anomalous Modelling module (MWAM), which models these differences at multiple scales to obtain fine-grained features.

As shown in Figure 3, given the input feature $F \in \mathbb{R}^{H \times W \times C}$, we calculate the difference between each pixel and its surrounding pixels within a local window in two branches using Eq. 3:

$$
\begin{aligned}
D^k_u[i,j] &= \big(F[i,j] - F^k_u[i,j]\big) / \sigma^{*}, \\
\sigma^{*} &= \mathrm{maximum}\big(\sigma(F),\, 10^{-5} + w_{\sigma}\big),
\end{aligned} \tag{3}
$$

where $u \in \{a, m\}$ denotes the average or maximum branch, $\sigma(F)$ is the standard deviation of $F$, $w_{\sigma}$ is a learnable non-negative weight vector of the same length as $\sigma$, and $F_a^k$ and $F_m^k$ are computed from the average and maximum values of the $k \times k$ window around each pixel.
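Eq. 3 can be sketched in NumPy as below. The sketch assumes odd window sizes, edge padding at the borders, and a per-channel standard deviation over the spatial dimensions; the window sizes (3, 5, 7) are placeholders for the three scales, whose actual values are not given in this section. All function names are hypothetical.

```python
import numpy as np


def window_stat(f, k: int, reduce):
    """Apply `reduce` (np.mean or np.max) over the k x k window of each pixel."""
    h, w, _ = f.shape
    r = k // 2  # assumes odd k
    padded = np.pad(f, ((r, r), (r, r), (0, 0)), mode="edge")
    # Stack the k*k shifted views, then reduce over the window axis.
    views = np.stack(
        [padded[i:i + h, j:j + w, :] for i in range(k) for j in range(k)],
        axis=0,
    )
    return reduce(views, axis=0)


def window_difference(f, k: int, w_sigma, branch: str):
    """D_u^k = (F - F_u^k) / sigma*, with u = 'a' (average) or 'm' (maximum)."""
    reduce = np.mean if branch == "a" else np.max
    f_k = window_stat(f, k, reduce)
    sigma = f.std(axis=(0, 1))                      # one value per channel
    sigma_star = np.maximum(sigma, 1e-5 + w_sigma)  # avoids division by zero
    return (f - f_k) / sigma_star


def multi_scale_difference(f, ks, w_sigma, branch: str):
    """Concatenate the differences of one branch over several window sizes."""
    return np.concatenate(
        [window_difference(f, k, w_sigma, branch) for k in ks], axis=-1
    )


rng = np.random.default_rng(0)
feat = rng.standard_normal((6, 6, 2))
w_sigma = np.full(2, 0.1)  # stands in for the learnable non-negative weights

d_avg = multi_scale_difference(feat, (3, 5, 7), w_sigma, "a")
d_max = multi_scale_difference(feat, (3, 5, 7), w_sigma, "m")
```

In the maximum branch every entry is non-positive, since a pixel can never exceed the maximum of a window that contains it.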

Different sizes k𝑘kitalic_k are selected to model the inconsistency at different scales. Then, the obtained N=3𝑁3N=3italic_N = 3 different-scale Daksubscriptsuperscript𝐷𝑘𝑎D^{k}_{a}italic_D start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT and Dmksubscriptsuperscript𝐷𝑘𝑚D^{k}_{m}italic_D start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT are concatenated and fed into a convolutional network to obtain an anomaly map Masubscript𝑀𝑎M_{a}italic_M start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT and Mmsubscript𝑀𝑚M_{m}italic_M start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT of the same size as the original input.

Additionally, the anomaly score mask 𝑺𝒖^H×W^subscript𝑺𝒖𝐻𝑊\hat{\bm{S_{u}}}\in H\times Wover^ start_ARG bold_italic_S start_POSTSUBSCRIPT bold_italic_u end_POSTSUBSCRIPT end_ARG ∈ italic_H × italic_W of the feature is computed using Eq.4.

$$
\begin{aligned}
\hat{\bm{f}}_{u} &= \mathrm{DW\text{-}Conv}\left(\bm{f}\right), \\
\hat{\bm{S}}_{u} &= \operatorname{Sigmoid}\left(\mathrm{Conv}(C,1)\left(\hat{\bm{f}}_{u}\right)\right),
\end{aligned}
\tag{4}
$$

where $\mathrm{DW\text{-}Conv}$ denotes a $3 \times 3$ depth-wise convolution layer.

The element-wise multiplication of the anomaly score $\hat{\bm{S}}_{u}$ and the anomaly map $M_{u}$ captures the anomaly information. Next, we compute the element-wise sum of the resulting anomaly-aware maps and the input feature map $X$ to obtain an anomaly-sensitive feature map. The whole process can be described as:

$$\hat{X} = X + \hat{\bm{S}}_{a} \times M_{a} + \hat{\bm{S}}_{m} \times M_{m} \tag{5}$$
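
As a minimal illustration of Eq. 5, the following Python sketch applies the residual fusion to small 2D lists. The function name `fuse_anomaly` is hypothetical, and the sketch assumes that all five inputs share the same spatial shape; any broadcasting of the score masks across channels is omitted.

```python
def fuse_anomaly(x, s_a, m_a, s_m, m_m):
    """Element-wise X_hat = X + S_a * M_a + S_m * M_m on same-shape 2D lists."""
    return [
        [xv + sa * ma + sm * mm
         for xv, sa, ma, sm, mm in zip(x_row, sa_row, ma_row, sm_row, mm_row)]
        for x_row, sa_row, ma_row, sm_row, mm_row in zip(x, s_a, m_a, s_m, m_m)
    ]


# Toy example: one row, two pixels.
x_hat = fuse_anomaly(
    x=[[1.0, 2.0]],
    s_a=[[0.5, 0.0]], m_a=[[2.0, 4.0]],
    s_m=[[1.0, 1.0]], m_m=[[3.0, -1.0]],
)
```

Because the scores come from a sigmoid, they act as soft gates: a score near zero leaves the input feature unchanged at that pixel, while a score near one adds the full anomaly response.
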
Table 4: Benchmarking IMDL models for manipulation detection and localization under the mix-generator comprehensive setting. The models are trained on the whole GIM training set and tested on the corresponding test sets. Detection Cls.Acc (%) and localization F1 (%) and AUC (%) are reported. Params and GFLOPS denote the models' parameter counts, measured in millions (M), and computational complexity. ‡ indicates that the original paper did not provide code; we reproduced the method and evaluated it under the same settings.

| Method | Params (M) | GFLOPS | GIM-SD Cls.Acc | GIM-SD F1 | GIM-SD AUC | GIM-GLIDE Cls.Acc | GIM-GLIDE F1 | GIM-GLIDE AUC | GIM-DDNM Cls.Acc | GIM-DDNM F1 | GIM-DDNM AUC |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ManTra-Net [62] | 4.0 | 1009.7 | 61.08 | 37.48 | 80.83 | 70.99 | 49.11 | 83.29 | 53.99 | 33.12 | 74.94 |
| MVSS-Net [12] | 146.9 | 160.0 | 56.12 | 23.17 | 72.03 | 61.29 | 33.12 | 74.94 | 49.15 | 14.11 | 70.09 |
| SPAN [25] | 15.4 | 30.9 | 53.15 | 35.62 | 79.28 | 60.01 | 39.46 | 81.21 | 59.15 | 32.1 | 73.55 |
| PSCC-Net [40] | 4.1 | 107.3 | 52.28 | 31.48 | 83.85 | 66.52 | 53.68 | 86.37 | 56.33 | 41.77 | 85.8 |
| ObjectFormer‡ [58] | 14.6 | 249.6 | 59.12 | 26.82 | 85.16 | 70.12 | 40.12 | 85.22 | 54.27 | 33.12 | 86.82 |
| TruFor‡ [19] | 67.8 | 90.1 | 67.12 | 44.13 | 84.52 | 80.19 | 59.32 | 92.96 | 63.34 | 44.52 | 87.60 |
| SegFormer [63] | 27.5 | 41.3 | 64.25 | 46.17 | 83.26 | 78.11 | 56.77 | 88.73 | 69.29 | 40.19 | 84.65 |
| GIMFormer (Ours) | 95.9 | 96.2 | 70.92 | 58.61 | 88.25 | 83.89 | 77.31 | 95.42 | 76.72 | 56.25 | 88.31 |

4.4 Loss Function

For manipulation detection, we apply the light-weight backbone of [57] to the fourth-stage feature to compute the final binary prediction $\hat{y}$. For manipulation localization, we use the MLP decoder [63] as the segmentation head to obtain a predicted mask $\hat{M}$. Given the ground-truth label $y$ and mask $M$, we train GIMFormer with the following objective function:

$$\mathcal{L} = \mathcal{L}_{cls}(y, \hat{y}) + \lambda \mathcal{L}_{seg}(M, \hat{M}), \tag{6}$$

where both $\mathcal{L}_{cls}$ and $\mathcal{L}_{seg}$ are binary cross-entropy losses, and $\lambda = 1$ is a balancing hyperparameter.
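
The objective in Eq. 6 can be illustrated with a small, dependency-free Python sketch. The helper names `bce` and `total_loss` are hypothetical. Averaging the segmentation loss over pixels and clamping probabilities by a small `eps` for numerical stability are assumptions, since the text does not state the reduction.

```python
import math


def bce(target, prob, eps=1e-7):
    """Binary cross-entropy for one prediction; prob is clamped to (eps, 1 - eps)."""
    p = min(max(prob, eps), 1.0 - eps)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


def total_loss(y, y_hat, mask, mask_hat, lam=1.0):
    """L = L_cls(y, y_hat) + lam * L_seg(M, M_hat).

    Assumption: L_seg is the mean of the per-pixel binary cross-entropy.
    """
    l_cls = bce(y, y_hat)
    pixels = [(m, p) for m_row, p_row in zip(mask, mask_hat)
              for m, p in zip(m_row, p_row)]
    l_seg = sum(bce(m, p) for m, p in pixels) / len(pixels)
    return l_cls + lam * l_seg


# Toy example: a manipulated image (y = 1) with a 1 x 2 mask.
loss = total_loss(y=1.0, y_hat=0.5, mask=[[1.0, 0.0]], mask_hat=[[0.5, 0.5]])
```

With $\lambda = 1$, the image-level and pixel-level terms contribute equally, so an uninformative prediction of 0.5 everywhere gives $2 \ln 2$ in this toy case.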

4.5 Implementation Details

Our approach includes two separate training steps. First, we train the ShadowTracer on a dataset synthesized from ImageNet, following a data generation method similar to the one described in Section 4.1. Then, we train the encoder and decoder of the model according to the two settings in GIM, as described in Section 3.4. We train our models on eight V100 GPUs with an initial learning rate (LR) of $6 \times 10^{-5}$, which is scheduled by the poly strategy with power 0.9 over 40 epochs. The optimizer is AdamW [41] with epsilon $10^{-8}$ and weight decay $10^{-2}$, and the batch size is 4 on each GPU.
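
For reference, the commonly used form of the poly learning-rate schedule is sketched below in Python. The function name `poly_lr` is hypothetical, and the sketch assumes that the rate decays to zero at the final step with no warm-up and no lower bound; the text does not state whether the schedule is stepped per iteration or per epoch.

```python
def poly_lr(step, total_steps, base_lr=6e-5, power=0.9):
    """Poly decay: lr = base_lr * (1 - step / total_steps) ** power."""
    progress = min(step, total_steps) / total_steps
    return base_lr * (1.0 - progress) ** power


# Learning rate at the start, middle, and end of a 40-epoch run (per-epoch stepping assumed).
schedule = [poly_lr(epoch, 40) for epoch in (0, 20, 40)]
```

With eight GPUs and a per-GPU batch size of 4, the effective batch size is 32 images per optimization step.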

Table 5: Benchmarking IMDL models for manipulation detection and localization under the cross-generator generalization setting. The models are trained on the GIM-SD training set and tested on all the test sets to explore their generalization ability. Detection Cls.Acc (%) and localization F1 (%) and AUC (%) are reported. ‡ indicates that the original paper did not provide code; we reproduced the method and evaluated it under the same settings.

| Method | GIM-SD Cls.Acc | GIM-SD F1 | GIM-SD AUC | GIM-SD(VOC) Cls.Acc | GIM-SD(VOC) F1 | GIM-SD(VOC) AUC | GIM-GLIDE Cls.Acc | GIM-GLIDE F1 | GIM-GLIDE AUC | GIM-DDNM Cls.Acc | GIM-DDNM F1 | GIM-DDNM AUC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ManTra-Net [62] | 73.12 | 43.18 | 80.18 | 63.18 | 27.37 | 72.74 | 58.21 | 24.53 | 74.61 | 39.52 | 16.8 | 58.31 |
| MVSS-Net [12] | 56.12 | 25.13 | 82.15 | 56.30 | 21.2 | 73.17 | 53.12 | 23.17 | 73.11 | 48.12 | 10.11 | 50.15 |
| SPAN [25] | 52.15 | 47.32 | 56.93 | 52.17 | 29.86 | 55.17 | 50.03 | 30.14 | 56.08 | 44.17 | 14.30 | 56.67 |
| PSCC-Net [40] | 58.24 | 48.36 | 87.29 | 54.79 | 39.36 | 83.80 | 51.11 | 29.60 | 79.76 | 40.52 | 4.40 | 48.52 |
| ObjectFormer‡ [58] | 67.12 | 39.54 | 87.93 | 58.19 | 29.13 | 81.20 | 54.19 | 17.74 | 81.02 | 49.23 | 7.16 | 58.95 |
| TruFor‡ [19] | 74.00 | 49.85 | 86.15 | 65.15 | 31.95 | 80.13 | 50.12 | 22.39 | 76.25 | 49.19 | 5.70 | 53.11 |
| SegFormer [63] | 71.93 | 53.08 | 87.11 | 60.03 | 28.17 | 80.97 | 54.11 | 21.03 | 75.12 | 50.21 | 4.70 | 51.33 |
| GIMFormer (Ours) | 78.96 | 61.75 | 90.61 | 67.19 | 50.01 | 84.01 | 67.01 | 39.27 | 81.02 | 50.23 | 19.37 | 60.12 |

4.6 Comparison with state-of-the-art methods

We compare our method with various state-of-the-art IMDL methods [62, 12, 25, 40, 19, 58] on generative manipulation detection and localization. In addition, the vanilla SegFormer (MiT-B2) [63] is also compared, since our method is based on that architecture. As shown in Table 4 and Table 5, these methods are tested under various settings to verify their performance and generalization. Note that some methods are not explicitly designed for image-level detection, in which case we use the maximum of the prediction map as the detection statistic. All methods are trained and evaluated under the same implementation details.
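
The image-level detection statistic used for localization-only methods can be sketched as follows. The function names are hypothetical, and the decision threshold of 0.5 is an assumption, since the text only specifies that the maximum of the prediction map is used as the statistic.

```python
def image_level_score(pred_map):
    """Detection statistic: the maximum value of the pixel-level prediction map."""
    return max(v for row in pred_map for v in row)


def is_manipulated(pred_map, threshold=0.5):
    """Assumed decision rule: flag the image when the statistic reaches the threshold."""
    return image_level_score(pred_map) >= threshold


# Toy 2 x 2 prediction map with one confident pixel.
score = image_level_score([[0.02, 0.10], [0.91, 0.05]])
```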

Cross-generator generalization capability comparison. Table 5 reports the generalization performance of all the methods. GIMFormer achieves the best result on every metric and test set, tying with ObjectFormer only on GIM-GLIDE AUC, which demonstrates both in-domain detection ability and cross-domain generalization ability. For in-domain detection, our method effectively catches the subtle artifacts inherent in generative manipulation and accurately localizes them. In contrast, other methods may be confused because they attempt to learn specific content, which makes it difficult to accurately detect and localize generative tampering patterns. For cross-domain generalization, GIMFormer achieves state-of-the-art performance: it detects manipulations produced by different generative models, as shown by the results on the GIM-GLIDE and GIM-DDNM test sets. In addition, GIMFormer works well on data generated by the same generator from a different distribution, whereas existing methods show an obvious performance drop, as shown by the results on the GIM-SD(VOC) test set. Qualitative results for visual comparison are illustrated in Figure 6.

Mix-generator comprehensive capability comparison. Table 4 reports the comprehensive performance of all the methods. GIMFormer again outperforms all other methods on all the test sets, demonstrating its superior ability to identify generative tampering. Existing methods have low pixel-level F1 scores on this benchmark, which means that they cannot accurately identify the tampered areas. Qualitative results for visual comparison are illustrated in Figure 5.

Figure 5: Qualitative comparison of GIMFormer with state-of-the-art methods on GIM under the mix-generator comprehensive setting. From top to bottom, we show examples from authentic images, GIM-SD, GIM-GLIDE, and GIM-DDNM.

Figure 6: Qualitative comparison of GIMFormer with state-of-the-art methods on GIM under the cross-generator generalization setting. From top to bottom, we show examples from GIM-SD, GIM-GLIDE, and GIM-DDNM.

4.7 Ablation Analysis

Effectiveness of the proposed modules. We take the simple baseline proposed in [63] and gradually integrate the new key components. Experiments are carried out on the GIM-GLIDE test set, and the quantitative results are listed in Table 6. ShadowTracer (ST) brings significant improvements over the vanilla baseline (F1 from 56.77 to 67.12 and AUC from 88.73 to 91.99). Adding MWAM further increases F1 by 6.29 and AUC by 2.02 percentage points, which indicates that differential information at multiple scales is crucial for accurate tampering localization. Using FSB to dynamically harvest complementary frequency and spatial cues improves performance further, particularly in detection (Cls.Acc from 80.19 to 83.89). These results verify that ShadowTracer, FSB, and MWAM effectively improve the performance of the baseline model.

Table 6: Ablation results on the GIM-GLIDE test set using different variants of GIMFormer. Detection Cls.Acc (%) and localization F1 (%) and AUC (%) are reported. ST denotes ShadowTracer.

| Variants | Cls.Acc | F1 | AUC |
|---|---|---|---|
| Baseline | 78.11 | 56.77 | 88.73 |
| +ST | 79.96 | 67.12 | 91.99 |
| +ST+MWAM | 80.19 | 73.41 | 94.01 |
| +ST+MWAM+FSB | 83.89 | 77.31 | 95.42 |

Table 7: Generalization experiments of ShadowTracer, in percentage (%). ST denotes ShadowTracer and ST(SD) denotes ShadowTracer trained on data generated by Stable Diffusion.

| Variants | GIM-GLIDE Cls.Acc | GIM-GLIDE F1 | GIM-GLIDE AUC | GIM-DDNM Cls.Acc | GIM-DDNM F1 | GIM-DDNM AUC |
|---|---|---|---|---|---|---|
| GIMFormer w ST(SD) | 83.1 | 78.1 | 95.4 | 77.4 | 58.8 | 89.4 |
| GIMFormer w/o ST | 80.0 | 68.9 | 91.3 | 73.1 | 51.3 | 86.1 |

Generalization of ShadowTracer. We first train ShadowTracer on data generated by Stable Diffusion and then keep its weights fixed. Following this pretraining phase, we train the backbone on the GIM-GLIDE and GIM-DDNM training sets, with and without ShadowTracer. The detection results (Cls.Acc) and localization results (F1 and AUC) are presented in Table 7, which shows that the pretrained ShadowTracer significantly improves performance in cross-generator IMDL tasks. Notably, on the GIM-GLIDE dataset, with the same manipulation type, ShadowTracer yields gains of about 9 percentage points in F1 (68.9 to 78.1) and 4 points in AUC (91.3 to 95.4) for localization, and about 3 points in detection accuracy (80.0 to 83.1). On the GIM-DDNM dataset, with various manipulation types, F1 improves by about 7 points (51.3 to 58.8), AUC by about 3 points (86.1 to 89.4), and accuracy by about 4 points (73.1 to 77.4).
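
The gains quoted above follow directly from the rows of Table 7. The short Python check below recomputes them as absolute differences in percentage points; the variable and function names are hypothetical, and the numbers are copied from Table 7 in the order (Cls.Acc, F1, AUC).

```python
# Rows of Table 7 as (Cls.Acc, F1, AUC).
with_st = {"GIM-GLIDE": (83.1, 78.1, 95.4), "GIM-DDNM": (77.4, 58.8, 89.4)}
without_st = {"GIM-GLIDE": (80.0, 68.9, 91.3), "GIM-DDNM": (73.1, 51.3, 86.1)}


def gains(with_row, without_row):
    """Absolute gain in percentage points, rounded to one decimal place."""
    return tuple(round(a - b, 1) for a, b in zip(with_row, without_row))


result = {name: gains(with_st[name], without_st[name]) for name in with_st}
```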

5 Conclusion

We address the challenge of detecting and localizing generative manipulation and provide a reliable database (GIM) for AIGC security. The proposed dataset leverages multiple mainstream generators and tampering methods to provide a variety of generative manipulation data. Additionally, we introduce GIMFormer, a novel transformer-based IMDL framework. ShadowTracer is designed to catch subtle artifacts in generative tampering. While the Frequency-Spatial Block gathers frequency and spatial information, the Multi-window Anomalous Modelling module captures pixel-level inconsistencies at multiple scales for fine-grained features. Extensive experiments demonstrate the superior performance of our model, which achieves state-of-the-art results.

Limitations. The GIM dataset shares classes with ImageNet and VOC, so it may not cover objects that emerge in the future as the real world evolves. Fine-tuning pre-trained models on data containing new objects can address this issue. In addition, current research primarily centers on image manipulation. The emergence of realistic video generation models such as Sora [5] presents fresh challenges as AI manipulation extends into videos. In future work, we plan to extend our research to video manipulation.

References

  • [1] Amerini, I., Ballan, L., Caldelli, R., Del Bimbo, A., Serra, G.: A sift-based forensic method for copy–move attack detection and transformation recovery. IEEE transactions on information forensics and security 6(3), 1099–1110 (2011)
  • [2] Asnani, V., Yin, X., Hassner, T., Liu, X.: Malp: Manipulation localization using a proactive scheme. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12343–12352 (2023)
  • [3] Bird, J.J., Lotfi, A.: Cifake: Image classification and explainable identification of ai-generated synthetic images. IEEE Access (2024)
  • [4] Brock, A., Donahue, J., Simonyan, K.: Large scale gan training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096 (2018)
  • [5] Brooks, T., Peebles, B., Holmes, C., DePue, W., Guo, Y., Jing, L., Schnurr, D., Taylor, J., Luhman, T., Luhman, E., Ng, C., Wang, R., Ramesh, A.: Video generation models as world simulators (2024), https://openai.com/research/video-generation-models-as-world-simulators
  • [6] Chai, L., Bau, D., Lim, S.N., Isola, P.: What makes fake images detectable? understanding properties that generalize. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXVI 16. pp. 103–120. Springer (2020)
  • [7] Cozzolino, D., Poggi, G., Verdoliva, L.: Splicebuster: A new blind image splicing detector. In: 2015 IEEE International Workshop on Information Forensics and Security (WIFS). pp. 1–6. IEEE (2015)
  • [8] Crawford, K., Paglen, T.: Excavating ai: The politics of images in machine learning training sets. Ai & Society 36(4), 1105–1116 (2021)
  • [9] Dang, H., Liu, F., Stehouwer, J., Liu, X., Jain, A.K.: On the detection of digital face manipulation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern recognition. pp. 5781–5790 (2020)
  • [10] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE conference on computer vision and pattern recognition. pp. 248–255. IEEE (2009)
  • [11] Dhariwal, P., Nichol, A.: Diffusion models beat gans on image synthesis. Advances in neural information processing systems 34, 8780–8794 (2021)
  • [12] Dong, C., Chen, X., Hu, R., Cao, J., Li, X.: Mvss-net: Multi-view multi-scale supervised networks for image manipulation detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 45(3), 3539–3553 (2022)
  • [13] Dong, J., Wang, W., Tan, T.: Casia image tampering detection evaluation database. In: 2013 IEEE China summit and international conference on signal and information processing. pp. 422–426. IEEE (2013)
  • [14] Everingham, M., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A.: The pascal visual object classes (voc) challenge. International journal of computer vision 88, 303–338 (2010)
  • [15] Gao, S., Lin, Z., Xie, X., Zhou, P., Cheng, M.M., Yan, S.: Editanything: Empowering unparalleled flexibility in image editing and generation. In: Proceedings of the 31st ACM International Conference on Multimedia, Demo track (2023)
  • [16] Gu, S., Chen, D., Bao, J., Wen, F., Zhang, B., Chen, D., Yuan, L., Guo, B.: Vector quantized diffusion model for text-to-image synthesis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10696–10706 (2022)
  • [17] Guan, H., Kozak, M., Robertson, E., Lee, Y., Yates, A.N., Delgado, A., Zhou, D., Kheyrkhah, T., Smith, J., Fiscus, J.: Mfc datasets: Large-scale benchmark datasets for media forensic challenge evaluation. In: 2019 IEEE Winter Applications of Computer Vision Workshops (WACVW). pp. 63–72. IEEE (2019)
  • [18] Guarnera, L., Giudice, O., Guarnera, F., Ortis, A., Puglisi, G., Paratore, A., Bui, L.M., Fontani, M., Coccomini, D.A., Caldelli, R., et al.: The face deepfake detection challenge. Journal of Imaging 8(10), 263 (2022)
  • [19] Guillaro, F., Cozzolino, D., Sud, A., Dufour, N., Verdoliva, L.: Trufor: Leveraging all-round clues for trustworthy image forgery detection and localization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 20606–20615 (2023)
  • [20] Guo, X., Liu, X., Ren, Z., Grosz, S., Masi, I., Liu, X.: Hierarchical fine-grained image forgery detection and localization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3155–3165 (2023)
  • [21] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 770–778 (2016)
  • [22] Heller, S., Rossetto, L., Schuldt, H.: The ps-battles dataset-an image collection for image manipulation detection. arXiv preprint arXiv:1804.04866 (2018)
  • [23] Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. Advances in neural information processing systems 33, 6840–6851 (2020)
  • [24] Hsu, Y.F., Chang, S.F.: Detecting image splicing using geometry invariants and camera characteristics consistency. In: 2006 IEEE International Conference on Multimedia and Expo. pp. 549–552. IEEE (2006)
  • [25] Hu, X., Zhang, Z., Jiang, Z., Chaudhuri, S., Yang, Z., Nevatia, R.: Span: Spatial pyramid attention network for image manipulation localization. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXI 16. pp. 312–328. Springer (2020)
  • [26] Huang, Y., Juefei-Xu, F., Guo, Q., Liu, Y., Pu, G.: Fakelocator: Robust localization of gan-based face manipulations. IEEE Transactions on Information Forensics and Security 17, 2657–2672 (2022)
  • [27] Huh, M., Liu, A., Owens, A., Efros, A.A.: Fighting fake news: Image splice detection via learned self-consistency. In: Proceedings of the European conference on computer vision (ECCV). pp. 101–117 (2018)
  • [28] Ji, K., Chen, F., Guo, X., Xu, Y., Wang, J., Chen, J.: Uncertainty-guided learning for improving image manipulation detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 22456–22465 (2023)
  • [29] Jia, S., Huang, M., Zhou, Z., Ju, Y., Cai, J., Lyu, S.: Autosplice: A text-prompt manipulated image dataset for media forensics. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 893–903 (2023)
  • [30] Jiang, L., Li, R., Wu, W., Qian, C., Loy, C.C.: Deeperforensics-1.0: A large-scale dataset for real-world face forgery detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 2889–2898 (2020)
  • [31] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
  • [32] Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., Dollar, P., Girshick, R.: Segment anything. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 4015–4026 (October 2023)
  • [33] Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023)
  • [34] Kniaz, V.V., Knyaz, V., Remondino, F.: The point where reality meets fantasy: Mixed adversarial generators for image splice detection. Advances in neural information processing systems 32 (2019)
  • [35] Kong, C., Luo, A., Wang, S., Li, H., Rocha, A., Kot, A.C.: Pixel-inconsistency modeling for image manipulation localization. arXiv preprint arXiv:2310.00234 (2023)
  • [36] Kwon, M.J., Nam, S.H., Yu, I.J., Lee, H.K., Kim, C.: Learning jpeg compression artifacts for image manipulation detection and localization. International Journal of Computer Vision 130(8), 1875–1895 (2022)
  • [37] Lee-Thorp, J., Ainslie, J., Eckstein, I., Ontanon, S.: Fnet: Mixing tokens with fourier transforms. arXiv preprint arXiv:2105.03824 (2021)
  • [38] Li, D., Zhu, J., Wang, M., Liu, J., Fu, X., Zha, Z.J.: Edge-aware regional message passing controller for image forgery localization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 8222–8232 (2023)
  • [39] Li, L., Bao, J., Zhang, T., Yang, H., Chen, D., Wen, F., Guo, B.: Face x-ray for more general face forgery detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 5001–5010 (2020)
  • [40] Liu, X., Liu, Y., Chen, J., Liu, X.: Pscc-net: Progressive spatio-channel correlation network for image manipulation detection and localization. IEEE Transactions on Circuits and Systems for Video Technology 32(11), 7505–7517 (2022)
  • [41] Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
  • [42] Mahfoudi, G., Tajini, B., Retraint, F., Morain-Nicolier, F., Dugelay, J.L., Marc, P.: Defacto: Image and face manipulation dataset. In: 2019 27th European Signal Processing Conference (EUSIPCO). pp. 1–5. IEEE (2019)
  • [43] Ng, T.T., Chang, S.F., Sun, Q.: A data set of authentic and spliced image blocks. Columbia University, ADVENT Technical Report 4 (2004)
  • [44] Nichol, A., Dhariwal, P., Ramesh, A., Shyam, P., Mishkin, P., McGrew, B., Sutskever, I., Chen, M.: Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741 (2021)
  • [45] Novozamsky, A., Mahdian, B., Saic, S.: Imd2020: A large-scale annotated dataset tailored for detecting manipulated images. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision Workshops. pp. 71–80 (2020)
  • [46] O’Sullivan, D., Passantino, J.: ‘Verified’ Twitter accounts share fake image of ‘explosion’ near Pentagon, causing confusion (2023)
  • [47] Porter, T., Duff, T.: Compositing digital images. In: Proceedings of the 11th annual conference on Computer graphics and interactive techniques. pp. 253–259 (1984)
  • [48] Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss, C., Radford, A., Chen, M., Sutskever, I.: Zero-shot text-to-image generation. In: International Conference on Machine Learning. pp. 8821–8831. PMLR (2021)
  • [49] Rao, Y., Zhao, W., Zhu, Z., Lu, J., Zhou, J.: Global filter networks for image classification. Advances in neural information processing systems 34, 980–993 (2021)
  • [50] Rao, Y., Ni, J.: A deep learning approach to detection of splicing and copy-move forgeries in images. In: 2016 IEEE international workshop on information forensics and security (WIFS). pp. 1–6. IEEE (2016)
  • [51] Ren, T., Liu, S., Zeng, A., Lin, J., Li, K., Cao, H., Chen, J., Huang, X., Chen, Y., Yan, F., Zeng, Z., Zhang, H., Li, F., Yang, J., Li, H., Jiang, Q., Zhang, L.: Grounded sam: Assembling open-world models for diverse visual tasks (2024)
  • [52] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10684–10695 (2022)
  • [53] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021)
  • [54] Rossler, A., Cozzolino, D., Verdoliva, L., Riess, C., Thies, J., Nießner, M.: Faceforensics++: Learning to detect manipulated facial images. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 1–11 (2019)
  • [55] Shi, J., Malik, J.: Normalized cuts and image segmentation. IEEE Transactions on pattern analysis and machine intelligence 22(8), 888–905 (2000)
  • [56] Verdoliva, L., Cozzolino, D., Nagano, K.: 2022 ieee image and video processing cup synthetic image detection
  • [57] Wang, J., Sun, K., Cheng, T., Jiang, B., Deng, C., Zhao, Y., Liu, D., Mu, Y., Tan, M., Wang, X., et al.: Deep high-resolution representation learning for visual recognition. IEEE transactions on pattern analysis and machine intelligence 43(10), 3349–3364 (2020)
  • [58] Wang, J., Wu, Z., Chen, J., Han, X., Shrivastava, A., Lim, S.N., Jiang, Y.G.: Objectformer for image manipulation detection and localization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2364–2373 (2022)
  • [59] Wang, Y., Yu, J., Zhang, J.: Zero-shot image restoration using denoising diffusion null-space model. arXiv preprint arXiv:2212.00490 (2022)
  • [60] Wen, B., Zhu, Y., Subramanian, R., Ng, T.T., Shen, X., Winkler, S.: Coverage—a novel database for copy-move forgery detection. In: 2016 IEEE international conference on image processing (ICIP). pp. 161–165. IEEE (2016)
  • [61] Wu, H., Zhou, J., Tian, J., Liu, J.: Robust image forgery detection over online social network shared images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13440–13449 (2022)
  • [62] Wu, Y., AbdAlmageed, W., Natarajan, P.: Mantra-net: Manipulation tracing network for detection and localization of image forgeries with anomalous features. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 9543–9552 (2019)
  • [63] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. Advances in Neural Information Processing Systems 34, 12077–12090 (2021)
  • [64] Ying, Q., Zhou, H., Qian, Z., Li, S., Zhang, X.: Learning to immunize images for tamper localization and self-recovery. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023)
  • [65] Yu, T., Feng, R., Feng, R., Liu, J., Jin, X., Zeng, W., Chen, Z.: Inpaint anything: Segment anything meets image inpainting. arXiv preprint arXiv:2304.06790 (2023)
  • [66] Zampoglou, M., Papadopoulos, S., Kompatsiaris, Y.: Detecting image splicing in the wild (web). In: 2015 IEEE International Conference on Multimedia & Expo Workshops (ICMEW). pp. 1–6. IEEE (2015)
  • [67] Zhang, D., Huang, F., Liu, S., Wang, X., Jin, Z.: Swinfir: Revisiting the swinir with fast fourier convolution and improved training for image super-resolution. arXiv preprint arXiv:2208.11247 (2022)
  • [68] Zhang, J., Liu, H., Yang, K., Hu, X., Liu, R., Stiefelhagen, R.: Cmx: Cross-modal fusion for rgb-x semantic segmentation with transformers. IEEE Transactions on Intelligent Transportation Systems (2023)
  • [69] Zhang, K., Zuo, W., Chen, Y., Meng, D., Zhang, L.: Beyond a gaussian denoiser: Residual learning of deep cnn for image denoising. IEEE transactions on image processing 26(7), 3142–3155 (2017)
  • [70] Zhou, P., Han, X., Morariu, V.I., Davis, L.S.: Learning rich features for image manipulation detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 1053–1061 (2018)
  • [71] Zhu, M., Chen, H., Yan, Q., Huang, X., Lin, G., Li, W., Tu, Z., Hu, H., Hu, J., Wang, Y.: Genimage: A million-scale benchmark for detecting ai-generated image. arXiv preprint arXiv:2306.08571 (2023)

Supplementary Document

In this supplementary document, we provide further details of our work: (1) details of the proposed GIM dataset and a comparison with related work (Sec. 6); (2) details of the architecture and implementation (Sec. 7); (3) additional ablation analysis of GIMFormer (Sec. 8); (4) more visualizations on the GIM dataset (Sec. 9); (5) the societal impact of our work (Sec. 10); (6) the ethics statement of our work (Sec. 11).

6 GIM Dataset

6.1 GIM Dataset Configuration

The specific quantities and details of the dataset are presented in Table 9. GIM contains a total of 1,140k generatively manipulated images with their corresponding original images, and it can be split into subsets. GIM-SD-all is built from the training set of ImageNet-1k [10] using the Stable Diffusion [53] generator and is dedicated to manipulation research. Based on an analysis of the data scale, GIM-SD, GIM-GLIDE, and GIM-DDNM each select a random set of 100 classes from the ImageNet-1k training set, together with the entire test data, to generate their respective training and test sets for benchmarking image manipulation detection and localization (IMDL) methods. GIM-SD (VOC) is a cross-data-distribution test set constructed from the test set of the PASCAL VOC [14] dataset using Stable Diffusion.

Table 8: Parameters for GIM degradation. Three degradation methods are employed with variable parameters to emulate real-world scenarios.

| Post-processing | Parameter |
|---|---|
| JPEG compression | 75, 80, 90 |
| Gaussian blur | 3, 5 |
| Downsampling | 0.5, 0.67 |

Table 9: Basic configuration of the GIM dataset. GIM has five subsets, leveraging various generators and data sources.

| GIM | Tampering Type | Number of Trainset | Number of Testset |
|---|---|---|---|
| GIM-SD-all | reprint | 1890K | - |
| GIM-SD | reprint | 180K | 70K |
| GIM-GLIDE | reprint | 140K | 70K |
| GIM-DDNM | removal | 125K | 50K |
| GIM-SD(VOC) | reprint | - | 4.5K |

To simulate real-world scenarios and comprehensively evaluate IMDL methods, GIM incorporates three degradation methods (JPEG compression, Gaussian blur, and downsampling). Table 8 presents the parameters for each degradation group, which are randomly chosen.
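
The random choice of degradation parameters can be sketched in Python as follows. This is a hypothetical illustration rather than the dataset's actual generation script: the dictionary keys and the function name are invented here, the values are copied from Table 8, and the sketch assumes that one degradation method is drawn per image. Reading the values as JPEG quality factors, blur kernel sizes, and downsampling scale factors is likewise an assumption.

```python
import random

# Parameter groups copied from Table 8 (interpretation of the units is assumed).
DEGRADATIONS = {
    "jpeg_compression": [75, 80, 90],
    "gaussian_blur": [3, 5],
    "downsampling": [0.5, 0.67],
}


def sample_degradation(rng):
    """Draw one degradation method and one of its parameters at random."""
    method = rng.choice(sorted(DEGRADATIONS))
    return method, rng.choice(DEGRADATIONS[method])


# A seeded generator keeps the sampled configuration reproducible.
method, parameter = sample_degradation(random.Random(0))
```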

The GIM data format consists of JPG images paired with PNG labels. Filenames ending with "_f" indicate tampered images. The masks segmented by SAM [33] serve as ground truth for manipulation detection, containing two labels denoted by 0 and 255, representing the original and manipulated categories, respectively.
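
A small Python sketch of how this naming and label convention might be consumed is given below. The helper names and the example filenames are hypothetical, and the sketch assumes that the "_f" suffix is checked on the filename without its extension.

```python
from pathlib import PurePath


def is_tampered(filename):
    """True when the filename (without extension) ends with '_f'."""
    return PurePath(filename).stem.endswith("_f")


def binarize_mask(mask):
    """Map PNG label values to classes: 0 stays original (0), 255 becomes manipulated (1)."""
    return [[1 if value == 255 else 0 for value in row] for row in mask]


# Hypothetical filenames, used only for illustration.
flags = [is_tampered("example_0001_f.jpg"), is_tampered("example_0001.jpg")]
labels = binarize_mask([[0, 255], [255, 0]])
```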

6.2 Dataset Comparison

GIM focuses on generative local manipulation in natural images and provides manipulated region labels, specifically for image manipulation detection and localization tasks. This distinguishes GIM from existing datasets that focus on faces or are limited to detection (classification). Table 10 compares several local image manipulation datasets. Existing image forensic datasets are primarily constructed using traditional manipulation techniques and commonly employ manual or random manipulation to generate data. The AutoSplice [29], HiFi-IFDL [20], and CoCoGlide [19] datasets utilize generative algorithms for image manipulation. However, they rely on one type of manipulation and suffer from a small data volume.

In contrast, GIM manipulates images with generative methods for localized image alteration and encompasses a variety of generators and manipulation methods. It also stands as a large-scale, comprehensive dataset for the IMDL community.

Table 10: Summary of previous image manipulation datasets and GIM. We showcase the number of data entries within each dataset and the manipulation techniques they encompass. * denotes that HiFi-IFDL includes entirely synthesized images.

| Dataset | # Authentic Images | # Forged Images | Handcraft: Splicing | Handcraft: Copy-move | Handcraft: Removal | AIGC: Removal | AIGC: Generation |
|---|---|---|---|---|---|---|---|
| Columbia Gray [43] | 933 | 912 | | | | | |
| Columbia Color [24] | 183 | 180 | | | | | |
| MICC-F2000 [1] | 1,300 | 700 | | | | | |
| VIPP Synth. [1] | 4,800 | 4,800 | | | | | |
| CASIA V1.0 [13] | 800 | 921 | | | | | |
| CASIA V2.0 [13] | 7,200 | 5,123 | | | | | |
| Wild Web [66] | 90 | 9,657 | | | | | |
| NC2016 [17] | 560 | 564 | | | | | |
| NC2017 [17] | 2,667 | 1,410 | | | | | |
| MFC2018 [17] | 14,156 | 3,265 | | | | | |
| MFC2019 [17] | 10,279 | 5,750 | | | | | |
| PS-Battles [22] | 11,142 | 102,028 | | | | | |
| DEFACTO [42] | 0 | 229,000 | | | | | |
| IMD2020 [45] | 35,000 | 35,000 | | | | | |
| SP COCO [36] | 0 | 200,000 | | | | | |
| CM COCO [36] | 0 | 200,000 | | | | | |
| CM RAISE [36] | 0 | 200,000 | | | | | |
| HiFi-IFDL [20] | - | 1,000,00* | | | | | |
| CoCoGlide [19] | 0 | 512 | | | | | |
| AutoSplice [29] | 3,621 | 2,273 | | | | | |
| GIM | 1,140,000 | 1,140,000 | | | | | |

7 Implementation Details

7.1 Architecture

The GIMFormer backbone has an encoder-decoder architecture. The encoder is a dual-branch, four-stage encoder. For an input RGB image, ShadowTracer extracts a learned manipulation trace map at the same resolution as the image. ShadowTracer adopts the architecture of [69] with 15 trainable layers, 3 input channels, and 1 output channel. Both the RGB image and the trace map are then fed into the parallel network, whose four-stage structure extracts pyramidal features $F_{i}$ ($i \in [1, 4]$). At each stage, the RGB branch is first processed by the Frequency-Spatial Block (FSB); both branches are then processed by Transformer blocks [63] and the Multi-window Anomalous Modelling (MWAM) module. The learnable parameters within the FSB have the same resolution as the input features. The window sizes within the MWAM at each stage are as follows: {3, 7, 9}, {7, 11, 15}, {9, 17, 25}, and {11, 21, 31}. The Transformer blocks are based on the Mix Transformer encoder B2 (MiT-B2) proposed for semantic segmentation and are pretrained on ImageNet. The Mix Transformer encoder uses self-attention and channel-wise operations, prioritizing spatial convolutions over positional encodings. To integrate information from the two branches, the cross-modal Feature Rectification Module (FRM) and Feature Fusion Module (FFM) [68] are utilized.

For localization, we employ the All-MLP decoder proposed in [63], a lightweight architecture formed only of 1×1 convolution layers and bilinear up-samplers. For detection, we adopt the tail-end section of the lightweight backbone in [57]. It applies convolutional, batch normalization, and activation layers to extract features from its input. The extracted features are then pooled and passed through fully connected layers to generate the classification predictions.
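To make the four-stage pyramid concrete, the sketch below computes the feature-map shapes that a MiT-B2 style encoder would produce for one 512×512 input. The strides and channel widths are assumptions taken from the standard SegFormer MiT-B2 configuration; this section does not state them, and the function name is hypothetical.

```python
# Assumed MiT-B2 settings (standard SegFormer configuration, not stated
# in this section): per-stage downsampling strides and channel widths.
STRIDES = (4, 8, 16, 32)
CHANNELS = (64, 128, 320, 512)


def pyramid_shapes(height: int, width: int):
    """Return the (channels, height, width) shape of F_1..F_4."""
    return [(c, height // s, width // s) for c, s in zip(CHANNELS, STRIDES)]


for i, shape in enumerate(pyramid_shapes(512, 512), start=1):
    print(f"F_{i}: {shape}")
```

Under these assumptions, both branches would carry features at the same four resolutions, which is what allows a stage-wise fusion of the RGB and trace-map branches.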

7.2 Training Process

We conduct our experiments with PyTorch 1.7.0. All models are trained on a node with 8 or 4 V100 GPUs.

For training the ShadowTracer network $g_{\phi}$, we randomly select 20,000 authentic images from ImageNet and generate the corresponding manipulated images. For comprehensive performance verification, all three generators (Stable Diffusion [52], GLIDE [44], DDNM [59]) are used, whereas for generalization verification, only Stable Diffusion is employed. Throughout training, the alpha blending parameter ranges between 0.5 and 1.0, randomly producing blended images that are subjected to three degradation types (JPEG compression, downsampling, Gaussian blur). The network is trained on 40×40 pixel patches randomly sampled from the dataset. Training runs for roughly 300,000 iterations with a batch size of 64. An Adam [31] optimizer is employed with an initial learning rate of 0.001.
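The blending and patch sampling steps can be sketched as below. This is only an illustration under stated assumptions: the blend is taken to be `alpha * manipulated + (1 - alpha) * authentic`, alpha is drawn uniformly from [0.5, 1.0], and patches are plain nested lists of pixel values. The text fixes only the alpha range and the 40×40 patch size; the function names are hypothetical.

```python
import random


def sample_alpha(rng: random.Random, low: float = 0.5, high: float = 1.0) -> float:
    """Draw a blending weight from the range given in the text."""
    return rng.uniform(low, high)


def alpha_blend(authentic, manipulated, alpha: float):
    """Blend two equally sized patches (nested lists of pixel values).

    Assumption: alpha weights the manipulated image, so alpha = 1.0
    returns the manipulated patch unchanged.
    """
    return [
        [alpha * m + (1.0 - alpha) * a for a, m in zip(row_a, row_m)]
        for row_a, row_m in zip(authentic, manipulated)
    ]


def random_patch_origin(rng: random.Random, height: int, width: int, patch: int = 40):
    """Top-left corner of a random patch x patch crop inside the image."""
    top = rng.randint(0, height - patch)
    left = rng.randint(0, width - patch)
    return top, left
```

With alpha restricted to [0.5, 1.0], the manipulated content always contributes at least half of each blended pixel under this assumed formula.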

For training the GIMFormer backbone network, the input image is cropped to 512×512. We train our models for 40 epochs on the GIM dataset with a batch size of 4 per GPU. The images are augmented by random resizing with a ratio of 0.5–2.0 and random horizontal flipping.
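The augmentation settings above can be sketched as a parameter sampler. This is a hedged illustration: the uniform sampling of the resize ratio, the flip probability of 0.5, and the handling of images smaller than the crop are assumptions, and the function names are hypothetical. Only the ratio range, the 512×512 crop, and the per-GPU batch size come from the text.

```python
import random

CROP = 512          # crop size stated in the text
PER_GPU_BATCH = 4   # per-GPU batch size stated in the text


def sample_augmentation(rng: random.Random, height: int, width: int,
                        ratio_range=(0.5, 2.0)):
    """Sample resize, crop, and flip parameters for one training image."""
    ratio = rng.uniform(*ratio_range)
    new_h, new_w = round(height * ratio), round(width * ratio)
    # Assumption: images smaller than the crop are padded elsewhere,
    # so the crop offset is clamped to 0 in that case.
    top = rng.randint(0, max(new_h - CROP, 0))
    left = rng.randint(0, max(new_w - CROP, 0))
    flip = rng.random() < 0.5  # assumed flip probability
    return {"size": (new_h, new_w), "crop_origin": (top, left), "flip": flip}


def global_batch_size(num_gpus: int) -> int:
    """Total batch size across GPUs for data-parallel training."""
    return PER_GPU_BATCH * num_gpus
```

With the 8- or 4-GPU nodes mentioned above, this gives a global batch size of 32 or 16, respectively.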

8 Additional Ablation Analysis

The robustness of GIMFormer. To evaluate the robustness of GIMFormer, we sample 5000 images from the original three generator test sets without degradations. The model is trained on the mixed-generator dataset. As the results in Table 11 show, GIMFormer is robust against various distortion techniques: the localization F1 drops by at most 3.0 points (from 61.7 without distortion to 58.7 under blur), and by less than 1 point under downsampling.

Table 11: Robustness experiment of pixel-level Manipulation Localization F1 (%) with various distortions
Method | No Distortion | Cmp (q=90) | Cmp (q=80) | Cmp (q=75) | Blur (k=3) | Blur (k=5) | Downsample (0.66X) | Downsample (0.5X)
GIMFormer | 61.7 | 60.3 | 60.1 | 59.9 | 58.7 | 58.7 | 61.0 | 60.8
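For reference, the sketch below shows one common way to compute a pixel-level F1 score between a predicted and a ground-truth binary mask, followed by the F1 drop of each Table 11 setting relative to the undistorted case. How scores are aggregated over images, and the score assigned when both masks are empty, are assumptions; the paper's exact evaluation protocol is not given in this section.

```python
def pixel_f1(pred, target) -> float:
    """Pixel-level F1 between two binary masks (nested lists of 0/1)."""
    tp = fp = fn = 0
    for row_p, row_t in zip(pred, target):
        for p, t in zip(row_p, row_t):
            if p and t:
                tp += 1
            elif p and not t:
                fp += 1
            elif t and not p:
                fn += 1
    denom = 2 * tp + fp + fn
    # Assumption: two empty masks count as a perfect match.
    return 2 * tp / denom if denom else 1.0


# F1 (%) values copied from Table 11.
TABLE_11_F1 = {
    "No distortion": 61.7,
    "Cmp (q=90)": 60.3,
    "Cmp (q=80)": 60.1,
    "Cmp (q=75)": 59.9,
    "Blur (k=3)": 58.7,
    "Blur (k=5)": 58.7,
    "Downsample (0.66X)": 61.0,
    "Downsample (0.5X)": 60.8,
}

baseline = TABLE_11_F1["No distortion"]
drops = {name: round(baseline - f1, 1) for name, f1 in TABLE_11_F1.items()}
```

The `drops` dictionary makes the comparison explicit: the two blur settings account for the largest degradation, while downsampling changes F1 by less than one point.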

The effect of the number of windows (N). We conduct a set of ablation experiments to study the performance of the MWAM module. To ensure fair comparisons, the experiments differ from each other only in the window setting. Experiments are carried out on the GIM-SD training and test sets. As shown in Table 12, the tampering localization performance tends to improve as the number of windows increases, while the impact of window size variations remains relatively minor. For the sake of efficiency, we do not analyze more windows. The results indicate that a favorable balance between accuracy and efficiency is achieved when $N = 3$.

Table 12: Ablation results on the GIM-SD test dataset using different variants of GIMFormer. All detection Cls.Acc (%) and localization AUC (%) and F1 (%) scores are reported. $D_{a/m}^{k}$ represents the number and dimensions of windows utilized in either the average or maximum branches. We investigate the window counts $N$ of 0, 1, 2, 3 as well as the impact of only one branch on the module. In this table, the window sizes of the first layer are used to annotate, with subsequent layers decreasing in size.
MWAM Variants | Cls.Acc | F1 | AUC
w/o windows | 72.13 | 58.12 | 88.81
{$D_{a\&k}^{11}$} | 74.33 | 58.34 | 89.65
{$D_{a\&k}^{11}, D_{a\&k}^{21}$} | 76.61 | 60.03 | 89.92
{$D_{a\&k}^{11}, D_{a\&k}^{21}, D_{a\&k}^{31}$} | 78.96 | 61.75 | 90.61
{$D_{a\&k}^{17}, D_{a\&k}^{29}, D_{a\&k}^{41}$} | 77.33 | 61.17 | 90.11
{$D_{a}^{11}, D_{a}^{21}, D_{a}^{31}$} | 77.95 | 60.77 | 90.03
{$D_{m}^{11}, D_{m}^{21}, D_{m}^{31}$} | 77.83 | 59.13 | 89.53
{$D_{a\&k}^{11}, D_{a\&k}^{21}, D_{a\&k}^{31}, D_{a\&k}^{41}$} | 78.11 | 61.15 | 90.71
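To give an intuition for what "windows" in average and maximum branches can mean, the sketch below computes local average and maximum statistics over several window sizes on a small 2D grid, and measures how far each cell deviates from its local average. This is not the authors' MWAM implementation, whose internals are not described in this section; the function names, the border clipping, and the deviation measure are all assumptions made for illustration.

```python
def window_stats(grid, k):
    """Local average and maximum over a k x k window around each cell.

    Windows are clipped at the border instead of padded (an assumption).
    """
    h, w = len(grid), len(grid[0])
    r = k // 2
    avg = [[0.0] * w for _ in range(h)]
    mx = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            values = [
                grid[j][i]
                for j in range(max(0, y - r), min(h, y + r + 1))
                for i in range(max(0, x - r), min(w, x + r + 1))
            ]
            avg[y][x] = sum(values) / len(values)
            mx[y][x] = max(values)
    return avg, mx


def multi_window_deviation(grid, window_sizes=(11, 21, 31)):
    """For each window size, the distance of each cell from its local average."""
    out = {}
    for k in window_sizes:
        avg, _ = window_stats(grid, k)
        out[k] = [
            [abs(v - a) for v, a in zip(row_v, row_a)]
            for row_v, row_a in zip(grid, avg)
        ]
    return out
```

In this toy setting, a cell that differs strongly from its surroundings yields a large deviation at every window size, while smooth regions stay near zero. Using several window sizes lets such a comparison respond to anomalies of different spatial extent.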

9 More Visualizations on GIM

In this section, we present additional visualizations of the GIM dataset in Figures 7, 8, and 9, showing samples from GIM-SD, GIM-GLIDE, and GIM-DDNM, respectively.

10 Societal Impact

Our research has a positive societal impact by addressing the challenge of detecting and localizing generative manipulations. We introduce a reliable database, GIM, aimed at enhancing the security of AI-generated content (AIGC). Training Image Manipulation Detection and Localization (IMDL) methods on GIM allows them to generalize across various scenarios. Consequently, our algorithm GIMFormer helps increase public trust in media content. GIM and GIMFormer are beneficial for digital media forensics, especially for generative manipulation with real-world degradations.

11 Ethics Statement

Our GIM is based on ImageNet and VOC. No additional personally identifiable information or sensitive personally identifiable information is introduced during the production of fake images in the GIM dataset. During the dataset production, we do not introduce extra information containing exacerbated bias against people of a certain gender, race, sexuality, or who have other protected characteristics. The ethical issues in the ImageNet and VOC datasets have been discussed in previous works. Crawford et al.  [8] discuss issues with ImageNet. The first issue is the political nature of all taxonomies or classification systems, where terms like "male" and "female" are considered "natural," while "hermaphrodite" is offensively placed within the branch of Person > Sensualist > Bisexual alongside "pseudohermaphrodite" and "switch hitter" categories. The second issue concerns offensive images of real people, while the third is the use of people’s photos without their consent by ImageNet creators.

Figure 7: Additional visualizations of GIM-SD. From left to right, the sequence is as follows: original image, tampered area, generative manipulated image, original image, tampered area, generative manipulated image.
Figure 8: Additional visualizations of GIM-GLIDE. From left to right, the sequence is as follows: original image, tampered area, generative manipulated image, original image, tampered area, generative manipulated image.
Figure 9: Additional visualizations of GIM-DDNM. From left to right, the sequence is as follows: original image, tampered area, generative manipulated image, original image, tampered area, generative manipulated image.