1 Shanghai Jiao Tong University 2 Tsinghua Shenzhen International Graduate School 3 Huawei Noah’s Ark Lab
Email: {chenyirui,weiliucv}@sjtu.edu.cn [email protected] {huangxudong9,wei.lee,hujie23}@huawei.com

GIM: A Million-scale Benchmark for Generative Image Manipulation Detection and Localization

Yirui Chen (1,3,*), Xudong Huang (3,*), Quan Zhang (2,3,*), Wei Li (3,‡), Mingjian Zhu (3), Qiangyu Yan (3), Simiao Li (3), Hanting Chen (3), Hailin Hu (3), Jie Yang (1), Wei Liu (1,†), Jie Hu (3,†)

Abstract

The extraordinary ability of generative models has made them a new trend in image editing and realistic image generation, posing a serious threat to the trustworthiness of multimedia data and driving research on image manipulation detection and location (IMDL). However, the lack of a large-scale data foundation hinders progress on the IMDL task. In this paper, we design a local manipulation pipeline that incorporates the powerful SAM, ChatGPT and generative models. On this basis, we propose the GIM dataset, which has the following advantages: 1) Large scale, including over one million pairs of AI-manipulated images and real images. 2) Rich image content, encompassing a broad range of image classes. 3) Diverse generative manipulation, with images manipulated by state-of-the-art generators across various manipulation tasks. These advantages allow for a more comprehensive evaluation of IMDL methods, extending their applicability to diverse images. We introduce two benchmark settings to evaluate the generalization capability and comprehensive performance of baseline methods. In addition, we propose a novel IMDL framework, termed GIMFormer, which consists of a ShadowTracer, a Frequency-Spatial Block (FSB), and a Multi-window Anomalous Modelling (MWAM) Module. Extensive experiments on GIM demonstrate that GIMFormer significantly surpasses previous state-of-the-art works on two different benchmarks. Code page: GIM.

Keywords: Dataset and Benchmark · Image Manipulation Detection and Location · Forgery Traces Extraction · Multi-scale Feature Learning

* These authors contributed equally to this research, which was done during Yirui Chen and Quan Zhang’s internship at Huawei Noah’s Ark Lab. ‡ Project lead. † Corresponding authors.

1 Introduction

Figure 1: Example images from the GIM dataset. Our dataset includes images manipulated by three state-of-the-art generators: Stable-Diffusion, GLIDE, and DDNM. The first column shows authentic images, the second column presents forgery masks, and the third column displays forged images.

Images are one of the most essential media for information transmission in modern society and are widely spread on public platforms such as news and social media. With the rapid advancement of generative methods [11, 52, 16, 4], such natural information can be more easily tampered with for specific purposes, for example by altering an object or a person. Images of this kind are particularly convincing because they are visually plausible, which leads to serious information security risks in society. For instance, an AI-generated image of black smoke billowing from what appeared to be a government building near the Pentagon set off investor fears, sending stocks tumbling [46]. Therefore, it is urgent to develop effective methods that examine whether an image has been modified by generative models and identify the exact location of the tampering. However, traditional image manipulation detection and location (IMDL) datasets [55, 13, 60] overlook powerful generative models, and generative IMDL datasets are limited in scale; they are therefore insufficient to comprehensively evaluate IMDL methods and benefit the IMDL community.

To this end, we propose a million-scale generative IMDL dataset, termed GIM, to provide a reliable database for AI Generated Content (AIGC) security. GIM leverages diffusion models [44, 52, 59] and SAM [32, 51], with ImageNet [10] and VOC [14] as the input. SAM locates the tampering region, and diffusion models paint plausible content into it. The high quality of the diffusion models ensures dataset fidelity, and the rich categories guarantee the diversity of images in GIM. GIM contains over one million generatively tampered images in total. Based on this, IMDL methods are evaluated and benchmarked. To determine an appropriate benchmark scale, we explore the impact of different amounts of manipulated training data. The final benchmark contains about 300k manipulated images with their tampering masks. To simulate real-world conditions, we investigate the effect of degradation and apply random degradation to the images. Two benchmark settings are designed to evaluate comprehensive performance and generalization. In addition, we reproduce existing IMDL methods in a fair manner, providing a basis for future development. Overall, as a generative IMDL benchmark, GIM possesses the following advantages: 1) GIM has a large and reasonable data scale, including rich image categories and contents. 2) GIM contains various generative manipulation models and tasks. 3) GIM provides two settings for verifying the generalization capability and comprehensive performance of existing methods.

Existing methods emphasize traditional image manipulations, also known as “cheapfakes”. Generative manipulations, in contrast, introduce substantial alterations in content with no apparent frequency or structural inconsistency. To address this issue, we introduce GIMFormer, a transformer-based framework for generative IMDL. The ShadowTracer is designed to embed the nuanced artifacts inherent in generative tampering and serves as prior information. The Frequency-Spatial Block harvests manipulation clues in the frequency and spatial domains. Furthermore, the Multi-Windowed Anomalous Modelling Module captures local inconsistencies at different scales to refine the features. In this way, our model extracts features from both the RGB image and the learned tampering trace map, captures details from the frequency and spatial domains, and models inconsistencies at different scales for precise manipulation detection and location.

We conduct experiments on our proposed GIM. Both the qualitative and quantitative results demonstrate that GIMFormer can significantly outperform previous state-of-the-art methods in terms of both comprehensive capability and generalization ability.

In summary, our main contributions are as follows:

• We build a local manipulation pipeline and construct a comprehensive large-scale dataset with various SOTA generative models and manipulation tasks to further facilitate research on the IMDL task.
• We investigate the impact of data scale and degradation on IMDL tasks to select an appropriate scale and simulate real-world scenarios. Based on this, we construct a reasonable benchmark and two settings to evaluate the generalization and comprehensive capabilities of IMDL methods.
• We propose GIMFormer, a generative IMDL framework consisting of ShadowTracer to embed subtle traces, Frequency-Spatial Block to extract forgery features in the frequency and spatial domains, and Multi-Windowed Anomalous Modelling Module to capture pixel-level inconsistencies at multiple scales.
• Extensive experiments show that GIMFormer achieves state-of-the-art IMDL performance in terms of generalization and comprehensive capabilities.

2 Related Work

2.1 Image Forensic Datasets

The research community has dedicated significant efforts to establishing robust datasets for image forensics. Early datasets primarily focus on one type of manipulation. The Columbia dataset [43, 24] is the pioneering dataset focusing on splicing forgeries. CASIA v1.0 and CASIA v2.0 [13] are the first to incorporate multiple types of manipulations within a single dataset, with forged images manually crafted using Adobe Photoshop. The Wild Web [66] dataset collects forged images from the Internet, significantly surpassing the scale of previous datasets. The NIST [17] datasets provide an extensive collection that serves as a crucial standard for assessing media tampering detection methods. DEFACTO [42] is automatically generated from the Microsoft COCO dataset, encompassing four categories of forgeries: Copy and Move, Splicing, Object Removal, and Morphing. IMD2020 [45] provides a range of locally manipulated images generated through manual operations or random slicing, while also collecting online images without apparent manipulation traces. Recently, HiFi-IFDL [20] constructs a hierarchical fine-grained dataset containing some representative forgery methods. AutoSplice [29] leverages the capabilities of large-scale language-image models like DALL-E2 [48] to facilitate automatic image editing and generation. CocoGlide [19] contains 512 images generated using the GLIDE diffusion model. Some datasets focus exclusively on facial manipulations [54, 18, 9] or entirely synthesized images [71, 56, 3].

However, these datasets have certain limitations, such as small data sizes and limited manipulation techniques or tasks. Recent advances in generative models [44, 23, 52] have demonstrated remarkable abilities in generative-based manipulation. These developments have led to the emergence of open-source projects [51, 65, 15] for manipulation. Leveraging the power of these models, we introduce GIM, a large-scale generative-based manipulation dataset.

2.2 Image Manipulation Detection and Localization

Early studies on natural image manipulation localization mainly focus on detecting a specific type of manipulation [7, 27, 34, 50]. Because the exact manipulation type is unknown in real-world scenarios, most state-of-the-art methods [40, 19, 58, 64, 38, 28, 61] concentrate on general manipulation. RGB-N [70] proposes a dual-stream network for manipulation detection: one stream extracts RGB features and the other leverages noise features to model the disparities. MVSS-Net [12] utilizes a two-stream CNN to extract noise features and employs the Dual Attention Module to merge the output of the two-stream CNN. PSCC-Net [40] extracts hierarchical features with a top-down path and detects whether the input image has been manipulated using a bottom-up path. ObjectFormer [58] uses object prototypes to model object-level consistencies and finds patch-level inconsistencies to detect the manipulation. However, these methods focus on “cheapfake” detection and encounter challenges when applied to generative manipulation, because the tampering traces are subtle and the methods lack an understanding of generative manipulation patterns. Certain works [26, 6, 9, 39, 2] focus on localizing manipulations produced by generative models, yet these methods primarily concentrate on human faces.

In this work, we leverage a deep network to embed the subtle artifacts inherent in generative tampering as a prior trace map. We then employ a dual network to fuse the trace map and the RGB image, combining frequency and spatial information and capturing pixel-level inconsistencies at multiple scales to detect and locate the manipulation.

Table 1: Summary of previous image manipulation datasets and GIM. We showcase the number of data entries within each dataset, the image content, image sizes and the manipulation techniques they encompass. $\star$ denotes that HiFi-IFDL is composed of multiple datasets, including entirely synthesized images, traditionally manipulated images and fake face images.

Dataset	Image Content	Image Size	# Authentic Images	# Forged Images	Traditional	Gen. Reprint	Gen. Removal
FaceForensics++ [54]	Face	480p-1080p	1000	4000	✗	✓	✗
DFDC [9]	Face	240p-2160p	19,154	100,000	✗	✓	✗
DeeperForensics [30]	Face	1080p	50,000	10,000	✗	✓	✗
Columbia Gray [43]	General	128 $\times$ 128	933	912	✓	✗	✗
CASIA V2.0 [13]	General	320 $\times$ 240-800 $\times$ 600	7,200	5,123	✓	✗	✗
IMD2020 [45]	General	1062 $\times$ 866	35,000	35,000	✓	✗	✗
Coverage [60]	General	400 $\times$ 486	100	100	✓	✗	✗
AutoSplice [29]	General	256 $\times$ 256 - 4232 $\times$ 4232	3,621	2,273	✗	✓	✗
HiFi-IFDL[20]	General	-	-	1,000,000 $\star$	✓	✓	✗
CocoGlide [19]	General	256 $\times$ 256	-	512	✗	✓	✗
GIM	General	64 $\times$ 64-6000 $\times$ 3904	1,140,000	1,140,000	✗	✓	✓

3 Dataset and Benchmark Construction

In Section 3.1, we propose an automatic data synthesis pipeline that efficiently produces generatively manipulated images from unlabeled images. Leveraging this pipeline, we construct a comprehensive large-scale dataset. To build a benchmark of reasonable scale for evaluating IMDL methods, we examine the impact of training data scale in Section 3.2. To emulate real-world conditions and establish an objective benchmark, the manipulated images and the original authentic images are subjected to three random degradations (JPEG compression, downsampling, and Gaussian blur), as detailed in Section 3.3. The GIM benchmark comprises over 320k manipulated images paired with authentic images (100 distinct labels in ImageNet for each generative model) for algorithm evaluation. The generated images are shown in Figure 1 with their original images and tampering masks. The specific composition of GIM can be found in the Supplementary Materials. In Section 3.4, we introduce the criteria and settings used to evaluate the performance of IMDL methods.

3.1 Generation and Details of the Dataset

To construct a generative IMDL dataset, a large-scale natural image dataset is needed as the source. We choose ImageNet [10] and VOC [14] as the starting point of our research because they are 1) large and diverse, and 2) wide in category coverage. We therefore use these datasets as the source database for generative manipulation. In addition to covering various scenarios, the proposed dataset uses a wealth of tampering methods. Specifically, we reprint the target class with generative inpainting methods [52, 44] or remove the target region using the model of [59]. Benefiting from open-source projects [33, 51], we develop our data generation pipeline.

Figure 2 illustrates the overall process. First, given the class label or a user query, the local manipulation mask is extracted by the zero-shot segmentation network [33]. For reprint tampering, the image category is embedded in the replacement prompt and sent to ChatGPT, which returns an approximate category. The approximate category is then embedded into the inpainting prompt. Given the original image, the manipulation mask, and the inpainting prompt, the generative models produce the reprint-tampered result. For removal tampering, the generative model requires only the original image and the manipulation mask. The GIM dataset utilizes the entire ImageNet dataset and the Stable-Diffusion generator to construct a comprehensive dataset, providing a reliable database and laying the foundation for the research below.
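The control flow of the pipeline described above can be sketched as follows. This is only an illustration: `segment`, `ask_similar_category`, `inpaint`, and `remove` are hypothetical callables standing in for the segmentation network, the ChatGPT query, and the generative models, and the prompt wording is an assumption rather than the prompt used to build GIM.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class ManipulationSample:
    original: Any
    mask: Any
    forged: Any
    task: str                     # "reprint" or "removal"
    prompt: Optional[str] = None  # only set for reprint tampering


def manipulate(image, category, task, segment, ask_similar_category, inpaint, remove):
    """One pass of the local manipulation pipeline (all callables are hypothetical)."""
    # Step 1: locate the region to tamper with from the class label or a user query.
    mask = segment(image, category)
    if task == "reprint":
        # Step 2: ask the language model for an approximate category.
        new_category = ask_similar_category(category)
        # Step 3: embed it into an inpainting prompt (wording assumed here).
        prompt = f"a photo of a {new_category}"
        forged = inpaint(image, mask, prompt)
        return ManipulationSample(image, mask, forged, task, prompt)
    if task == "removal":
        # Removal needs only the original image and the mask.
        forged = remove(image, mask)
        return ManipulationSample(image, mask, forged, task)
    raise ValueError(f"unknown task: {task}")
```

Passing the models in as callables keeps the sketch independent of any particular segmentation or diffusion library.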

3.2 Analysis of Benchmark Scale

With the data generation pipeline described above, we can generate millions, tens of millions, or even billions of samples. Nonetheless, blindly increasing the amount of data does not improve algorithm performance and may lead to data redundancy. Therefore, taking data generated by Stable Diffusion as an example, we explore the influence of data volume and category volume on baseline classification [21] and segmentation [63] algorithms. The training sets are subsets of different scales, and all models are validated on the same test set. As shown in Table 2, the classification and segmentation metrics improve as the dataset scale increases. When the scale reaches 180K, the algorithm performance is almost saturated: whether more image classes or more data are added beyond this point, the metrics stagnate. The experiments demonstrate that increasing the amount of data or the number of categories brings negligible benefits once the data tends to be saturated. According to this analysis, for the training data, the GIM benchmark selects 100 different labels from ImageNet for each generation model to create tampered images. For the test data, the GIM benchmark uses the entire test dataset for each generation model.

Table 2: Results of classification and segmentation on subsets of different scales. Classification uses ResNet-50 and segmentation uses SegFormer-b0.

Total Num. of Images	Image Classes	Images per Class	Cls.Acc $\uparrow$	Seg.F1 $\uparrow$	Seg.AUC $\uparrow$
2,800	10	280	0.6306	0.2550	0.7950
28,000	100	280	0.7480	0.3247	0.7995
180,000	100	1800	0.9129	0.5289	0.8699
360,000	200	1800	0.9131	0.5291	0.8701
500,000	500	1000	0.9131	0.5291	0.8703

Table 3: Influence of different degradation methods on classification (ResNet-50) and segmentation (SegFormer-b0).

Degradation	Cls.Acc	Seg.F1
-	91.29	52.89
JPEG	83.10	45.74
Gaussian Blur	90.94	40.14
Downsample	88.34	37.12

3.3 Post-processing of Degradation Method

After being spread on the Internet, images encounter various post-processing operations, such as JPEG compression, downsampling, and Gaussian blur. These transformations can pose challenges for image forensics methods. Table 3 investigates the impact of degradation. The baseline models are trained on the clean Stable-Diffusion training set and tested on the same test set with a specific degradation method applied. The experimental results show that these degradation methods make identification more difficult. To bring the benchmark closer to reality and keep it objective, random degradation (JPEG compression, downsampling, Gaussian blur) is applied to the dataset. The details of the parameters can be found in the Supplementary Materials.
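A minimal NumPy sketch of such a random degradation step is given below. The parameter ranges, the choice of exactly one degradation per image, and the nearest-neighbour downsampling are assumptions made for illustration (the actual parameters are deferred to the Supplementary Materials). JPEG compression is left as an optional hook, `jpeg_fn`, a hypothetical caller-supplied function, because it needs an image codec.

```python
import numpy as np


def gaussian_blur(img, sigma=1.0):
    """Separable Gaussian blur for an H x W x C float image, with edge padding."""
    radius = max(1, int(round(3 * sigma)))
    x = np.arange(-radius, radius + 1, dtype=np.float64)
    kernel = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    kernel /= kernel.sum()
    h, w = img.shape[:2]
    padded = np.pad(img, ((radius, radius), (radius, radius), (0, 0)), mode="edge")
    # Filter vertically, then horizontally.
    rows = sum(kernel[i] * padded[i:i + h] for i in range(2 * radius + 1))
    return sum(kernel[j] * rows[:, j:j + w] for j in range(2 * radius + 1))


def downsample(arr, factor=2):
    """Nearest-neighbour downsampling by striding (works for images and masks)."""
    return arr[::factor, ::factor]


def random_degradation(img, mask, rng, jpeg_fn=None):
    """Apply one randomly chosen degradation; returns (image, mask, name)."""
    names = ["blur", "downsample"] + (["jpeg"] if jpeg_fn is not None else [])
    name = names[int(rng.integers(len(names)))]
    if name == "blur":
        return gaussian_blur(img, sigma=float(rng.uniform(0.5, 2.0))), mask, name
    if name == "downsample":
        # The mask must be resized together with the image to stay aligned.
        return downsample(img), downsample(mask), name
    return jpeg_fn(img), mask, name
```

Note that any degradation that changes the spatial size has to be applied to the tampering mask as well, otherwise the pixel-level labels no longer match the image.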

3.4 Benchmark Settings

Dataset: Datasets for research on generative IMDL are lacking, which makes direct comparison and accurate measurement difficult. We use three cross-generator datasets, GIM-SD (ImageNet data manipulated by Stable Diffusion; the other names follow the same pattern), GIM-GLIDE and GIM-DDNM, and a cross-distribution dataset, GIM-VOC (VOC data manipulated by Stable Diffusion, reported as GIM-SD(VOC) in Table 5), to present benchmarking results.

Metrics: We evaluate the performance of the proposed method on both the image manipulation detection task and the localization task. For classification, we use accuracy (Cls.Acc) as our evaluation metric, while for localization we adopt the pixel-level AUC and F1 score on manipulation masks. Since binary predictions are required to compute F1 scores, we binarize the predicted masks and detection scores with the Equal Error Rate (EER) threshold.
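The EER binarization and the pixel-level F1 can be illustrated with a small NumPy sketch. Searching the observed score values for the threshold where the false-positive and false-negative rates are closest is one common way to approximate the EER threshold; it is an assumption here, not necessarily the exact procedure used in the benchmark, and it requires both classes to be present in `labels`.

```python
import numpy as np


def eer_threshold(scores, labels):
    """Threshold at which false-positive and false-negative rates are closest."""
    scores = np.asarray(scores, dtype=np.float64).ravel()
    labels = np.asarray(labels).ravel().astype(bool)
    best_thr, best_gap = None, np.inf
    for thr in np.unique(scores):
        pred = scores >= thr
        fpr = np.mean(pred[~labels])   # real pixels flagged as forged
        fnr = np.mean(~pred[labels])   # forged pixels that were missed
        gap = abs(fpr - fnr)
        if gap < best_gap:
            best_thr, best_gap = thr, gap
    return float(best_thr)


def f1_score(pred, labels):
    """Pixel-level F1 between a binary prediction and a binary ground-truth mask."""
    pred = np.asarray(pred).ravel().astype(bool)
    labels = np.asarray(labels).ravel().astype(bool)
    tp = np.sum(pred & labels)
    fp = np.sum(pred & ~labels)
    fn = np.sum(~pred & labels)
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom else 0.0
```

The loop is quadratic in the number of distinct scores, which is acceptable for a sketch; a sorted cumulative-sum formulation would be preferable for full-resolution masks.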

Settings: Two settings are proposed to verify the comprehensive and generalization performance of the algorithms. In the cross-generator generalization setting, the models are trained on the GIM-SD training set and tested on the GIM-GLIDE, GIM-DDNM and GIM-VOC test sets to explore generalization, as shown in Table 5. In the mix-generator comprehensive setting, the models are jointly trained on the GIM-SD, GIM-GLIDE and GIM-DDNM training sets and tested on each corresponding test set to evaluate comprehensive IMDL performance, as shown in Table 4.

4 Method

We propose the IMDL framework GIMFormer, which utilizes a dual encoder and decoder architecture. Considering the specifics of generative manipulation, we propose the ShadowTracer in Section 4.1, the Frequency-Spatial Block (FSB) in Section 4.2, and the Multi Windowed Anomalous Modelling module (MWAM) in Section 4.3. Figure 3 gives an overview of the framework. For the input RGB image $x$ , we first extract its learned trace map $t$ . Then, both $x$ and $t$ are fed into a two-branch network, where a four-stage structure is used to extract pyramidal features $F_{i}$ ( $i\in[1,4]$ ). The RGB branch is composed of FSB, Transformer Block [63] and MWAM. The tracer branch consists of a Transformer Block and MWAM. In the fusion step, the feature rectification module (FRM) [68] and the feature fusion module (FFM) [68] are used for feature fusion. The four-stage fused features are forwarded to the decoder for final detection $\hat{y}$ and location $\hat{M}$ . Details are provided in the Supplementary Materials.

4.1 ShadowTracer

Prior manipulation detection methods mainly focus on “cheapfake” scenes and rely on visible traces. These artifacts include distortions and sudden changes caused by manipulation of the image structure. However, generative tampering makes significant alterations to the content with no apparent frequency or structural inconsistency. As shown in Figure 4, these subtle traces appear as inherent patterns rather than as visible traces with inconsistent edges.

ShadowTracer aims to capture the inherent characteristics and subtle traces of the generative models. For a doctored image, our objective is to learn a mapping $g_{\phi}$ from the tampered image to its latent disturbed pixel values, where $g_{\phi}$ represents a neural network with trainable parameters $\phi$ . Our key observation is that the differences introduced by generative models in the data distribution exhibit inherent patterns, and deep neural networks can attempt to reconstruct these variations. At the training stage, we generate pairs of the input image $x_{i}$ and the generative tampered image $G(x_{i})$ ; the manipulation trace is then calculated as $t_{i}=G(x_{i})-x_{i}$ . The objective function for training $g_{\phi}$ can be formulated as

		$\displaystyle\min_{\phi}\Big\{\mathcal{L}_{r}\big(g_{\phi}\left(G\left(x_{i}\right)\right),t_{i}\big)\Big\}$		(1)

where $\mathcal{L}_{r}(\mathbf{x},\mathbf{y})=\|\mathbf{x}-\mathbf{y}\|_{2}$ .

Furthermore, the mapping network should satisfy two properties: detecting subtle tampering traces and being robust to various real-world image degradations. For this reason, image pairs are generated by mixing original and manipulated images and incorporating diverse degradation operations at the training stage. Specifically, given an input image $I$ , we segment a portion and perform generative manipulation to obtain $I_{m}$ . Alpha blending [47] is applied to the original and manipulated images to obscure obvious tampering traces. Following this, we subject the images to degradation operations such as blurring and JPEG compression to obtain the final manipulated image. The network is trained on 64 $\times$ 64 pixel patches randomly sampled from the dataset, adopting the loss in Eq. 1.
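The construction of one ShadowTracer training pair can be sketched in NumPy as below. The linear blending form, the value of `alpha`, and the helper names are assumptions for illustration; only the trace definition $t_{i}=G(x_{i})-x_{i}$, the $\ell_2$ loss, and the 64 $\times$ 64 patch size come from the text.

```python
import numpy as np


def make_training_pair(original, generated, mask, alpha=0.8):
    """Blend generated content into the original inside the mask.

    original, generated: H x W x C float arrays; mask: H x W array in [0, 1].
    Returns the tampered image (network input) and the trace (regression target).
    """
    weight = alpha * mask[..., None]          # per-pixel blending weight
    tampered = (1.0 - weight) * original + weight * generated
    trace = tampered - original               # t = G(x) - x
    return tampered, trace


def l2_loss(pred, target):
    """L_r(x, y) = ||x - y||_2 as in Eq. 1."""
    return float(np.sqrt(np.sum((pred - target) ** 2)))


def random_patch(image, trace, rng, size=64):
    """Sample the same random size x size crop from the image and its trace."""
    h, w = image.shape[:2]
    top = int(rng.integers(0, h - size + 1))
    left = int(rng.integers(0, w - size + 1))
    window = (slice(top, top + size), slice(left, left + size))
    return image[window], trace[window]
```

Outside the mask the trace is exactly zero, so the regression target itself encodes where tampering occurred.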

4.2 Frequency-Spatial Block

When degradation operations are applied, artifacts in manipulated images are hard to perceive. To improve the local expressive ability and effectively harvest discriminative cues in manipulated images, we design a Frequency-Spatial Block (FSB) to extract forgery features in the frequency and spatial domains.

Inspired by the recent work [58, 49, 37, 67], as shown in Figure 3, FSB consists of two branches: a frequency branch and a spatial branch. In the frequency branch, the input $X$ is converted into the frequency domain $\mathcal{F}_{T}(X)$ using the 2D FFT. A learnable filter $G_{i}$ is multiplied to modulate the spectrum and capture the important frequency information. Subsequently, the inverse FFT is applied to convert the feature back to the spatial domain, resulting in the extraction of frequency-aware features $X_{\text{f}}$ . In the spatial branch, the input $X$ is processed through convolutional layers and activated using the LeakyReLU function in a separate network to enhance the expressiveness of the features and obtain refined spatial features $X_{\text{s}}$ . Then $X_{\text{f}}$ and $X_{\text{s}}$ are concatenated and passed through convolutional layers and the LeakyReLU activation function to obtain enhanced information, which is then combined with the original input $X$ through element-wise summation.

The total process can be formulated by:

		$\displaystyle X_{\text{f}}=\hat{\mathcal{F}}_{T}(\mathcal{F}_{T}(X)\odot G_{i})$		(2)
		$\displaystyle X_{\text{s}}=\mathrm{Conv_{L}}\left(\mathrm{Conv}\left(X\right)\right)$
		$\displaystyle X_{out}=\mathrm{Conv_{L}}([X_{\mathrm{f}},X_{\mathrm{s}}])+X,$

where $\mathcal{F}_{T}$ and $\hat{\mathcal{F}}_{T}$ denote the 2D FFT and its inverse, $\odot$ denotes the Hadamard product, $\mathrm{Conv_{L}}$ denotes convolution followed by LeakyReLU, and $\begin{bmatrix}\cdot\end{bmatrix}$ denotes a concatenation operation.
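The frequency branch of Eq. 2 can be illustrated with NumPy's FFT routines. This sketch covers only the first line of Eq. 2; the spatial branch and the fusion convolutions are omitted, and the filter is a fixed array rather than a learned parameter. Taking the real part of the inverse transform is an assumption made so that the output stays real-valued.

```python
import numpy as np


def frequency_branch(x, g):
    """X_f = IFFT2(FFT2(X) * G), applied independently to each channel.

    x: real array of shape (C, H, W); g: filter of shape (C, H, W).
    """
    spectrum = np.fft.fft2(x, axes=(-2, -1))        # to the frequency domain
    modulated = spectrum * g                        # Hadamard product with the filter
    return np.real(np.fft.ifft2(modulated, axes=(-2, -1)))
```

Two sanity checks follow from the definition: an all-ones filter leaves the input unchanged, and a filter that keeps only the zero-frequency coefficient returns each channel's mean at every position.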

4.3 Multi Windowed Anomalous Modelling Module

Image manipulation causes discrepancies at the pixel level. Genuine pixels are expected to exhibit consistency with neighboring pixels, while manipulated pixels may deviate and display anomalies. Prior works [62, 35, 28] explore modeling such local inconsistencies. To effectively capture the pixel-level inconsistency between the manipulated and real regions, we introduce the Multi Windowed Anomalous Modelling module (MWAM) to model these differences at multiple scales for fine-grained features.

As shown in Figure 3, given the input feature $F\in\mathbb{R}^{H\times W\times C}$ , we calculate the difference between each pixel and its surrounding pixels within a local window in two branches using Eq. 3.

	$\displaystyle D^{k}_{u}[i,j]=(F[i,j]-F^{k}_{u}[i,j])/\sigma^{*},$
	$\displaystyle\sigma^{*}=\text{maximum}(\sigma(F),10^{-5}+w_{\sigma})$		(3)

where $u\in\{a,m\}$ denotes the average or maximum branch, $\sigma(F)$ is the standard deviation of $F$ , and $w_{\sigma}$ is a learnable non-negative weight vector of the same length as $\sigma$ . $F_{a}^{k}$ and $F_{m}^{k}$ are the average and maximum values over the $k\times k$ window around each pixel.

Different window sizes $k$ are selected to model the inconsistency at different scales. Then, the $N=3$ different-scale maps $D^{k}_{a}$ and $D^{k}_{m}$ are concatenated within each branch and fed into a convolutional network to obtain the anomaly maps $M_{a}$ and $M_{m}$ , each of the same size as the original input.
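Eq. 3 and the multi-scale concatenation can be sketched in NumPy as follows. The window sizes `(3, 5, 7)`, the edge padding, and computing $\sigma(F)$ per channel are assumptions; the learnable weight $w_{\sigma}$ is represented by a plain argument, and the convolution that turns the concatenated maps into $M_{a}$ and $M_{m}$ is not included.

```python
import numpy as np


def window_stat(f, k, reduce):
    """Apply reduce (np.mean or np.max) over the k x k window around each pixel.

    f has shape (H, W, C); borders are handled by edge padding.
    """
    r = k // 2
    h, w, _ = f.shape
    padded = np.pad(f, ((r, r), (r, r), (0, 0)), mode="edge")
    shifts = [padded[i:i + h, j:j + w, :] for i in range(k) for j in range(k)]
    return reduce(np.stack(shifts, axis=0), axis=0)


def local_difference(f, k, reduce, w_sigma=0.0, eps=1e-5):
    """D_u^k = (F - F_u^k) / sigma*, with sigma* = maximum(sigma(F), eps + w_sigma)."""
    sigma = f.std(axis=(0, 1))                    # one value per channel (assumed)
    sigma_star = np.maximum(sigma, eps + w_sigma)
    return (f - window_stat(f, k, reduce)) / sigma_star


def multi_scale_differences(f, sizes=(3, 5, 7)):
    """Concatenate the N = len(sizes) difference maps of each branch along channels."""
    d_avg = np.concatenate([local_difference(f, k, np.mean) for k in sizes], axis=-1)
    d_max = np.concatenate([local_difference(f, k, np.max) for k in sizes], axis=-1)
    return d_avg, d_max
```

On a feature map that is zero except for one isolated spike, the average branch gives a positive response at the spike, while the maximum branch gives zero at the spike and negative responses at its neighbours, which is why the two branches are complementary.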

Additionally, the anomaly score mask $\hat{\bm{S}_{u}}\in\mathbb{R}^{H\times W}$ of the feature is computed using Eq. 4.

	$\displaystyle\hat{\bm{f}_{u}}=\operatorname{DW-Conv}\left(\bm{f}\right),$
	$\displaystyle\hat{\bm{S}_{u}}=\operatorname{Sigmoid}\left(\mathrm{Conv}(C,1)\left(\hat{\bm{f}_{u}}\right)\right),$		(4)

where $\operatorname{DW-Conv}$ denotes a $3\times 3$ depth-wise convolution layer.

The element-wise multiplication between the anomaly score $\hat{\bm{S}_{u}}$ and the anomaly map $M_{u}$ captures the anomaly information. Next, we calculate the element-wise summation of the resulting anomaly-aware map and the input feature map $X$ to obtain an anomaly-sensitive feature map. The whole process can be described as:

	$\displaystyle\hat{X}=X+\hat{\bm{S}_{a}}\odot M_{a}+\hat{\bm{S}_{m}}\odot M_{m}$		(5)

Table 4: Benchmarking IMDL models for manipulation detection and localization in the mix-generator comprehensive setting. The models are trained on the whole GIM training set and tested on each corresponding test set. The detection Cls.Acc (%) and localization AUC (%) and F1 (%) are reported. Params and GFLOPS denote the models’ parameter counts and computational complexity, with Params measured in millions (M). $\ddagger$ indicates that the original paper did not provide code; we reproduced the method and evaluated it under the same settings.

Method	Params	GFLOPS	GIM-SD Cls.Acc	GIM-SD F1	GIM-SD AUC	GIM-GLIDE Cls.Acc	GIM-GLIDE F1	GIM-GLIDE AUC	GIM-DDNM Cls.Acc	GIM-DDNM F1	GIM-DDNM AUC
ManTranet[62]	4.0	1009.7	61.08	37.48	80.83	70.99	49.11	83.29	53.99	33.12	74.94
MVSS-Net[12]	146.9	160.0	56.12	23.17	72.03	61.29	33.12	74.94	49.15	14.11	70.09
SPAN[25]	15.4	30.9	53.15	35.62	79.28	60.01	39.46	81.21	59.15	32.1	73.55
PSCC-Net[40]	4.1	107.3	52.28	31.48	83.85	66.52	53.68	86.37	56.33	41.77	85.8
ObjectFormer $\ddagger$ [58]	14.6	249.6	59.12	26.82	85.16	70.12	40.12	85.22	54.27	33.12	86.82
Trufor $\ddagger$ [19]	67.8	90.1	67.12	44.13	84.52	80.19	59.32	92.96	63.34	44.52	87.60
SegFormer[63]	27.5	41.3	64.25	46.17	83.26	78.11	56.77	88.73	69.29	40.19	84.65
GIMFormer (Ours)	95.9	96.2	70.92	58.61	88.25	83.89	77.31	95.42	76.72	56.25	88.31

4.4 Loss Function

For manipulation detection, we apply the lightweight backbone of [57] to the fourth-stage feature to calculate the final binary prediction $\hat{y}$ . For manipulation localization, we utilize the MLP decoder [63] as the segmentation head to obtain a predicted mask $\hat{M}$ . Given the ground-truth label $y$ and mask $M$ , we train GIMFormer with the following objective function:

		$\displaystyle{\mathcal{L}}={\mathcal{L}}_{cls}(y,{\hat{y}})+\lambda{\mathcal{L}}_{seg}(M,{\hat{M}}),$		(6)

where both ${\mathcal{L}}_{cls}$ and ${\mathcal{L}}_{seg}$ are binary cross-entropy loss, and $\lambda=1$ is a balancing hyperparameter.
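As a worked illustration of Eq. 6, the following NumPy sketch evaluates the combined objective on probabilities. The clipping constant `eps` and the mean reduction over pixels are assumptions added for numerical stability and are not specified in the text.

```python
import numpy as np


def bce(target, pred, eps=1e-7):
    """Binary cross-entropy between targets in {0, 1} and predicted probabilities."""
    pred = np.clip(np.asarray(pred, dtype=np.float64), eps, 1.0 - eps)
    target = np.asarray(target, dtype=np.float64)
    return float(np.mean(-(target * np.log(pred) + (1.0 - target) * np.log(1.0 - pred))))


def total_loss(y, y_hat, mask, mask_hat, lam=1.0):
    """L = L_cls(y, y_hat) + lambda * L_seg(M, M_hat), as in Eq. 6."""
    return bce(y, y_hat) + lam * bce(mask, mask_hat)
```

With $\lambda=1$ and an uninformative prediction of 0.5 everywhere, each term equals $\ln 2$, so the total is $2\ln 2\approx 1.386$.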

4.5 Implementation Details

Our approach includes two separate training steps. First, we train the ShadowTracer on a dataset synthesized from ImageNet, following the data generation method described in Section 4.1. Then, we train the encoder and decoder of the model according to the two settings in GIM, as described in Section 3.4. We train our models on eight V100 GPUs with an initial learning rate (LR) of $6\times 10^{-5}$ , which is scheduled by the poly strategy with power 0.9 over 40 epochs. The optimizer is AdamW [41] with epsilon $10^{-8}$ and weight decay $10^{-2}$ , and the batch size is 4 on each GPU.
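The poly learning-rate strategy mentioned above is commonly written as $\mathrm{lr}=\mathrm{lr}_{0}\,(1-s/S)^{p}$. The sketch below uses that form with the stated initial rate and power; whether the schedule is stepped per iteration or per epoch is not specified, so `step` and `total_steps` are left generic.

```python
def poly_lr(step, total_steps, base_lr=6e-5, power=0.9):
    """Polynomial decay: base_lr * (1 - step / total_steps) ** power."""
    progress = min(step, total_steps) / total_steps
    return base_lr * (1.0 - progress) ** power


def global_batch_size(num_gpus=8, per_gpu=4):
    """Eight GPUs with 4 samples each give a global batch of 32."""
    return num_gpus * per_gpu
```

The rate starts at `base_lr`, decays monotonically, and reaches zero at `total_steps`.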

Table 5: Benchmarking IMDL models for manipulation detection and localization in the cross-generator generalization setting. The models are trained on the GIM-SD training set and tested on all the test sets to explore the generalization ability. The detection Cls.Acc (%) and localization AUC (%) and F1 (%) are reported. $\ddagger$ indicates that the original paper did not provide code; we reproduced the method and evaluated it under the same settings.

Method	GIM-SD Cls.Acc	GIM-SD F1	GIM-SD AUC	GIM-SD(VOC) Cls.Acc	GIM-SD(VOC) F1	GIM-SD(VOC) AUC	GIM-GLIDE Cls.Acc	GIM-GLIDE F1	GIM-GLIDE AUC	GIM-DDNM Cls.Acc	GIM-DDNM F1	GIM-DDNM AUC
ManTranet[62]	73.12	43.18	80.18	63.18	27.37	72.74	58.21	24.53	74.61	39.52	16.8	58.31
MVSS-Net[12]	56.12	25.13	82.15	56.30	21.2	73.17	53.12	23.17	73.11	48.12	10.11	50.15
SPAN[25]	52.15	47.32	56.93	52.17	29.86	55.17	50.03	30.14	56.08	44.17	14.30	56.67
PSCC-Net[40]	58.24	48.36	87.29	54.79	39.36	83.80	51.11	29.60	79.76	40.52	4.40	48.52
ObjectFormer $\ddagger$ [58]	67.12	39.54	87.93	58.19	29.13	81.20	54.19	17.74	81.02	49.23	7.16	58.95
Trufor $\ddagger$ [19]	74.00	49.85	86.15	65.15	31.95	80.13	50.12	22.39	76.25	49.19	5.70	53.11
SegFormer[63]	71.93	53.08	87.11	60.03	28.17	80.97	54.11	21.03	75.12	50.21	4.70	51.33
GIMFormer (Ours)	78.96	61.75	90.61	67.19	50.01	84.01	67.01	39.27	81.02	50.23	19.37	60.12

4.6 Comparison with state-of-the-art methods

We compare our method with various state-of-the-art IMDL methods [62, 12, 25, 40, 19, 58] on generative manipulation detection and location. In addition, the vanilla SegFormer (MiT-B2) [63] is also compared, since our method is based on that architecture. As shown in Table 4 and Table 5, these methods are tested under various settings to verify performance and generalization. Note that some methods are not explicitly designed for image-level detection, in which case we use the maximum of the prediction map as the detection statistic. All methods are trained and evaluated under the same implementation details.
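For methods without an image-level head, the max-pooling rule above amounts to the following sketch. The decision threshold of 0.5 is a placeholder assumption, not a value taken from the experiments.

```python
import numpy as np


def detection_score(prob_map):
    """Image-level statistic: the maximum of the pixel-wise prediction map."""
    return float(np.max(prob_map))


def detect(prob_map, threshold=0.5):
    """Flag the image as manipulated when the statistic reaches the threshold."""
    return detection_score(prob_map) >= threshold
```

Because a single confident pixel is enough to flag the whole image, this statistic is sensitive to small tampered regions but also to isolated false positives.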

Cross-generator generalization capability comparison. Table 5 reports the generalization performance of all the methods mentioned. The results show that GIMFormer outperforms all other methods by a significant margin in both in-domain detection ability and cross-domain generalization ability. For in-domain detection, our method captures the subtle artifacts inherent in generative manipulation and accurately localizes them. Other methods, in contrast, may be confused because they attempt to learn specific content, which makes it hard to accurately detect and localize generative tampering patterns. For cross-domain generalization, GIMFormer achieves state-of-the-art performance. It detects manipulations produced by different generative models, as shown by the results on the GIM-GLIDE and GIM-DDNM test sets. GIMFormer also works well on data generated by the same generator from a different distribution, where existing methods show an obvious performance drop, as shown by the results on the GIM-SD(VOC) test set. The qualitative results for visual comparisons are illustrated in Figure 6.

Mix-generator comprehensive capability comparison. Table 5 reports the comprehensive performance of all the methods mentioned. GIMFormer also outperforms all other methods on all the test sets, demonstrating its superior ability to identify generative tampering and its comprehensive performance. Existing methods have a low pixel-level F1 score on this benchmark, which means that they cannot accurately identify tampered areas. The qualitative results for visual comparisons are illustrated in Figure 5.

4.7 Ablation Analysis

Effectiveness of the proposed modules. We consider the simple baseline proposed in [63] and gradually integrate new key components. Experiments are carried out on the GIM-GLIDE testset. The quantitative results are listed in Table 6. The results show that ShadowTracer brings significant improvements over the vanilla baseline. Adding the MWAM yields a further increase of 6.29% in F1 and 2% in AUC, which indicates that differential information at multiple scales is crucial for accurate tampering localization. Using the FSB to dynamically harvest complementary frequency and spatial cues improves performance further, particularly in detection. The results verify that ShadowTracer, FSB and MWAM effectively improve the performance of the baseline model.

Table 6: Ablation results on GIM-GLIDE test dataset using different variants of GIMFormer. All detection Cls.Acc(%), localization AUC(%), and F1(%) scores are reported.

Variants	Cls.Acc	F1	AUC
Baseline	78.11	56.77	88.73
+ST	79.96	67.12	91.99
+ST+MWAM	80.19	73.41	94.01
+ST+MWAM+FSB	83.89	77.31	95.42

Table 7: Generalization Experiments of ShadowTracer in percentage (%). ST denotes ShadowTracer and ST(SD) denotes ShadowTracer trained on data generated by Stable-Diffusion.

Variants	GIM-GLIDE			GIM-DDNM
Variants	Cls.Acc	F1	AUC	Cls.Acc	F1	AUC
GIMFormer w ST(SD)	83.1	78.1	95.4	77.4	58.8	89.4
GIMFormer w/o ST	80.0	68.9	91.3	73.1	51.3	86.1

Generalization of ShadowTracer. We first train ShadowTracer on data generated by Stable Diffusion and then keep its weights fixed. Following this pretraining phase, we train the backbone on the GIM-GLIDE and GIM-DDNM trainsets, with and without ShadowTracer. The detection results (Cls.Acc) and location results (F1 and AUC) are presented in Table 7, revealing that the pretrained ShadowTracer significantly enhances performance in cross-generator IMDL tasks. Notably, on the GIM-GLIDE dataset with the same manipulation type, ShadowTracer leads to a 9% improvement in F1 and a 4% increase in AUC for location, together with a 3% increase in detection accuracy. On the GIM-DDNM dataset with various manipulation types, there is a 7% improvement in F1, a 3% increase in AUC for location, and a 4% increase in detection accuracy.
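The gains quoted above can be re-derived from Table 7 as absolute differences in percentage points. The short sketch below only restates the table values; the dictionary layout and function name are illustrative assumptions.

```python
# Table 7 values in percent, ordered as (Cls.Acc, F1, AUC).
TABLE_7 = {
    "GIM-GLIDE": {"with_st": (83.1, 78.1, 95.4), "without_st": (80.0, 68.9, 91.3)},
    "GIM-DDNM": {"with_st": (77.4, 58.8, 89.4), "without_st": (73.1, 51.3, 86.1)},
}
METRICS = ("Cls.Acc", "F1", "AUC")


def shadowtracer_gains(table):
    """Absolute gain (percentage points) of 'w ST(SD)' over 'w/o ST' per metric."""
    gains = {}
    for dataset, rows in table.items():
        gains[dataset] = {
            metric: round(with_st - without_st, 1)
            for metric, with_st, without_st in zip(
                METRICS, rows["with_st"], rows["without_st"]
            )
        }
    return gains


for dataset, gain in shadowtracer_gains(TABLE_7).items():
    print(dataset, gain)
```

The rounded figures in the text (for example 9% in F1 on GIM-GLIDE) correspond to these differences with the fractional part dropped.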

5 Conclusion

We address the challenge of detecting and locating generative-based manipulation and provide a reliable database (GIM) for AIGC security. The proposed dataset leverages multiple mainstream generators and tampering methods to provide a variety of generative manipulation data. Additionally, we introduce GIMFormer, a novel transformer-based IMDL framework. ShadowTracer is designed to catch subtle artifacts in generative tampering. The Frequency-Spatial Block gathers frequency and spatial information, while the Multi-window Anomalous Modelling module captures pixel-level inconsistencies at multiple scales for fine-grained features. Extensive experiments demonstrate the superior performance of our model, which achieves SoTA results.

Limitations. The GIM dataset shares classes with ImageNet and VOC, but it may not encompass objects that emerge in the future, given the evolving variety of the real world. Fine-tuning pre-trained models on new object data can address this issue. Current research primarily centers on image manipulation. The emergence of realistic video generation models like SORA [5] presents fresh challenges as AI manipulation extends into videos. We plan to extend our research to address the complexities of video manipulation.

References

[1] Amerini, I., Ballan, L., Caldelli, R., Del Bimbo, A., Serra, G.: A sift-based forensic method for copy–move attack detection and transformation recovery. IEEE transactions on information forensics and security 6(3), 1099–1110 (2011)
[2] Asnani, V., Yin, X., Hassner, T., Liu, X.: Malp: Manipulation localization using a proactive scheme. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12343–12352 (2023)
[3] Bird, J.J., Lotfi, A.: Cifake: Image classification and explainable identification of ai-generated synthetic images. IEEE Access (2024)
[4] Brock, A., Donahue, J., Simonyan, K.: Large scale gan training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096 (2018)
[5] Brooks, T., Peebles, B., Holmes, C., DePue, W., Guo, Y., Jing, L., Schnurr, D., Taylor, J., Luhman, T., Luhman, E., Ng, C., Wang, R., Ramesh, A.: Video generation models as world simulators (2024), https://openai.com/research/video-generation-models-as-world-simulators
[6] Chai, L., Bau, D., Lim, S.N., Isola, P.: What makes fake images detectable? understanding properties that generalize. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXVI 16. pp. 103–120. Springer (2020)
[7] Cozzolino, D., Poggi, G., Verdoliva, L.: Splicebuster: A new blind image splicing detector. In: 2015 IEEE International Workshop on Information Forensics and Security (WIFS). pp. 1–6. IEEE (2015)
[8] Crawford, K., Paglen, T.: Excavating ai: The politics of images in machine learning training sets. Ai & Society 36(4), 1105–1116 (2021)
[9] Dang, H., Liu, F., Stehouwer, J., Liu, X., Jain, A.K.: On the detection of digital face manipulation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern recognition. pp. 5781–5790 (2020)
[10] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE conference on computer vision and pattern recognition. pp. 248–255. IEEE (2009)
[11] Dhariwal, P., Nichol, A.: Diffusion models beat gans on image synthesis. Advances in neural information processing systems 34, 8780–8794 (2021)
[12] Dong, C., Chen, X., Hu, R., Cao, J., Li, X.: Mvss-net: Multi-view multi-scale supervised networks for image manipulation detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 45(3), 3539–3553 (2022)
[13] Dong, J., Wang, W., Tan, T.: Casia image tampering detection evaluation database. In: 2013 IEEE China summit and international conference on signal and information processing. pp. 422–426. IEEE (2013)
[14] Everingham, M., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A.: The pascal visual object classes (voc) challenge. International journal of computer vision 88, 303–338 (2010)
[15] Gao, S., Lin, Z., Xie, X., Zhou, P., Cheng, M.M., Yan, S.: Editanything: Empowering unparalleled flexibility in image editing and generation. In: Proceedings of the 31st ACM International Conference on Multimedia, Demo track (2023)
[16] Gu, S., Chen, D., Bao, J., Wen, F., Zhang, B., Chen, D., Yuan, L., Guo, B.: Vector quantized diffusion model for text-to-image synthesis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10696–10706 (2022)
[17] Guan, H., Kozak, M., Robertson, E., Lee, Y., Yates, A.N., Delgado, A., Zhou, D., Kheyrkhah, T., Smith, J., Fiscus, J.: Mfc datasets: Large-scale benchmark datasets for media forensic challenge evaluation. In: 2019 IEEE Winter Applications of Computer Vision Workshops (WACVW). pp. 63–72. IEEE (2019)
[18] Guarnera, L., Giudice, O., Guarnera, F., Ortis, A., Puglisi, G., Paratore, A., Bui, L.M., Fontani, M., Coccomini, D.A., Caldelli, R., et al.: The face deepfake detection challenge. Journal of Imaging 8(10), 263 (2022)
[19] Guillaro, F., Cozzolino, D., Sud, A., Dufour, N., Verdoliva, L.: Trufor: Leveraging all-round clues for trustworthy image forgery detection and localization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 20606–20615 (2023)
[20] Guo, X., Liu, X., Ren, Z., Grosz, S., Masi, I., Liu, X.: Hierarchical fine-grained image forgery detection and localization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3155–3165 (2023)
[21] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 770–778 (2016)
[22] Heller, S., Rossetto, L., Schuldt, H.: The ps-battles dataset-an image collection for image manipulation detection. arXiv preprint arXiv:1804.04866 (2018)
[23] Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. Advances in neural information processing systems 33, 6840–6851 (2020)
[24] Hsu, Y.F., Chang, S.F.: Detecting image splicing using geometry invariants and camera characteristics consistency. In: 2006 IEEE International Conference on Multimedia and Expo. pp. 549–552. IEEE (2006)
[25] Hu, X., Zhang, Z., Jiang, Z., Chaudhuri, S., Yang, Z., Nevatia, R.: Span: Spatial pyramid attention network for image manipulation localization. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXI 16. pp. 312–328. Springer (2020)
[26] Huang, Y., Juefei-Xu, F., Guo, Q., Liu, Y., Pu, G.: Fakelocator: Robust localization of gan-based face manipulations. IEEE Transactions on Information Forensics and Security 17, 2657–2672 (2022)
[27] Huh, M., Liu, A., Owens, A., Efros, A.A.: Fighting fake news: Image splice detection via learned self-consistency. In: Proceedings of the European conference on computer vision (ECCV). pp. 101–117 (2018)
[28] Ji, K., Chen, F., Guo, X., Xu, Y., Wang, J., Chen, J.: Uncertainty-guided learning for improving image manipulation detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 22456–22465 (2023)
[29] Jia, S., Huang, M., Zhou, Z., Ju, Y., Cai, J., Lyu, S.: Autosplice: A text-prompt manipulated image dataset for media forensics. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 893–903 (2023)
[30] Jiang, L., Li, R., Wu, W., Qian, C., Loy, C.C.: Deeperforensics-1.0: A large-scale dataset for real-world face forgery detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 2889–2898 (2020)
[31] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
[32] Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., Dollar, P., Girshick, R.: Segment anything. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 4015–4026 (October 2023)
[33] Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023)
[34] Kniaz, V.V., Knyaz, V., Remondino, F.: The point where reality meets fantasy: Mixed adversarial generators for image splice detection. Advances in neural information processing systems 32 (2019)
[35] Kong, C., Luo, A., Wang, S., Li, H., Rocha, A., Kot, A.C.: Pixel-inconsistency modeling for image manipulation localization. arXiv preprint arXiv:2310.00234 (2023)
[36] Kwon, M.J., Nam, S.H., Yu, I.J., Lee, H.K., Kim, C.: Learning jpeg compression artifacts for image manipulation detection and localization. International Journal of Computer Vision 130(8), 1875–1895 (2022)
[37] Lee-Thorp, J., Ainslie, J., Eckstein, I., Ontanon, S.: Fnet: Mixing tokens with fourier transforms. arXiv preprint arXiv:2105.03824 (2021)
[38] Li, D., Zhu, J., Wang, M., Liu, J., Fu, X., Zha, Z.J.: Edge-aware regional message passing controller for image forgery localization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 8222–8232 (2023)
[39] Li, L., Bao, J., Zhang, T., Yang, H., Chen, D., Wen, F., Guo, B.: Face x-ray for more general face forgery detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 5001–5010 (2020)
[40] Liu, X., Liu, Y., Chen, J., Liu, X.: Pscc-net: Progressive spatio-channel correlation network for image manipulation detection and localization. IEEE Transactions on Circuits and Systems for Video Technology 32(11), 7505–7517 (2022)
[41] Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
[42] Mahfoudi, G., Tajini, B., Retraint, F., Morain-Nicolier, F., Dugelay, J.L., Marc, P.: Defacto: Image and face manipulation dataset. In: 2019 27th European Signal Processing Conference (EUSIPCO). pp. 1–5. IEEE (2019)
[43] Ng, T.T., Chang, S.F., Sun, Q.: A data set of authentic and spliced image blocks. Columbia University, ADVENT Technical Report 4 (2004)
[44] Nichol, A., Dhariwal, P., Ramesh, A., Shyam, P., Mishkin, P., McGrew, B., Sutskever, I., Chen, M.: Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741 (2021)
[45] Novozamsky, A., Mahdian, B., Saic, S.: Imd2020: A large-scale annotated dataset tailored for detecting manipulated images. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision Workshops. pp. 71–80 (2020)
[46] O’Sullivan, D., Passantino, J.: ‘verified’ twitter accounts share fake image of ‘explosion’ near pentagon, causing confusion (2023)
[47] Porter, T., Duff, T.: Compositing digital images. In: Proceedings of the 11th annual conference on Computer graphics and interactive techniques. pp. 253–259 (1984)
[48] Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss, C., Radford, A., Chen, M., Sutskever, I.: Zero-shot text-to-image generation. In: International Conference on Machine Learning. pp. 8821–8831. PMLR (2021)
[49] Rao, Y., Zhao, W., Zhu, Z., Lu, J., Zhou, J.: Global filter networks for image classification. Advances in neural information processing systems 34, 980–993 (2021)
[50] Rao, Y., Ni, J.: A deep learning approach to detection of splicing and copy-move forgeries in images. In: 2016 IEEE international workshop on information forensics and security (WIFS). pp. 1–6. IEEE (2016)
[51] Ren, T., Liu, S., Zeng, A., Lin, J., Li, K., Cao, H., Chen, J., Huang, X., Chen, Y., Yan, F., Zeng, Z., Zhang, H., Li, F., Yang, J., Li, H., Jiang, Q., Zhang, L.: Grounded sam: Assembling open-world models for diverse visual tasks (2024)
[52] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10684–10695 (2022)
[53] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021)
[54] Rossler, A., Cozzolino, D., Verdoliva, L., Riess, C., Thies, J., Nießner, M.: Faceforensics++: Learning to detect manipulated facial images. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 1–11 (2019)
[55] Shi, J., Malik, J.: Normalized cuts and image segmentation. IEEE Transactions on pattern analysis and machine intelligence 22(8), 888–905 (2000)
[56] Verdoliva, L., Cozzolino, D., Nagano, K.: 2022 ieee image and video processing cup synthetic image detection
[57] Wang, J., Sun, K., Cheng, T., Jiang, B., Deng, C., Zhao, Y., Liu, D., Mu, Y., Tan, M., Wang, X., et al.: Deep high-resolution representation learning for visual recognition. IEEE transactions on pattern analysis and machine intelligence 43(10), 3349–3364 (2020)
[58] Wang, J., Wu, Z., Chen, J., Han, X., Shrivastava, A., Lim, S.N., Jiang, Y.G.: Objectformer for image manipulation detection and localization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2364–2373 (2022)
[59] Wang, Y., Yu, J., Zhang, J.: Zero-shot image restoration using denoising diffusion null-space model. arXiv preprint arXiv:2212.00490 (2022)
[60] Wen, B., Zhu, Y., Subramanian, R., Ng, T.T., Shen, X., Winkler, S.: Coverage—a novel database for copy-move forgery detection. In: 2016 IEEE international conference on image processing (ICIP). pp. 161–165. IEEE (2016)
[61] Wu, H., Zhou, J., Tian, J., Liu, J.: Robust image forgery detection over online social network shared images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13440–13449 (2022)
[62] Wu, Y., AbdAlmageed, W., Natarajan, P.: Mantra-net: Manipulation tracing network for detection and localization of image forgeries with anomalous features. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 9543–9552 (2019)
[63] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. Advances in Neural Information Processing Systems 34, 12077–12090 (2021)
[64] Ying, Q., Zhou, H., Qian, Z., Li, S., Zhang, X.: Learning to immunize images for tamper localization and self-recovery. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023)
[65] Yu, T., Feng, R., Feng, R., Liu, J., Jin, X., Zeng, W., Chen, Z.: Inpaint anything: Segment anything meets image inpainting. arXiv preprint arXiv:2304.06790 (2023)
[66] Zampoglou, M., Papadopoulos, S., Kompatsiaris, Y.: Detecting image splicing in the wild (web). In: 2015 IEEE International Conference on Multimedia & Expo Workshops (ICMEW). pp. 1–6. IEEE (2015)
[67] Zhang, D., Huang, F., Liu, S., Wang, X., Jin, Z.: Swinfir: Revisiting the swinir with fast fourier convolution and improved training for image super-resolution. arXiv preprint arXiv:2208.11247 (2022)
[68] Zhang, J., Liu, H., Yang, K., Hu, X., Liu, R., Stiefelhagen, R.: Cmx: Cross-modal fusion for rgb-x semantic segmentation with transformers. IEEE Transactions on Intelligent Transportation Systems (2023)
[69] Zhang, K., Zuo, W., Chen, Y., Meng, D., Zhang, L.: Beyond a gaussian denoiser: Residual learning of deep cnn for image denoising. IEEE transactions on image processing 26(7), 3142–3155 (2017)
[70] Zhou, P., Han, X., Morariu, V.I., Davis, L.S.: Learning rich features for image manipulation detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 1053–1061 (2018)
[71] Zhu, M., Chen, H., Yan, Q., Huang, X., Lin, G., Li, W., Tu, Z., Hu, H., Hu, J., Wang, Y.: Genimage: A million-scale benchmark for detecting ai-generated image. arXiv preprint arXiv:2306.08571 (2023)

Supplementary Document

In this supplementary document, we include further details of our work: (1) the details of the proposed GIM dataset and a comparison with related work (Sec. 6); (2) the details of the architecture and implementation (Sec. 7); (3) additional ablation analysis of GIMFormer (Sec. 8); (4) more visualizations of the GIM dataset (Sec. 9); (5) the societal impact of our work (Sec. 10); (6) the ethics statement of our work (Sec. 11).

6 GIM Dataset

6.1 GIM Dataset Configuration

The specific quantities and details of the dataset are presented in Table 9. GIM has a total of 1,140k generatively manipulated images with their corresponding original images. Moreover, the dataset can be split into subsets. GIM-SD-all utilizes the training set from ImageNet-1k [10] and employs the Stable Diffusion [53] generator for image manipulation; it is dedicated to manipulation research. Considering the data scale, GIM-SD, GIM-GLIDE and GIM-DDNM each select a random set of 100 classes from the ImageNet-1k training dataset, together with the entire test data, to generate their respective training and test sets for benchmarking image manipulation detection and location (IMDL) methods. GIM-SD (VOC) is a cross-data-distribution test set constructed from the test set of the PASCAL VOC [14] dataset using Stable Diffusion.

Table 8: Parameters for GIM degradation. Three degradation methods are employed with variable parameters to emulate real-world scenarios.

Post-processing	Parameter
JPEG compression	75, 80, 90
Gaussian blur	3, 5
Downsampling	0.5, 0.67

Table 9: Basic configuration about the GIM Dataset. GIM has five subsets, leveraging various generators and data sources.

GIM	Tampering Type	Number of Trainset	Number of Testset
GIM-SD-all	reprint	1890K	-
GIM-SD	reprint	180K	70K
GIM-GLIDE	reprint	140K	70K
GIM-DDNM	removal	125K	50K
GIM-SD(VOC)	reprint	-	4.5K

To simulate real-world scenarios and comprehensively evaluate the IMDL methods, GIM incorporates three degradation methods (JPEG compression, Gaussian blur, and downsampling). Table 8 presents the parameters for each degradation group, which are randomly chosen.
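As a rough illustration of the random parameter choice, the sketch below draws one degradation type and one parameter from the grids in Table 8. Whether a single degradation is applied per image and whether the choice is uniform are assumptions here; the names `DEGRADATIONS` and `sample_degradation` are hypothetical.

```python
import random

# Parameter grids copied from Table 8: JPEG quality, Gaussian blur kernel size,
# and downsampling scale factor.
DEGRADATIONS = {
    "jpeg_compression": (75, 80, 90),
    "gaussian_blur": (3, 5),
    "downsampling": (0.5, 0.67),
}


def sample_degradation(rng: random.Random):
    """Pick one degradation type and one of its parameters uniformly at random."""
    name = rng.choice(sorted(DEGRADATIONS))
    return name, rng.choice(DEGRADATIONS[name])


rng = random.Random(0)  # fixed seed only so the sketch is repeatable
for _ in range(3):
    print(sample_degradation(rng))
```

The sampled pair would then be handed to an image library of choice to perform the actual compression, blur, or resize.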

The GIM data format consists of JPG images paired with PNG labels. Filenames ending with "_f" indicate tampered images. The masks segmented by SAM [33] serve as ground truth for manipulation detection, containing two labels denoted by 0 and 255, representing the original and manipulated categories, respectively.
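The naming and label conventions described above can be handled with two small helpers. This is a sketch under the stated conventions only; the example filenames are hypothetical, and the mask is represented as a plain nested list rather than a decoded PNG.

```python
from pathlib import PurePath


def is_tampered(filename: str) -> bool:
    """True when the file stem ends with the "_f" suffix used for tampered images."""
    return PurePath(filename).stem.endswith("_f")


def mask_to_binary(mask_rows):
    """Map mask values {0, 255} to class indices {0, 1}; reject anything else."""
    binary = []
    for row in mask_rows:
        for value in row:
            if value not in (0, 255):
                raise ValueError(f"unexpected mask value: {value}")
        binary.append([1 if value == 255 else 0 for value in row])
    return binary


# Hypothetical filenames, for illustration only.
print(is_tampered("example_0001_f.jpg"))  # True
print(is_tampered("example_0001.jpg"))    # False
print(mask_to_binary([[0, 255], [255, 0]]))  # [[0, 1], [1, 0]]
```

Rejecting values other than 0 and 255 guards against masks that were accidentally re-encoded with lossy compression.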

6.2 Dataset Comparison

GIM focuses on generative local manipulation in natural images and provides manipulated region labels, specifically for image manipulation detection and localization tasks. This distinguishes GIM from existing datasets that focus on faces or are limited to detection (classification). As shown in Table 10, we compare several local image manipulation datasets. The existing image forensic datasets are primarily constructed using traditional manipulation techniques, and commonly employ manual or random manipulation to generate data. The AutoSplice [29], HiFi-IFDL [20], and CoCoGlide [19] datasets utilize generative algorithms for image manipulation. However, they rely on one type of manipulation and suffer from a small data volume.

In contrast, GIM achieves image manipulation by employing generative methods for localized image alteration. It encompasses a variety of generators and manipulation methods. Simultaneously, it stands as a large-scale comprehensive dataset for the IMDL community.

Table 10: Summary of previous image manipulation datasets and GIM. We showcase the number of data entries within each dataset and the manipulation techniques they encompass. * denotes that HiFi-IFDL includes entirely synthesized images.

Dataset	Num. Images		Handcraft Manipulations			AIGC Manipulations
Dataset	# Authentic Image	# Forged Image	Splicing	Copy-move	Removal	Removal	Generation
Columbia Gray [43]	933	912	✓	✗	✓	✗	✗
Columbia Color [24]	183	180	✓	✗	✓	✗	✗
MICC-F2000 [1]	1,300	700	✓	✗	✓	✗	✗
VIPP Synth. [1]	4,800	4,800	✓	✗	✓	✗	✗
CASIA V1.0 [13]	800	921	✓	✓	✓	✗	✗
CASIA V2.0 [13]	7,200	5,123	✓	✓	✓	✗	✗
Wild Web [66]	90	9,657	✓	✓	✓	✗	✗
NC2016 [17]	560	564	✓	✓	✓	✗	✗
NC2017 [17]	2,667	1,410	✓	✓	✓	✗	✗
MFC2018 [17]	14,156	3,265	✓	✓	✓	✗	✗
MFC2019 [17]	10,279	5,750	✓	✓	✓	✗	✗
PS-Battles [22]	11,142	102,028	✓	✓	✓	✗	✗
DEFACTO [42]	0	229,000	✓	✓	✓	✗	✗
IMD2020 [45]	35,000	35,000	✓	✓	✓	✗	✗
SP COCO [36]	0	200,000	✓	✗	✗	✗	✗
CM COCO [36]	0	200,000	✓	✗	✗	✗	✗
CM RAISE [36]	0	200,000	✓	✓	✗	✗	✗
HiFi-IFDL[20]	-	1,000,00*	✗	✓	✗	✓	✗
CoCoGlide [19]	0	512	✗	✗	✗	✓	✗
AutoSplice [29]	3,621	2,273	✗	✗	✗	✗	✓
GIM	1,140,000	1,140,000	✗	✗	✗	✓	✓

7 Implementation Details

7.1 Architecture

The GIMFormer backbone has an encoder-decoder architecture. The encoder is a dual-branch, four-stage encoder. For the input RGB image, ShadowTracer extracts a learned manipulation trace map of the same resolution as the image. ShadowTracer adopts the architecture of [69] with 15 trainable layers, 3 input channels, and 1 output channel. Then, both the RGB image and the trace map are fed into the parallel network, where the four-stage structure is employed to extract pyramidal features $F_{i}$ ($i\in[1,4]$). At each stage, the RGB branch is first processed by the Frequency-Spatial Block (FSB), and then both branches are processed by Transformer Blocks [63] and the Multi-window Anomalous Modelling (MWAM) module. The learnable parameters within the FSB have the same resolution as the input features. The window sizes within the MWAM at the four stages are {3, 7, 9}, {7, 11, 15}, {9, 17, 25}, and {11, 21, 31}, respectively.

The Transformer blocks are based on the Mix Transformer encoder B2 (MiT-B2) proposed for semantic segmentation and are pretrained on ImageNet. The Mix Transformer encoder uses self-attention and channel-wise operations, prioritizing spatial convolutions over positional encodings. To integrate information from the two branches, the cross-modal Feature Rectification Module (FRM) and Feature Fusion Module (FFM) [68] are utilized.

For location, we employ the All-MLP decoder proposed in [63], a lightweight architecture formed by only 1 $\times$ 1 convolution layers and bilinear up-samplers. For detection, we adopt the tail-end section of the lightweight backbone in [57]. It applies convolutional, batch normalization, and activation layers to extract features, which are then pooled and passed through fully connected layers to generate the classification predictions.
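The MWAM computation itself is not spelled out in this section, so the following is only one plausible reading, not the authors' implementation. It assumes that each window size contributes the difference between a feature value and the average (or maximum) of its local window, and it applies the first listed set of window sizes to a toy single-channel map. All function names are hypothetical.

```python
# Window sizes per encoder stage, as listed in the text above.
STAGE_WINDOWS = [(3, 7, 9), (7, 11, 15), (9, 17, 25), (11, 21, 31)]


def window_values(fmap, r, c, k):
    """Values of the k x k window centred at (r, c), clipped at the borders."""
    h, w, half = len(fmap), len(fmap[0]), k // 2
    return [
        fmap[i][j]
        for i in range(max(0, r - half), min(h, r + half + 1))
        for j in range(max(0, c - half), min(w, c + half + 1))
    ]


def multi_window_differences(fmap, windows):
    """For each window size, return (value - local mean, value - local max) maps."""
    out = {}
    for k in windows:
        avg_diff, max_diff = [], []
        for r, row in enumerate(fmap):
            a_row, m_row = [], []
            for c, value in enumerate(row):
                vals = window_values(fmap, r, c, k)
                a_row.append(value - sum(vals) / len(vals))
                m_row.append(value - max(vals))
            avg_diff.append(a_row)
            max_diff.append(m_row)
        out[k] = (avg_diff, max_diff)
    return out


# Toy 5x5 map: flat background with a single outlier at the centre.
toy = [[0.0] * 5 for _ in range(5)]
toy[2][2] = 1.0
diffs = multi_window_differences(toy, STAGE_WINDOWS[0])
print(diffs[3][0][2][2])  # centre value minus its 3x3 local mean
print(diffs[3][1][1][1])  # neighbour value minus its 3x3 local max
```

In this toy case the outlier stands out in the average-difference map at every window size, while its neighbours are flagged by the maximum-difference map, which hints at why several window sizes and both branches could be complementary.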

7.2 Training Process

We conduct our experiments with PyTorch 1.7.0. All models are trained on a node with 8 or 4 V100 GPUs.

For the training of the ShadowTracer network $g_{\phi}$, we randomly select 20,000 authentic images from ImageNet and generate the corresponding manipulated images. During comprehensive performance verification, all three generators (Stable Diffusion [52], GLIDE [44], DDNM [59]) are used, whereas for generalization verification, only Stable Diffusion is employed. Throughout the training, the alpha blending parameter ranges between 0.5 and 1.0, randomly producing blended images that are subjected to three degradation types (JPEG compression, downsampling, Gaussian blur). The network is trained on 40 $\times$ 40 pixel patches randomly sampled from the dataset. Training is conducted for roughly 300,000 iterations with a batch size of 64. An Adam [31] optimizer is employed, initialized with a learning rate of 0.001.
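The blending and patch sampling steps can be sketched as below. The exact blending formula is not given in the text; this sketch assumes the standard convex combination, with alpha weighting the manipulated image and drawn uniformly from [0.5, 1.0]. The function names and toy images are hypothetical.

```python
import random


def alpha_blend(authentic, manipulated, alpha):
    """Pixel-wise blend: alpha * manipulated + (1 - alpha) * authentic."""
    return [
        [alpha * m + (1.0 - alpha) * a for a, m in zip(row_a, row_m)]
        for row_a, row_m in zip(authentic, manipulated)
    ]


def sample_patch_origin(height, width, patch=40, rng=random):
    """Top-left corner of a random patch that fits inside the image."""
    if height < patch or width < patch:
        raise ValueError("image smaller than patch")
    return rng.randint(0, height - patch), rng.randint(0, width - patch)


rng = random.Random(0)  # fixed seed only so the sketch is repeatable
alpha = rng.uniform(0.5, 1.0)           # blending range stated in the text
authentic = [[0.0, 0.0], [0.0, 0.0]]    # toy 2x2 single-channel images
manipulated = [[1.0, 1.0], [1.0, 1.0]]
blended = alpha_blend(authentic, manipulated, alpha)
top, left = sample_patch_origin(512, 512, patch=40, rng=rng)
print(alpha, blended[0][0], (top, left))
```

With alpha equal to 1.0 the blended image is the manipulated image itself; smaller values attenuate the generative trace, which presumably exposes the network to weaker artifacts during training.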

For the training of the GIMFormer backbone network, the input image is cropped to 512 $\times$ 512. We train our models for 40 epochs on the GIM dataset. The batch size is 4 on each GPU. The images are augmented by random resizing with a ratio of 0.5–2.0 and random horizontal flipping.

8 Additional Ablation Analysis

The robustness of GIMFormer. To evaluate the robustness of GIMFormer, we sample 5000 images from the original three generator test sets without degradations. The model is trained on the mixed-generator dataset. The experimental results are presented in Table 11. GIMFormer demonstrates robustness against various distortion techniques, which underscores its ability to maintain stable performance under image distortions.

Table 11: Robustness experiment of pixel-level Manipulation Localization F1(%) with various distortions

Method	No Distortion	Cmp (q=90)	Cmp (q=80)	Cmp (q=75)	Blur (k=3)	Blur (k=5)	Downsample (0.66X)	Downsample (0.5X)
GIMFormer	61.7	60.3	60.1	59.9	58.7	61.0	60.8

The effect of the number of windows (N). We conduct a set of ablation experiments to study the performance of the MWAM module. To ensure fair comparisons, the experiments differ from each other only in the window setting. Experiments are carried out on the GIM-SD trainset and testset. As shown in Table 12, the tampering location performance generally improves as the number of windows increases, while the impact of window size variations remains relatively minor. For the sake of efficiency, we do not analyze more windows. The results indicate that a favorable balance between accuracy and efficiency is achieved when $N$ = 3.

Table 12: Ablation results on the GIM-SD test dataset using different variants of GIMFormer. All detection Cls.Acc(%), localization AUC(%), and F1(%) scores are reported. $D_{a/m}^{k}$ represents the number and dimensions of windows utilized in either the average or maximum branches. We investigate window counts $N$ of 0, 1, 2, and 3, as well as the impact of using only one branch in the module. In this table, the window sizes of the first layer are used to annotate, with subsequent layers decreasing in size.

MWAM Variants	Cls.Acc	F1	AUC
w/o windows	72.13	58.12	88.81
{ $D_{a\&m}^{11}$ }	74.33	58.34	89.65
{ $D_{a\&m}^{11},D_{a\&m}^{21}$ }	76.61	60.03	89.92
{ $D_{a\&m}^{11},D_{a\&m}^{21},D_{a\&m}^{31}$ }	78.96	61.75	90.61
{ $D_{a\&m}^{17},D_{a\&m}^{29},D_{a\&m}^{41}$ }	77.33	61.17	90.11
{ $D_{a}^{11},D_{a}^{21},D_{a}^{31}$ }	77.95	60.77	90.03
{ $D_{m}^{11},D_{m}^{21},D_{m}^{31}$ }	77.83	59.13	89.53
{ $D_{a\&m}^{11},D_{a\&m}^{21},D_{a\&m}^{31},D_{a\&m}^{41}$ }	78.11	61.15	90.71

9 More Visualizations on GIM

In this section, we present additional visualizations of the GIM dataset in Figures 7, 8, and 9, showing samples from GIM-SD, GIM-GLIDE, and GIM-DDNM, respectively.

10 Societal Impact

Our research has a positive societal impact by addressing the challenge of detecting and locating generative-based manipulations. We introduce a reliable database, GIM, aimed at enhancing the security of AI-generated content (AIGC). This database facilitates the training of Image Manipulation Detection and Localization (IMDL) methods, allowing them to generalize across various scenarios. Consequently, our algorithm GIMFormer can help increase public trust in media content. GIM and GIMFormer are beneficial for digital media forensics, especially for generative manipulation with real-world degradations.

11 Ethics Statement

Our GIM is based on ImageNet and VOC. No additional personally identifiable information or sensitive personally identifiable information is introduced during the production of fake images in the GIM dataset. During the dataset production, we do not introduce extra information containing exacerbated bias against people of a certain gender, race, sexuality, or who have other protected characteristics. The ethical issues in the ImageNet and VOC datasets have been discussed in previous works. Crawford et al. [8] discuss issues with ImageNet. The first issue is the political nature of all taxonomies or classification systems, where terms like "male" and "female" are considered "natural," while "hermaphrodite" is offensively placed within the branch of Person > Sensualist > Bisexual alongside "pseudohermaphrodite" and "switch hitter" categories. The second issue concerns offensive images of real people, while the third is the use of people’s photos without their consent by ImageNet creators.