email: {chenyirui,weiliucv}@sjtu.edu.cn [email protected] {huangxudong9,wei.lee,hujie23}@huawei.com
GIM: A Million-scale Benchmark for Generative Image Manipulation Detection and Localization
Abstract
The extraordinary ability of generative models has emerged as a new trend in image editing and realistic image generation, posing a serious threat to the trustworthiness of multimedia data and driving research on image manipulation detection and location (IMDL). However, the lack of a large-scale data foundation holds back progress on the IMDL task. In this paper, we design a local manipulation pipeline that incorporates the powerful SAM, ChatGPT and generative models. On this basis, we propose the GIM dataset, which has the following advantages: 1) Large scale, including over one million pairs of AI-manipulated images and real images. 2) Rich image content, encompassing a broad range of image classes. 3) Diverse generative manipulation, with images manipulated by state-of-the-art generators across various manipulation tasks. These advantages allow for a more comprehensive evaluation of IMDL methods, extending their applicability to diverse images. We introduce two benchmark settings to evaluate the generalization capability and comprehensive performance of baseline methods. In addition, we propose a novel IMDL framework, termed GIMFormer, which consists of a ShadowTracer, a Frequency-Spatial Block (FSB), and a Multi-window Anomalous Modelling (MWAM) Module. Extensive experiments on GIM demonstrate that GIMFormer significantly surpasses previous state-of-the-art works on two different benchmarks. Code page: GIM.
Keywords: Dataset and Benchmark · Image Manipulation Detection and Location · Forgery Traces Extraction · Multi-scale Feature Learning
1 Introduction
![Refer to caption](x1.png)
Images are among the most essential media for information transmission in modern society and are widely spread on public platforms such as news and social media. With the rapid advancement of generative methods [11, 52, 16, 4], such natural information can be more easily tampered with for specific purposes, such as altering an object or person. Images of this kind are particularly convincing because of their visual comprehensibility, which leads to serious information security risks in society. For instance, an AI-generated image of black smoke billowing from what appeared to be a government building near the Pentagon set off investor fears, sending stocks tumbling [46]. Therefore, it is urgent to develop effective methods that examine whether an image has been modified by generative models while also identifying the exact location of the tampering. However, traditional image manipulation detection and location (IMDL) datasets [55, 13, 60] overlook powerful generative models, and generative IMDL datasets are limited in scale. They are thus not sufficient to comprehensively evaluate the performance of IMDL methods or to benefit the IMDL community.
To this end, we propose a million-level generative IMDL dataset, termed GIM, to provide a reliable database for AI Generated Content (AIGC) security. GIM leverages diffusion models [44, 52, 59] and SAM [32, 51], with ImageNet [10] and VOC [14] as the input. SAM is used to locate the tampering region, and diffusion models paint plausible content in the tampering area. The high-quality diffusion models ensure dataset fidelity, and the rich categories guarantee the diversity of images in GIM. GIM contains a total of over one million generative tampered images. Based on this, the IMDL methods are evaluated and benchmarked. To develop an appropriate benchmark scale, we explore the impact of different amounts of manipulated training data. The final benchmark contains about 300k manipulated images with their tampering masks. To simulate real-world conditions, we investigate the effect of degradation and apply random degradations to the images. Two benchmark settings are designed to evaluate comprehensive performance and generalization capability. In addition, we re-implement existing IMDL methods in a fair manner, providing a basis for future development. Overall, as a generative IMDL benchmark, GIM possesses the following advantages: 1) GIM has a large and reasonable data scale, including rich image categories and contents. 2) GIM contains various generative manipulation models and tasks. 3) GIM proposes two settings for verifying the generalization capability and comprehensive performance of existing methods.
Existing methods emphasize traditional image manipulations, also known as “cheapfakes”. Generative manipulations introduce drastic alterations in content with no apparent frequency or structural inconsistency. To address this issue, we introduce GIMFormer, a transformer-based framework for generative IMDL. The ShadowTracer is designed to embed the nuanced artifacts inherent in generative tampering and serves as prior information. The Frequency-Spatial Block harvests the manipulation clues in the frequency and spatial domains. Furthermore, the Multi Windowed Anomalous Modelling Module captures local inconsistencies at different scales to refine the features. By doing so, our model extracts features from both the RGB image and the learned tampering trace map, captures details from the frequency and spatial domains, and models inconsistencies at different scales for precise manipulation detection and location.
We conduct experiments on our proposed GIM. Both the qualitative and quantitative results demonstrate that GIMFormer can significantly outperform previous state-of-the-art methods in terms of both comprehensive capability and generalization ability.
In summary, our main contributions are as follows:
- We build a local manipulation pipeline and construct a comprehensive large-scale dataset with various SOTA generative models and manipulation tasks to further facilitate research on the IMDL task.
- We investigate the impact of data scale and degradation on IMDL tasks to select an appropriate scale and simulate real-world scenarios. Based on this, we construct a reasonable benchmark and two settings to evaluate the generalization and comprehensive capabilities of IMDL methods.
- We propose GIMFormer, a generative IMDL framework consisting of a ShadowTracer to embed subtle traces, a Frequency-Spatial Block to extract forgery features in the frequency and spatial domains, and a Multi-Windowed Anomalous Modelling Module to capture pixel-level inconsistencies at multiple scales.
- Extensive experiments show that GIMFormer achieves state-of-the-art IMDL performance in terms of generalization and comprehensive capabilities.
2 Related Work
2.1 Image Forensic Datasets
The research community has dedicated significant efforts to establishing robust datasets for image forensics. Early datasets primarily focus on one type of manipulation. The Columbia dataset [43, 24] is the pioneering dataset focusing on splicing forgeries. CASIA v1.0 and CASIA v2.0 [13] are the first to incorporate multiple types of manipulations within a single dataset, with forged images manually crafted using Adobe Photoshop. The Wild Web [66] dataset collects forged images from the Internet, significantly surpassing the scale of previous datasets. The NIST [17] datasets provide an extensive collection that serves as a crucial standard for assessing media tampering detection methods. DEFACTO [42] is automatically generated from the Microsoft COCO dataset, encompassing four categories of forgeries: Copy and Move, Splicing, Object Removal, and Morphing. IMD2020 [45] provides a range of locally manipulated images generated through manual operations or random slicing, while also collecting online images without apparent manipulation traces. Recently, HiFi-IFDL [20] constructs a hierarchical fine-grained dataset containing some representative forgery methods. AutoSplice [29] leverages the capabilities of large-scale language-image models like DALL-E2 [48] to facilitate automatic image editing and generation. CocoGlide [19] contains 512 images generated using the GLIDE diffusion model. Some datasets focus exclusively on facial manipulations [54, 18, 9] or entirely synthesized images [71, 56, 3].
However, these datasets have certain limitations, such as small data sizes and limited manipulation techniques or tasks. Recent advances in generative models [44, 23, 52] have demonstrated remarkable abilities in generative-based manipulation. These developments have led to the emergence of open-source projects [51, 65, 15] for manipulation. Leveraging the power of these models, we introduce GIM, a large-scale generative-based manipulation dataset.
![Refer to caption](x2.png)
2.2 Image Manipulation Detection and Localization
Early studies on natural image manipulation localization mainly focus on detecting a specific type of manipulation [7, 27, 34, 50]. Because the exact manipulation type is unknown in real-world scenarios, most state-of-the-art methods [40, 19, 58, 64, 38, 28, 61] primarily concentrate on general manipulation. RGB-N [70] proposes a dual-stream network for manipulation detection. One stream focuses on extracting RGB features and the other leverages noise features to model the disparities. MVSS-Net [12] utilizes a two-stream CNN to extract noise features and employs the Dual Attention Module to merge the output of the two-stream CNN. PSCC-Net [40] extracts hierarchical features with a top-down path and detects whether the input image has been manipulated using a bottom-up path. ObjectFormer [58] uses object prototypes to model object-level consistencies and finds patch-level inconsistencies to detect the manipulation. However, these methods focus on "cheapfake" detection and encounter challenges when applied to generative manipulation, because the tampering traces are subtle and the methods lack an understanding of generative manipulation patterns. Certain works [26, 6, 9, 39, 2] focus on localizing manipulations using generative models, yet these methods primarily concentrate on human faces.
In this work, we leverage a deep network to embed the subtle artifacts inherent in generative tampering as a prior trace map. We then employ a dual network to fuse the trace map and RGB image, combining frequency and spatial information and capturing pixel-level inconsistencies at multiple scales to detect and locate the manipulation.
| Dataset | Image Content | Image Size | # Authentic Images | # Forged Images | Traditional | Gen. Reprint | Gen. Removal |
|---|---|---|---|---|---|---|---|
| FaceForensics++ [54] | Face | 480p-1080p | 1000 | 4000 | ✗ | ✓ | ✗ |
| DFDC [9] | Face | 240p-2160p | 19,154 | 100,000 | ✗ | ✓ | ✗ |
| DeeperForensics [30] | Face | 1080p | 50,000 | 10,000 | ✗ | ✓ | ✗ |
| Columbia Gray [43] | General | 128×128 | 933 | 912 | ✓ | ✗ | ✗ |
| CASIA V2.0 [13] | General | 320×240-800×600 | 7,200 | 5,123 | ✓ | ✗ | ✗ |
| IMD2020 [45] | General | 1062×866 | 35,000 | 35,000 | ✓ | ✗ | ✗ |
| Coverage [60] | General | 400×486 | 100 | 100 | ✓ | ✗ | ✗ |
| AutoSplice [29] | General | 256×256-4232×4232 | 3,621 | 2,273 | ✗ | ✓ | ✗ |
| HiFi-IFDL [20] | General | - | - | 1,000,000 | ✓ | ✓ | ✗ |
| CocoGlide [19] | General | 256×256 | - | 512 | ✗ | ✓ | ✗ |
| GIM | General | 64×64-6000×3904 | 1,140,000 | 1,140,000 | ✗ | ✓ | ✓ |
3 Dataset and Benchmark Construction
In Section 3.1, we propose an automatic data synthesis pipeline that efficiently produces generative manipulated images from unlabeled images. Leveraging this pipeline, we construct a comprehensive large-scale dataset. To build a benchmark of reasonable scale for evaluating IMDL methods, we examine the impact of training data scale in Section 3.2. To emulate real-world conditions and establish an objective benchmark, the manipulated images and original authentic images are subjected to three random degradations (JPEG compression, downsampling, and Gaussian blur), as detailed in Section 3.3. The GIM benchmark comprises over 320k manipulated images paired with authentic images (100 distinct labels in ImageNet for each generative model) for algorithm evaluation. The generated images are shown in Figure 1 with their original images and tampering masks. The specific composition of GIM can be found in the Supplementary Materials. In Section 3.4, we introduce the criteria and settings used to evaluate the performance of IMDL methods.
3.1 Generation and Details of the dataset
To construct a generative IMDL dataset, a large-scale natural image dataset is needed. We choose ImageNet [10] and VOC [14] as the starting point for our research because they offer the following advantages: 1) Large and diverse. 2) Wide category coverage. We therefore use these datasets as the source images for generative manipulation. In addition to covering various scenarios, the proposed dataset utilizes a wealth of tampering methods. Specifically, we reprint the target class with generative inpainting methods [52, 44] or remove the target region using the model of [59]. Building on open-source projects [33, 51], we develop our data generation pipeline.
Figure 2 illustrates the overall process. First, given the class label or a user query, the local manipulation mask is extracted by the zero-shot segmentation network [33]. For reprint tampering, the image category is embedded in the replacement prompt and sent to ChatGPT, which returns an approximate category. The approximate category is then embedded into the inpainting prompt. Given the original image, manipulation mask, and inpainting prompt, the generative model produces the reprint tampered result. For removal tampering, the generative model requires only the original image and manipulation mask. The GIM dataset utilizes the entire ImageNet dataset and the Stable Diffusion generator to construct a comprehensive dataset, providing a reliable database and laying the foundation for the research below.
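The control flow of this pipeline can be sketched as follows. This is a minimal illustration, not the authors' implementation: `segment`, `advise` and `inpaint` are hypothetical callables standing in for the segmentation network, ChatGPT and the generative model, and the prompt wording is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ManipulationResult:
    image: object            # tampered image returned by the generator
    mask: object             # manipulation mask used for the edit
    task: str                # "reprint" or "removal"
    prompt: Optional[str]    # inpainting prompt (None for removal)


def manipulate(image, label, task, segment, advise, inpaint):
    """Run one reprint or removal manipulation on a labeled image.

    segment(image, query) -> mask, advise(prompt) -> category and
    inpaint(image, mask, prompt) -> image are hypothetical stand-ins.
    """
    mask = segment(image, label)
    if task == "reprint":
        # Assumed prompt wording: ask for a category close to the original one.
        replacement_prompt = f"Name one object category similar to '{label}'."
        approx_category = advise(replacement_prompt)
        inpaint_prompt = f"a photo of a {approx_category}"
        tampered = inpaint(image, mask, inpaint_prompt)
        return ManipulationResult(tampered, mask, task, inpaint_prompt)
    if task == "removal":
        # Removal needs only the image and the mask, no text prompt.
        return ManipulationResult(inpaint(image, mask, None), mask, task, None)
    raise ValueError(f"unknown task: {task}")
```

Passing the three models in as callables keeps the sketch independent of any particular segmentation or diffusion library.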
3.2 Analysis of Benchmark Scale
With the data generation pipeline described above, we can generate millions, tens of millions, or even billions of samples. Nonetheless, blindly increasing the amount of data does not necessarily improve algorithm performance and may lead to data redundancy. Therefore, taking data generated by Stable Diffusion as an example, we explore the influence of data volume and number of categories on baseline classification [21] and segmentation [63] algorithms. The training sets are training subsets of different scales, and all models are validated on the same test set. As shown in Table 2, the classification and segmentation metrics improve as the dataset scale increases. When the scale reaches 180K, the algorithm performance is almost saturated: adding either image classes or more data leaves the metrics essentially unchanged. These experiments demonstrate that increasing the amount of data or the number of categories brings negligible benefits once the data is close to saturation. According to this analysis, for the training data, the GIM benchmark selects 100 different labels from ImageNet for each generation model to create tampered images. For the test data, the GIM benchmark uses the entire test dataset for each generation model.
| Total Num. of Images | Image Classes | Images per Class | Cls.Acc | Seg.F1 | Seg.AUC |
|---|---|---|---|---|---|
| 2,800 | 10 | 280 | 0.6306 | 0.2550 | 0.7950 |
| 28,000 | 100 | 280 | 0.7480 | 0.3247 | 0.7995 |
| 180,000 | 100 | 1800 | 0.9129 | 0.5289 | 0.8699 |
| 360,000 | 200 | 1800 | 0.9131 | 0.5291 | 0.8701 |
| 500,000 | 500 | 1000 | 0.9131 | 0.5291 | 0.8703 |
| Degradation | Cls.Acc | Seg.F1 |
|---|---|---|
| - | 91.29 | 52.89 |
| JPEG | 83.10 | 45.74 |
| Gaussian Blur | 90.94 | 40.14 |
| Downsample | 88.34 | 37.12 |
3.3 Post-processing of Degradation Method
After being spread on the Internet, images encounter various post-processing operations, such as JPEG compression, downsampling, and Gaussian blur. These transformations can pose challenges for image forensics methods. Table 3 reports the impact of degradation. The baseline models are trained on the clean Stable Diffusion training set and tested on the same test set with a specific degradation applied. The experimental results indicate that these degradations make identification more difficult. To narrow the gap to real-world conditions and build an objective benchmark, random degradation (JPEG compression, downsampling, Gaussian blur) is applied to the dataset. The details of the parameters can be found in the Supplementary Materials.
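A rough sketch of such a random degradation step is given below, using plain Python lists as grayscale images. It is illustrative only: the blur strength and downsampling factor are placeholder assumptions (the actual parameters are in the Supplementary Materials), and JPEG compression is left out because it needs an image codec; it could be registered in `ops` as one more callable.

```python
import math
import random


def gaussian_kernel(sigma, radius):
    """Normalized 1D Gaussian kernel of length 2 * radius + 1."""
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def gaussian_blur(img, sigma=1.0, radius=2):
    """Separable Gaussian blur on a 2D list, replicating edge pixels."""
    k = gaussian_kernel(sigma, radius)
    h, w = len(img), len(img[0])

    def clamp(v, size):
        return max(0, min(size - 1, v))

    # Horizontal pass, then vertical pass.
    rows = [[sum(k[j + radius] * img[y][clamp(x + j, w)]
                 for j in range(-radius, radius + 1))
             for x in range(w)] for y in range(h)]
    return [[sum(k[j + radius] * rows[clamp(y + j, h)][x]
                 for j in range(-radius, radius + 1))
             for x in range(w)] for y in range(h)]


def downsample(img, factor=2):
    """Average-pool downsampling by an integer factor."""
    h, w = len(img) // factor, len(img[0]) // factor
    return [[sum(img[y * factor + dy][x * factor + dx]
                 for dy in range(factor) for dx in range(factor)) / (factor * factor)
             for x in range(w)] for y in range(h)]


def random_degrade(pair, ops, rng):
    """Pick one degradation at random and apply it to both images of a pair."""
    name = rng.choice(sorted(ops))
    return name, tuple(ops[name](img) for img in pair)
```

Applying the same operation to the manipulated image and its authentic counterpart keeps the pair comparable; when downsampling is chosen, the tampering mask would need the same resizing.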
3.4 Benchmark Settings
Dataset: Datasets for research on generative IMDL tasks are lacking, which makes direct comparison and accurate measurement difficult. We use three cross-generator datasets, GIM-SD (data from ImageNet manipulated by Stable Diffusion, and similarly hereinafter), GIM-GLIDE and GIM-DDNM, and a cross-distribution dataset, GIM-VOC (data from VOC manipulated by Stable Diffusion), to present benchmarking results.
Metrics: We evaluate the performance of the proposed method on both the image manipulation detection task and localization task. For classification results, we use Accuracy (Cls.acc) as our evaluation metric, while for localization, the pixel-level AUC and F1 score on manipulation masks are adopted. Since binary masks and detection scores are required to compute F1 scores, we adopt the Equal Error Rate (EER) threshold to binarize them.
Settings: Two settings are proposed to verify the comprehensive and generalization performance of the algorithms. In the cross-generator generalization setting, the models are trained on the GIM-SD training set and tested on the GIM-GLIDE, GIM-DDNM and GIM-VOC test sets to explore the generalization IMDL performance, as shown in Table 5. In the mix-generator comprehensive setting, the models are jointly trained on the GIM-SD, GIM-GLIDE and GIM-DDNM training sets and tested on the corresponding test sets respectively to evaluate the comprehensive IMDL performance, as shown in Table 4.
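The localization metric described under Metrics can be illustrated with a small sketch that picks the Equal Error Rate (EER) threshold and then computes the pixel-level F1 score. The brute-force threshold search and the `>=` convention for a positive prediction are assumptions of this sketch, and both classes must be present among the labels.

```python
def eer_threshold(scores, labels):
    """Return the score threshold where the false-positive and
    false-negative rates are closest (approximate EER point)."""
    pos = sum(labels)
    neg = len(labels) - pos
    best_t, best_gap = None, float("inf")
    for t in sorted(set(scores)):
        fp = sum(1 for s, l in zip(scores, labels) if s >= t and l == 0)
        fn = sum(1 for s, l in zip(scores, labels) if s < t and l == 1)
        gap = abs(fp / neg - fn / pos)
        if gap < best_gap:
            best_t, best_gap = t, gap
    return best_t


def f1_score(scores, labels, threshold):
    """Pixel-level F1 after binarizing the scores at the given threshold."""
    tp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l == 1)
    fp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l == 0)
    fn = sum(1 for s, l in zip(scores, labels) if s < threshold and l == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Flattened predicted mask scores and ground-truth mask pixels.
scores = [0.1, 0.2, 0.6, 0.4, 0.8, 0.9]
labels = [0, 0, 0, 1, 1, 1]
t = eer_threshold(scores, labels)
f1 = f1_score(scores, labels, t)
```

In this toy example the threshold lands at 0.6, where one authentic pixel is flagged and one manipulated pixel is missed, giving an F1 of 2/3.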
4 Method
We propose the IMDL framework GIMFormer, which utilizes a dual encoder and decoder architecture. Considering the specifics of generative manipulation, we propose the ShadowTracer in Section 4.1, the Frequency-Spatial Block (FSB) in Section 4.2, and the Multi Windowed Anomalous Modelling module (MWAM) in Section 4.3. Figure 3 gives an overview of the framework. For the input RGB image $X$, we first extract its learned trace map $T$. Then, both $X$ and $T$ are fed into a two-branch network, where a four-stage structure is used to extract pyramidal features ($F_1$ to $F_4$). The RGB branch is composed of the FSB, a Transformer Block [63] and the MWAM. The tracer branch consists of a Transformer Block and the MWAM. In the fusion step, the feature rectification module (FRM) [68] and feature fusion module (FFM) [68] are used for feature fusion. The four-stage fused features are forwarded to the decoder for the final detection $\hat{y}$ and location $\hat{M}$. Details are provided in the Supplementary.
![Refer to caption](x3.png)
4.1 ShadowTracer
Prior manipulation detection methods mainly focus on “cheapfake” scenes and rely on visible traces. These artifacts include distortions and sudden changes caused by manipulation of the image structure. However, generative tampering makes significant alterations to the content with no apparent frequency or structural inconsistency. As shown in Figure 4, these subtle traces appear as inherent patterns rather than as visible traces with inconsistent edges.
ShadowTracer aims to capture the inherent characteristics and subtle traces of the generative models. For a doctored image, our objective is to learn a mapping $f_{\theta}$ that maps the tampered image to its latent disturbed pixel values, where $f_{\theta}$ represents a neural network with trainable parameters $\theta$. Our key observation is that the differences introduced by generative models in data distribution exhibit inherent patterns, and deep neural networks can attempt to reconstruct these variations. At the training stage, we generate pairs of the input image $X$ and the generative tampered image $\hat{X}$, and the manipulation trace $T$ can be calculated as their pixel-wise difference. The objective function for training can be formulated as

$$\mathcal{L}(\theta) = \left\| f_{\theta}(\hat{X}) - T \right\| \tag{1}$$

where $T = \hat{X} - X$.
Furthermore, the mapping network should satisfy two properties: detecting subtle tampering traces and being robust to various real-world image degradations. For this reason, image pairs are generated by mixing original and manipulated images and incorporating diverse degradation operations at the training stage. Specifically, given an input image $X$, we segment a portion and perform generative manipulation to obtain $\hat{X}$. Alpha blending [47] is applied to the original and manipulated images to obscure obvious tampering traces. Following this, we subject the images to degradation operations such as blurring and JPEG compression to obtain the final manipulated image. The network is trained on 64×64 pixel patches randomly sampled from the dataset, using the loss in Eq. 1.
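The construction of one training target for the tracer can be sketched on flat pixel lists as below. Two points are assumptions of this sketch rather than statements of the paper: the trace is taken as the tampered image minus the original, and the regression loss is a mean absolute error.

```python
def alpha_blend(original, generated, alpha):
    """Per-pixel blend: alpha * generated + (1 - alpha) * original.

    alpha is a soft mask in [0, 1]; values between 0 and 1 near the
    region border soften the seam between authentic and generated pixels.
    """
    return [a * g + (1.0 - a) * o for o, g, a in zip(original, generated, alpha)]


def manipulation_trace(original, tampered):
    """Trace target: tampered minus original (zero where nothing changed)."""
    return [t - o for o, t in zip(original, tampered)]


def l1_loss(pred, target):
    """Mean absolute error between a predicted and a target trace."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(target)


original = [0.2, 0.4, 0.6, 0.8]
generated = [1.0, 1.0, 1.0, 1.0]
alpha = [0.0, 0.5, 1.0, 0.0]          # only the middle pixels are edited
tampered = alpha_blend(original, generated, alpha)
trace = manipulation_trace(original, tampered)
```

Pixels with `alpha` equal to zero keep a zero trace, so a network trained on such targets is pushed to respond only where generated content was blended in.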
4.2 Frequency-Spatial Block
When degradation operations are applied, artifacts in manipulated images are difficult to perceive. To improve the local expressive ability and effectively harvest discriminative cues in manipulated images, we design a Frequency-Spatial Block (FSB) to extract forgery features in the frequency and spatial domains.
Inspired by recent work [58, 49, 37, 67], as shown in Figure 3, the FSB consists of two branches: a frequency branch and a spatial branch. In the frequency branch, the input $F_{in}$ is converted into the frequency domain using the 2D FFT. A learnable filter $W$ is multiplied with the spectrum to modulate it and capture the important frequency information. Subsequently, the inverse FFT is applied to convert the feature back to the spatial domain, resulting in the frequency-aware features $F_{fre}$. In the spatial branch, the input is processed through convolutional layers and activated using the LeakyReLU function in a separate network to enhance the expressiveness of the features and obtain refined spatial features $F_{spa}$. Then $F_{fre}$ and $F_{spa}$ are concatenated and passed through convolutional layers and the LeakyReLU activation function to obtain enhanced information, which is then combined with the original input through element-wise summation.
The total process can be formulated by:

$$\begin{aligned} F_{fre} &= \mathcal{F}^{-1}\big(W \odot \mathcal{F}(F_{in})\big), \\ F_{spa} &= \mathrm{Conv}(F_{in}), \\ F_{out} &= F_{in} + \mathrm{Conv}\big([F_{fre}, F_{spa}]\big), \end{aligned} \tag{2}$$

where $\mathcal{F}$ and $\mathcal{F}^{-1}$ denote the 2D FFT and its inverse, $\odot$ denotes the Hadamard product, $\mathrm{Conv}$ denotes convolution with LeakyReLU, and $[\cdot, \cdot]$ denotes a concatenation operation.
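A forward pass through such a block can be sketched in NumPy as follows. This is a simplified reading of the description, not the released implementation: pointwise (1×1) convolutions stand in for the convolutional layers, the real part is kept after the inverse FFT, and all weight shapes are assumptions.

```python
import numpy as np


def leaky_relu(x, slope=0.01):
    """LeakyReLU activation."""
    return np.where(x > 0, x, slope * x)


def conv1x1(x, weight):
    """Pointwise convolution: mix the channels of a (C_in, H, W) map
    with a (C_out, C_in) weight matrix."""
    return np.einsum("oc,chw->ohw", weight, x)


def frequency_spatial_block(x, freq_filter, w_spatial, w_fuse):
    """x: (C, H, W); freq_filter: (C, H, W); w_spatial: (C, C); w_fuse: (C, 2C)."""
    # Frequency branch: FFT -> learnable filter (Hadamard product) -> inverse FFT.
    spectrum = np.fft.fft2(x, axes=(-2, -1))
    f_fre = np.fft.ifft2(freq_filter * spectrum, axes=(-2, -1)).real
    # Spatial branch: convolution followed by LeakyReLU.
    f_spa = leaky_relu(conv1x1(x, w_spatial))
    # Fusion: concatenate along channels, convolve, activate, add the input back.
    fused = leaky_relu(conv1x1(np.concatenate([f_fre, f_spa], axis=0), w_fuse))
    return x + fused
```

With an all-ones filter the frequency branch reduces to the identity, which gives a convenient sanity check for the FFT round trip.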
![Refer to caption](x4.png)
4.3 Multi Windowed Anomalous Modelling Module
Image manipulation causes discrepancies at the pixel level. Genuine pixels are expected to exhibit consistency with neighboring pixels, while manipulated pixels may deviate and display anomalies. Prior works [62, 35, 28] explore modeling such local inconsistencies. To effectively capture the pixel-level inconsistency between the manipulated and real regions, we introduce the Multi Windowed Anomalous Modelling module (MWAM) to model these differences at multiple scales for fine-grained features.
As shown in Figure 3, given the input feature $F$, we calculate the difference between each pixel and its surrounding pixels within a local window in two branches by Eq. 3:

$$D_{*} = \frac{F - \mu_{*}}{\sigma + w}, \quad * \in \{avg, max\} \tag{3}$$

where $*$ denotes the average or maximum branch, $\sigma$ is the standard deviation of $F$, $w$ is a learnable non-negative weight vector of the same length as $\sigma$, and $\mu_{avg}$ and $\mu_{max}$ are calculated from the average and maximum values of the window around each pixel.
Different window sizes are selected to model the inconsistency at different scales. Then, the obtained different-scale $D_{avg}$ and $D_{max}$ are concatenated and fed into a convolutional network to obtain the anomaly maps $A_{avg}$ and $A_{max}$, which have the same size as the original input.
Additionally, the anomaly score mask $S$ of the feature $F$ is computed using Eq. 4:

$$S = \mathrm{DWConv}(F) \tag{4}$$

where $\mathrm{DWConv}$ denotes a depth-wise convolution layer.
The element-wise multiplication between the anomaly score and the anomaly maps captures the anomaly information. Next, we calculate the element-wise summation between the resulting anomaly-aware map and the input feature map to obtain an anomaly-sensitive feature map. The whole process can be described as:

$$F_{out} = F + \sum_{* \in \{avg, max\}} S \odot A_{*} \tag{5}$$
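The multi-window difference step can be pictured with the NumPy sketch below. It rests on several assumptions that the text does not fix: the window sizes, edge padding, a per-channel standard deviation, and the normalization `(F - mu) / (sigma + w)` with a small `eps` for numerical safety. The convolution that produces the anomaly maps and the score mask are not included.

```python
import numpy as np


def window_stats(feat, k):
    """Per-pixel mean and max over a k x k window (edge-padded, odd k)
    for a feature map of shape (C, H, W)."""
    pad = k // 2
    padded = np.pad(feat, ((0, 0), (pad, pad), (pad, pad)), mode="edge")
    windows = np.lib.stride_tricks.sliding_window_view(padded, (k, k), axis=(1, 2))
    return windows.mean(axis=(-2, -1)), windows.max(axis=(-2, -1))


def multi_window_differences(feat, window_sizes=(3, 5, 7), w=None, eps=1e-5):
    """Stack normalized differences to the local average and local maximum
    for every window size; output shape is (2 * len(window_sizes) * C, H, W)."""
    c = feat.shape[0]
    sigma = feat.std(axis=(1, 2), keepdims=True)        # per-channel std
    # w plays the role of the learnable non-negative weight vector.
    w = np.zeros((c, 1, 1)) if w is None else np.abs(w).reshape(c, 1, 1)
    diffs = []
    for k in window_sizes:
        mu_avg, mu_max = window_stats(feat, k)
        diffs.append((feat - mu_avg) / (sigma + w + eps))   # average branch
        diffs.append((feat - mu_max) / (sigma + w + eps))   # maximum branch
    return np.concatenate(diffs, axis=0)
```

A pixel that stands out from its neighborhood yields a large positive value in the average branch, while the maximum branch is zero at local maxima and negative elsewhere.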
| Method | Params | GFLOPS | GIM-SD Cls.Acc | GIM-SD F1 | GIM-SD AUC | GIM-GLIDE Cls.Acc | GIM-GLIDE F1 | GIM-GLIDE AUC | GIM-DDNM Cls.Acc | GIM-DDNM F1 | GIM-DDNM AUC |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ManTranet [62] | 4.0 | 1009.7 | 61.08 | 37.48 | 80.83 | 70.99 | 49.11 | 83.29 | 53.99 | 33.12 | 74.94 |
| MVSS-Net [12] | 146.9 | 160.0 | 56.12 | 23.17 | 72.03 | 61.29 | 33.12 | 74.94 | 49.15 | 14.11 | 70.09 |
| SPAN [25] | 15.4 | 30.9 | 53.15 | 35.62 | 79.28 | 60.01 | 39.46 | 81.21 | 59.15 | 32.1 | 73.55 |
| PSCC-Net [40] | 4.1 | 107.3 | 52.28 | 31.48 | 83.85 | 66.52 | 53.68 | 86.37 | 56.33 | 41.77 | 85.8 |
| ObjectFormer [58] | 14.6 | 249.6 | 59.12 | 26.82 | 85.16 | 70.12 | 40.12 | 85.22 | 54.27 | 33.12 | 86.82 |
| Trufor [19] | 67.8 | 90.1 | 67.12 | 44.13 | 84.52 | 80.19 | 59.32 | 92.96 | 63.34 | 44.52 | 87.60 |
| SegFormer [63] | 27.5 | 41.3 | 64.25 | 46.17 | 83.26 | 78.11 | 56.77 | 88.73 | 69.29 | 40.19 | 84.65 |
| GIMFormer (Ours) | 95.9 | 96.2 | 70.92 | 58.61 | 88.25 | 83.89 | 77.31 | 95.42 | 76.72 | 56.25 | 88.31 |
4.4 Loss Function
For manipulation detection, we adopt a light-weight backbone from [57] on the fourth-stage feature to calculate the final binary prediction $\hat{y}$. For manipulation localization, we utilize the MLP decoder [63] as the segmentation head to obtain a predicted mask $\hat{M}$. Given the ground-truth label $y$ and mask $M$, we train GIMFormer with the following objective function:

$$\mathcal{L} = \mathcal{L}_{cls}(\hat{y}, y) + \lambda \mathcal{L}_{seg}(\hat{M}, M) \tag{6}$$

where both $\mathcal{L}_{cls}$ and $\mathcal{L}_{seg}$ are binary cross-entropy losses, and $\lambda$ is a balancing hyperparameter.
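The combined objective can be illustrated with a small sketch in plain Python. It assumes that the balancing hyperparameter multiplies the segmentation term and that predictions are already probabilities; the name `lam` and its default value are placeholders, not values from the paper.

```python
import math


def bce(pred, target, eps=1e-7):
    """Mean binary cross-entropy over paired probabilities and 0/1 targets."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)   # clip to avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(target)


def total_loss(y_hat, y, mask_hat, mask, lam=1.0):
    """Image-level BCE plus lam times pixel-level BCE over a flattened mask."""
    return bce([y_hat], [y]) + lam * bce(mask_hat, mask)


# One manipulated image (y = 1) with a two-pixel flattened mask.
loss = total_loss(0.5, 1, [0.5, 0.5], [0, 1], lam=2.0)
```

With every probability at 0.5 each BCE term equals ln 2, so the toy loss above is 3 ln 2.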
4.5 Implementation Details.
Our approach includes two separate training steps. First, we train the ShadowTracer using a synthesized dataset of ImageNet. This training process follows a similar data generation method as described in Section 4.1. Then, we train the encoder and decoder of the model according to the two settings in GIM, as described in Section 3.4. We train our models on eight V100 GPUs with an initial learning rate (LR) of which is scheduled by the poly strategy with power 0.9 over 40 epochs. The optimizer is AdamW [41] with epsilon weight decay , and the batch size is 4 on each GPU.
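The poly learning-rate strategy mentioned above is commonly written as the base rate scaled by (1 − step / total_steps) raised to the power. A minimal sketch follows; the base learning rate used in the example is a placeholder, not the value used in the paper, and whether the schedule is stepped per iteration or per epoch is left open.

```python
def poly_lr(base_lr, step, total_steps, power=0.9):
    """Polynomial decay: base_lr * (1 - step / total_steps) ** power."""
    return base_lr * (1.0 - step / total_steps) ** power


# Placeholder base learning rate, stepped once per epoch over 40 epochs.
schedule = [poly_lr(1e-4, epoch, 40) for epoch in range(40)]
```

The schedule starts at the base rate and decays monotonically towards zero at the final step.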
| Method | GIM-SD Cls.Acc | GIM-SD F1 | GIM-SD AUC | GIM-SD(VOC) Cls.Acc | GIM-SD(VOC) F1 | GIM-SD(VOC) AUC | GIM-GLIDE Cls.Acc | GIM-GLIDE F1 | GIM-GLIDE AUC | GIM-DDNM Cls.Acc | GIM-DDNM F1 | GIM-DDNM AUC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ManTranet [62] | 73.12 | 43.18 | 80.18 | 63.18 | 27.37 | 72.74 | 58.21 | 24.53 | 74.61 | 39.52 | 16.8 | 58.31 |
| MVSS-Net [12] | 56.12 | 25.13 | 82.15 | 56.30 | 21.2 | 73.17 | 53.12 | 23.17 | 73.11 | 48.12 | 10.11 | 50.15 |
| SPAN [25] | 52.15 | 47.32 | 56.93 | 52.17 | 29.86 | 55.17 | 50.03 | 30.14 | 56.08 | 44.17 | 14.30 | 56.67 |
| PSCC-Net [40] | 58.24 | 48.36 | 87.29 | 54.79 | 39.36 | 83.80 | 51.11 | 29.60 | 79.76 | 40.52 | 4.40 | 48.52 |
| ObjectFormer [58] | 67.12 | 39.54 | 87.93 | 58.19 | 29.13 | 81.20 | 54.19 | 17.74 | 81.02 | 49.23 | 7.16 | 58.95 |
| Trufor [19] | 74.00 | 49.85 | 86.15 | 65.15 | 31.95 | 80.13 | 50.12 | 22.39 | 76.25 | 49.19 | 5.70 | 53.11 |
| SegFormer [63] | 71.93 | 53.08 | 87.11 | 60.03 | 28.17 | 80.97 | 54.11 | 21.03 | 75.12 | 50.21 | 4.70 | 51.33 |
| GIMFormer (Ours) | 78.96 | 61.75 | 90.61 | 67.19 | 50.01 | 84.01 | 67.01 | 39.27 | 81.02 | 50.23 | 19.37 | 60.12 |
4.6 Comparison with state-of-the-art methods
We compare our method with various state-of-the-art IMDL methods [62, 12, 25, 40, 19, 58] on generative manipulation detection and location. In addition, the vanilla SegFormer (MiT-B2) [63] is also compared, since our method is based on that architecture. As shown in Table 4 and Table 5, these methods are tested in various settings to verify their performance and generalization. Note that some methods are not explicitly designed for image-level detection, in which case we use the maximum of the prediction map as the detection statistic. All methods are trained and evaluated under the same implementation details.
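For methods without an image-level head, the detection statistic described above reduces to a one-line reduction over the predicted mask. In the sketch below the 0.5 decision threshold is an assumption; the text does not state how the statistic is thresholded.

```python
def image_level_score(prediction_map):
    """Detection statistic: the maximum value of the pixel-level prediction map."""
    return max(max(row) for row in prediction_map)


def is_manipulated(prediction_map, threshold=0.5):
    """Flag an image when its strongest pixel response reaches the threshold."""
    return image_level_score(prediction_map) >= threshold


clean_map = [[0.05, 0.10], [0.02, 0.08]]
tampered_map = [[0.05, 0.10], [0.92, 0.08]]
```

Taking the maximum makes the image-level decision sensitive to even a small tampered region, at the cost of reacting to a single spurious high-response pixel.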
Cross-generator generalization capability comparison. Table 5 reports the generalization performance of all the methods mentioned. The results show that GIMFormer outperforms all other methods by a significant margin in both in-domain detection ability and cross-domain generalization ability. For in-domain detection, our method efficiently catches the subtle artifacts inherent in generative tampering and accurately localizes them. Meanwhile, other methods may be confused as they attempt to learn specific content, which can make it hard for them to accurately detect and localize generative tampering patterns. For cross-domain generalization, GIMFormer achieves state-of-the-art performance. It detects manipulations produced by different generative models, as shown in the results on the GIM-GLIDE and GIM-DDNM test sets. In addition, GIMFormer works well on data generated by the same generator from a different distribution, while existing methods show an obvious performance drop, as shown in the results on the GIM-SD(VOC) test set. The qualitative results for visual comparisons are illustrated in Figure 6.
Mix-generator comprehensive capability comparison. Table 4 reports the comprehensive performance of all the methods mentioned. GIMFormer also outperforms all other methods on all the test sets, demonstrating its superior ability to identify generative tampering and its comprehensive performance. Existing methods have a low pixel-level F1 score on this benchmark, which means that they cannot accurately identify tampered areas. The qualitative results for visual comparisons are illustrated in Figure 5.
![Refer to caption](x5.png)
![Refer to caption](x6.png)
4.7 Ablation Analysis
Effectiveness of proposed module. We consider a simple baseline proposed in [63] and gradually integrate new key components. Experiments are carried out on the GIM-GLIDE testset. The quantitative results are listed in Table 7. The result shows that ShadowTracer brings significant improvements to the vanilla baseline. With the MWAM, there is an increase of 6.29% in F1 and 2% in AUC, which indicates that the differential information at multiple scales is crucial for accurate tampering localization. The use of FSB to dynamically harvest complementary frequency and spatial cues improves performance, particularly in detection. The results verify that ShadowTracer, FSB and MWAM effectively improve the performance of the baseline model.
| Variants | Cls.Acc | F1 | AUC |
|---|---|---|---|
| Baseline | 78.11 | 56.77 | 88.73 |
| +ST | 79.96 | 67.12 | 91.99 |
| +ST+MWAM | 80.19 | 73.41 | 94.01 |
| +ST+MWAM+FSB | 83.89 | 77.31 | 95.42 |
| Variants | GIM-GLIDE Cls.Acc | GIM-GLIDE F1 | GIM-GLIDE AUC | GIM-DDNM Cls.Acc | GIM-DDNM F1 | GIM-DDNM AUC |
|---|---|---|---|---|---|---|
| GIMFormer w ST(SD) | 83.1 | 78.1 | 95.4 | 77.4 | 58.8 | 89.4 |
| GIMFormer w/o ST | 80.0 | 68.9 | 91.3 | 73.1 | 51.3 | 86.1 |
Generalization of ShadowTracer. We first train ShadowTracer using data generated by Stable Diffusion and then hold its weights fixed. Following this pretraining phase, we train the backbone on both the GIM-GLIDE and GIM-DDNM training sets, with and without ShadowTracer. The detection result (Cls.Acc) and the location results (F1 and AUC) are presented in Table 7, revealing that leveraging the pretrained ShadowTracer significantly enhances performance in cross-generator IMDL tasks. Notably, on the GIM-GLIDE dataset, which has the same manipulation type, leveraging ShadowTracer leads to a 9% improvement in F1 and a 4% increase in AUC for location results. The benefits also extend to detection, with a 3% increase in accuracy. On the GIM-DDNM dataset, which has various manipulation types, there is a 7% improvement in F1, a 3% increase in AUC for location results and a 4% increase in accuracy.
5 Conclusion
We address the challenge of detecting and locating generative manipulation and provide a reliable database (GIM) for AIGC security. The proposed dataset leverages multiple mainstream generators and tampering methods to provide a variety of generative manipulation data. Additionally, we introduce GIMFormer, a novel transformer-based IMDL framework. ShadowTracer is designed to catch subtle artifacts in generative tampering. While the Frequency-Spatial Block gathers frequency and spatial information, the Multi Windowed Anomalous Modelling module captures pixel-level inconsistencies at multiple scales for fine-grained features. Extensive experiments demonstrate the superior performance of our model, which achieves SoTA results.
Limations. The GIM dataset shares classes with ImageNet and VOC, but it may not encompass future emerging objects due to the evolving variety in the real world. Fine-tuning pre-trained models on new object data can address this issue. Current research primarily centers on image manipulation. The emergence of realistic video generation models like SORA [5] presents fresh challenges as AI manipulation extends into videos. Future plans involve expanding research to address the complexities of video manipulation.
References
- [1] Amerini, I., Ballan, L., Caldelli, R., Del Bimbo, A., Serra, G.: A sift-based forensic method for copy–move attack detection and transformation recovery. IEEE transactions on information forensics and security 6(3), 1099–1110 (2011)
- [2] Asnani, V., Yin, X., Hassner, T., Liu, X.: Malp: Manipulation localization using a proactive scheme. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12343–12352 (2023)
- [3] Bird, J.J., Lotfi, A.: Cifake: Image classification and explainable identification of ai-generated synthetic images. IEEE Access (2024)
- [4] Brock, A., Donahue, J., Simonyan, K.: Large scale gan training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096 (2018)
- [5] Brooks, T., Peebles, B., Holmes, C., DePue, W., Guo, Y., Jing, L., Schnurr, D., Taylor, J., Luhman, T., Luhman, E., Ng, C., Wang, R., Ramesh, A.: Video generation models as world simulators (2024), https://openai.com/research/video-generation-models-as-world-simulators
- [6] Chai, L., Bau, D., Lim, S.N., Isola, P.: What makes fake images detectable? understanding properties that generalize. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXVI 16. pp. 103–120. Springer (2020)
- [7] Cozzolino, D., Poggi, G., Verdoliva, L.: Splicebuster: A new blind image splicing detector. In: 2015 IEEE International Workshop on Information Forensics and Security (WIFS). pp. 1–6. IEEE (2015)
- [8] Crawford, K., Paglen, T.: Excavating ai: The politics of images in machine learning training sets. Ai & Society 36(4), 1105–1116 (2021)
- [9] Dang, H., Liu, F., Stehouwer, J., Liu, X., Jain, A.K.: On the detection of digital face manipulation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern recognition. pp. 5781–5790 (2020)
- [10] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE conference on computer vision and pattern recognition. pp. 248–255. IEEE (2009)
- [11] Dhariwal, P., Nichol, A.: Diffusion models beat gans on image synthesis. Advances in neural information processing systems 34, 8780–8794 (2021)
- [12] Dong, C., Chen, X., Hu, R., Cao, J., Li, X.: Mvss-net: Multi-view multi-scale supervised networks for image manipulation detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 45(3), 3539–3553 (2022)
- [13] Dong, J., Wang, W., Tan, T.: Casia image tampering detection evaluation database. In: 2013 IEEE China summit and international conference on signal and information processing. pp. 422–426. IEEE (2013)
- [14] Everingham, M., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A.: The pascal visual object classes (voc) challenge. International journal of computer vision 88, 303–338 (2010)
- [15] Gao, S., Lin, Z., Xie, X., Zhou, P., Cheng, M.M., Yan, S.: Editanything: Empowering unparalleled flexibility in image editing and generation. In: Proceedings of the 31st ACM International Conference on Multimedia, Demo track (2023)
- [16] Gu, S., Chen, D., Bao, J., Wen, F., Zhang, B., Chen, D., Yuan, L., Guo, B.: Vector quantized diffusion model for text-to-image synthesis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10696–10706 (2022)
- [17] Guan, H., Kozak, M., Robertson, E., Lee, Y., Yates, A.N., Delgado, A., Zhou, D., Kheyrkhah, T., Smith, J., Fiscus, J.: Mfc datasets: Large-scale benchmark datasets for media forensic challenge evaluation. In: 2019 IEEE Winter Applications of Computer Vision Workshops (WACVW). pp. 63–72. IEEE (2019)
- [18] Guarnera, L., Giudice, O., Guarnera, F., Ortis, A., Puglisi, G., Paratore, A., Bui, L.M., Fontani, M., Coccomini, D.A., Caldelli, R., et al.: The face deepfake detection challenge. Journal of Imaging 8(10), 263 (2022)
- [19] Guillaro, F., Cozzolino, D., Sud, A., Dufour, N., Verdoliva, L.: Trufor: Leveraging all-round clues for trustworthy image forgery detection and localization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 20606–20615 (2023)
- [20] Guo, X., Liu, X., Ren, Z., Grosz, S., Masi, I., Liu, X.: Hierarchical fine-grained image forgery detection and localization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3155–3165 (2023)
- [21] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 770–778 (2016)
- [22] Heller, S., Rossetto, L., Schuldt, H.: The ps-battles dataset-an image collection for image manipulation detection. arXiv preprint arXiv:1804.04866 (2018)
- [23] Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. Advances in neural information processing systems 33, 6840–6851 (2020)
- [24] Hsu, Y.F., Chang, S.F.: Detecting image splicing using geometry invariants and camera characteristics consistency. In: 2006 IEEE International Conference on Multimedia and Expo. pp. 549–552. IEEE (2006)
- [25] Hu, X., Zhang, Z., Jiang, Z., Chaudhuri, S., Yang, Z., Nevatia, R.: Span: Spatial pyramid attention network for image manipulation localization. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXI 16. pp. 312–328. Springer (2020)
- [26] Huang, Y., Juefei-Xu, F., Guo, Q., Liu, Y., Pu, G.: Fakelocator: Robust localization of gan-based face manipulations. IEEE Transactions on Information Forensics and Security 17, 2657–2672 (2022)
- [27] Huh, M., Liu, A., Owens, A., Efros, A.A.: Fighting fake news: Image splice detection via learned self-consistency. In: Proceedings of the European conference on computer vision (ECCV). pp. 101–117 (2018)
- [28] Ji, K., Chen, F., Guo, X., Xu, Y., Wang, J., Chen, J.: Uncertainty-guided learning for improving image manipulation detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 22456–22465 (2023)
- [29] Jia, S., Huang, M., Zhou, Z., Ju, Y., Cai, J., Lyu, S.: Autosplice: A text-prompt manipulated image dataset for media forensics. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 893–903 (2023)
- [30] Jiang, L., Li, R., Wu, W., Qian, C., Loy, C.C.: Deeperforensics-1.0: A large-scale dataset for real-world face forgery detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 2889–2898 (2020)
- [31] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
- [32] Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., Dollar, P., Girshick, R.: Segment anything. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 4015–4026 (October 2023)
- [33] Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023)
- [34] Kniaz, V.V., Knyaz, V., Remondino, F.: The point where reality meets fantasy: Mixed adversarial generators for image splice detection. Advances in neural information processing systems 32 (2019)
- [35] Kong, C., Luo, A., Wang, S., Li, H., Rocha, A., Kot, A.C.: Pixel-inconsistency modeling for image manipulation localization. arXiv preprint arXiv:2310.00234 (2023)
- [36] Kwon, M.J., Nam, S.H., Yu, I.J., Lee, H.K., Kim, C.: Learning jpeg compression artifacts for image manipulation detection and localization. International Journal of Computer Vision 130(8), 1875–1895 (2022)
- [37] Lee-Thorp, J., Ainslie, J., Eckstein, I., Ontanon, S.: Fnet: Mixing tokens with fourier transforms. arXiv preprint arXiv:2105.03824 (2021)
- [38] Li, D., Zhu, J., Wang, M., Liu, J., Fu, X., Zha, Z.J.: Edge-aware regional message passing controller for image forgery localization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 8222–8232 (2023)
- [39] Li, L., Bao, J., Zhang, T., Yang, H., Chen, D., Wen, F., Guo, B.: Face x-ray for more general face forgery detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 5001–5010 (2020)
- [40] Liu, X., Liu, Y., Chen, J., Liu, X.: Pscc-net: Progressive spatio-channel correlation network for image manipulation detection and localization. IEEE Transactions on Circuits and Systems for Video Technology 32(11), 7505–7517 (2022)
- [41] Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
- [42] Mahfoudi, G., Tajini, B., Retraint, F., Morain-Nicolier, F., Dugelay, J.L., Marc, P.: Defacto: Image and face manipulation dataset. In: 2019 27th European Signal Processing Conference (EUSIPCO). pp. 1–5. IEEE (2019)
- [43] Ng, T.T., Chang, S.F., Sun, Q.: A data set of authentic and spliced image blocks. Columbia University, ADVENT Technical Report 4 (2004)
- [44] Nichol, A., Dhariwal, P., Ramesh, A., Shyam, P., Mishkin, P., McGrew, B., Sutskever, I., Chen, M.: Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741 (2021)
- [45] Novozamsky, A., Mahdian, B., Saic, S.: Imd2020: A large-scale annotated dataset tailored for detecting manipulated images. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision Workshops. pp. 71–80 (2020)
- [46] O’Sullivan, D., Passantino, J.: ‘verified’ twitter accounts share fake image of ‘explosion’ near pentagon, causing confusion (2023)
- [47] Porter, T., Duff, T.: Compositing digital images. In: Proceedings of the 11th annual conference on Computer graphics and interactive techniques. pp. 253–259 (1984)
- [48] Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss, C., Radford, A., Chen, M., Sutskever, I.: Zero-shot text-to-image generation. In: International Conference on Machine Learning. pp. 8821–8831. PMLR (2021)
- [49] Rao, Y., Zhao, W., Zhu, Z., Lu, J., Zhou, J.: Global filter networks for image classification. Advances in neural information processing systems 34, 980–993 (2021)
- [50] Rao, Y., Ni, J.: A deep learning approach to detection of splicing and copy-move forgeries in images. In: 2016 IEEE international workshop on information forensics and security (WIFS). pp. 1–6. IEEE (2016)
- [51] Ren, T., Liu, S., Zeng, A., Lin, J., Li, K., Cao, H., Chen, J., Huang, X., Chen, Y., Yan, F., Zeng, Z., Zhang, H., Li, F., Yang, J., Li, H., Jiang, Q., Zhang, L.: Grounded sam: Assembling open-world models for diverse visual tasks (2024)
- [52] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10684–10695 (2022)
- [53] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021)
- [54] Rossler, A., Cozzolino, D., Verdoliva, L., Riess, C., Thies, J., Nießner, M.: Faceforensics++: Learning to detect manipulated facial images. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 1–11 (2019)
- [55] Shi, J., Malik, J.: Normalized cuts and image segmentation. IEEE Transactions on pattern analysis and machine intelligence 22(8), 888–905 (2000)
- [56] Verdoliva, L., Cozzolino, D., Nagano, K.: 2022 ieee image and video processing cup synthetic image detection
- [57] Wang, J., Sun, K., Cheng, T., Jiang, B., Deng, C., Zhao, Y., Liu, D., Mu, Y., Tan, M., Wang, X., et al.: Deep high-resolution representation learning for visual recognition. IEEE transactions on pattern analysis and machine intelligence 43(10), 3349–3364 (2020)
- [58] Wang, J., Wu, Z., Chen, J., Han, X., Shrivastava, A., Lim, S.N., Jiang, Y.G.: Objectformer for image manipulation detection and localization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2364–2373 (2022)
- [59] Wang, Y., Yu, J., Zhang, J.: Zero-shot image restoration using denoising diffusion null-space model. arXiv preprint arXiv:2212.00490 (2022)
- [60] Wen, B., Zhu, Y., Subramanian, R., Ng, T.T., Shen, X., Winkler, S.: Coverage—a novel database for copy-move forgery detection. In: 2016 IEEE international conference on image processing (ICIP). pp. 161–165. IEEE (2016)
- [61] Wu, H., Zhou, J., Tian, J., Liu, J.: Robust image forgery detection over online social network shared images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13440–13449 (2022)
- [62] Wu, Y., AbdAlmageed, W., Natarajan, P.: Mantra-net: Manipulation tracing network for detection and localization of image forgeries with anomalous features. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 9543–9552 (2019)
- [63] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. Advances in Neural Information Processing Systems 34, 12077–12090 (2021)
- [64] Ying, Q., Zhou, H., Qian, Z., Li, S., Zhang, X.: Learning to immunize images for tamper localization and self-recovery. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023)
- [65] Yu, T., Feng, R., Feng, R., Liu, J., Jin, X., Zeng, W., Chen, Z.: Inpaint anything: Segment anything meets image inpainting. arXiv preprint arXiv:2304.06790 (2023)
- [66] Zampoglou, M., Papadopoulos, S., Kompatsiaris, Y.: Detecting image splicing in the wild (web). In: 2015 IEEE International Conference on Multimedia & Expo Workshops (ICMEW). pp. 1–6. IEEE (2015)
- [67] Zhang, D., Huang, F., Liu, S., Wang, X., Jin, Z.: Swinfir: Revisiting the swinir with fast fourier convolution and improved training for image super-resolution. arXiv preprint arXiv:2208.11247 (2022)
- [68] Zhang, J., Liu, H., Yang, K., Hu, X., Liu, R., Stiefelhagen, R.: Cmx: Cross-modal fusion for rgb-x semantic segmentation with transformers. IEEE Transactions on Intelligent Transportation Systems (2023)
- [69] Zhang, K., Zuo, W., Chen, Y., Meng, D., Zhang, L.: Beyond a gaussian denoiser: Residual learning of deep cnn for image denoising. IEEE transactions on image processing 26(7), 3142–3155 (2017)
- [70] Zhou, P., Han, X., Morariu, V.I., Davis, L.S.: Learning rich features for image manipulation detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 1053–1061 (2018)
- [71] Zhu, M., Chen, H., Yan, Q., Huang, X., Lin, G., Li, W., Tu, Z., Hu, H., Hu, J., Wang, Y.: Genimage: A million-scale benchmark for detecting ai-generated image. arXiv preprint arXiv:2306.08571 (2023)
Supplementary Document
In this supplementary document, we include further details of our work: (1) the details of the proposed GIM dataset and a comparison with related work (Sec. 6); (2) the details of the architecture and implementation (Sec. 7); (3) additional ablation analysis of GIMFormer (Sec. 8); (4) more visualizations on the GIM dataset (Sec. 9); (5) the societal impact of our work (Sec. 10); (6) the ethics statement of our work (Sec. 11).
6 GIM Dataset
6.1 GIM Dataset Configuration
The specific quantities and details of the dataset are presented in Table 9. GIM contains a total of 1,140k generative manipulated images with their corresponding original images. Moreover, the dataset can be split into subsets. GIM-SD-all uses the training set of ImageNet-1k [10] and the Stable Diffusion [53] generator for image manipulation, and is dedicated to manipulation research. After analyzing the data scale, GIM-SD, GIM-GLIDE and GIM-DDNM each select a random set of 100 classes from the ImageNet-1k training set, together with the entire test data, to generate their respective training and test sets for benchmarking image manipulation detection and location (IMDL) methods. GIM-SD (VOC) is a cross-data-distribution test set constructed from the test set of the PASCAL VOC [14] dataset using Stable Diffusion.
Post-processing | Parameter |
JPEG compression | 75, 80, 90 |
Gaussian blur | 3, 5 |
Downsampling | 0.5, 0.67 |
GIM subset | Manipulation | Training set | Test set |
GIM-SD-all | reprint | 1890K | - |
GIM-SD | reprint | 180K | 70K |
GIM-GLIDE | reprint | 140K | 70K |
GIM-DDNM | removal | 125K | 50K |
GIM-SD(VOC) | reprint | - | 4.5K |
To simulate real-world scenarios and comprehensively evaluate IMDL methods, GIM incorporates three degradation methods (JPEG compression, Gaussian blur, and downsampling). Table 8 lists the candidate parameters for each degradation, from which one is chosen at random.
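As a rough illustration of the random choice of degradation parameters, the sketch below draws one degradation and one of its listed values. The interpretation of the numbers (JPEG quality factor, blur kernel size, resize scale factor) and the uniform sampling are assumptions; the post-processing table gives only the values, and `sample_degradation` is a hypothetical helper.

```python
import random

# Parameter values from the post-processing table. Reading them as JPEG
# quality, blur kernel size and resize scale factor is an assumption.
DEGRADATIONS = {
    "jpeg_compression": (75, 80, 90),
    "gaussian_blur": (3, 5),
    "downsampling": (0.5, 0.67),
}


def sample_degradation(rng):
    """Pick one degradation and one of its listed parameters uniformly at random."""
    name = rng.choice(sorted(DEGRADATIONS))
    return name, rng.choice(DEGRADATIONS[name])


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        print(sample_degradation(rng))
```

Applying the sampled degradation to pixels is left to whichever image library is in use; only the parameter selection is shown here.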
The GIM data format consists of JPG images paired with PNG labels. Filenames ending with "_f" indicate tampered images. The masks segmented by SAM [33] serve as ground truth for manipulation detection, containing two labels denoted by 0 and 255, representing the original and manipulated categories, respectively.
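The naming and label conventions described above could be handled along the following lines. This is a minimal sketch: the pairing rule in `label_path` (same stem and folder, `.png` extension) is an assumption, and all three helper names are hypothetical.

```python
from pathlib import Path


def is_tampered(image_path):
    """True when the file stem ends with '_f', the marker for tampered images."""
    return Path(image_path).stem.endswith("_f")


def label_path(image_path):
    """Assumed pairing rule: the PNG label shares the JPG's folder and stem."""
    return Path(image_path).with_suffix(".png")


def mask_to_binary(mask_rows):
    """Map a 0/255 mask (nested lists of ints) to 0/1, rejecting other values."""
    binary = []
    for row in mask_rows:
        if any(value not in (0, 255) for value in row):
            raise ValueError("mask must contain only 0 and 255")
        binary.append([1 if value == 255 else 0 for value in row])
    return binary
```

Converting 255 to 1 gives a mask that can be fed directly to pixel-level metrics such as F1.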
6.2 Dataset Comparison
GIM focuses on generative local manipulation in natural images and provides manipulated-region labels specifically for image manipulation detection and localization tasks. This distinguishes GIM from existing datasets that focus on faces or are limited to detection (classification). Table 10 compares several local image manipulation datasets. Existing image forensic datasets are primarily constructed with traditional manipulation techniques, commonly applied manually or at random. The AutoSplice [29], HiFi-IFDL [20], and CoCoGlide [19] datasets use generative algorithms for image manipulation. However, they rely on one type of manipulation and suffer from a small data volume.
In contrast, GIM achieves image manipulation by employing generative methods for localized image alteration. It encompasses a variety of generators and manipulation methods. Simultaneously, it stands as a large-scale comprehensive dataset for the IMDL community.
Dataset | # Authentic Images | # Forged Images | Handcrafted: Splicing | Handcrafted: Copy-move | Handcrafted: Removal | AIGC: Removal | AIGC: Generation |
Columbia Gray [43] | 933 | 912 | ✓ | ✗ | ✓ | ✗ | ✗ |
Columbia Color [24] | 183 | 180 | ✓ | ✗ | ✓ | ✗ | ✗ |
MICC-F2000 [1] | 1,300 | 700 | ✓ | ✗ | ✓ | ✗ | ✗ |
VIPP Synth. [1] | 4,800 | 4,800 | ✓ | ✗ | ✓ | ✗ | ✗ |
CASIA V1.0 [13] | 800 | 921 | ✓ | ✓ | ✓ | ✗ | ✗ |
CASIA V2.0 [13] | 7,200 | 5,123 | ✓ | ✓ | ✓ | ✗ | ✗ |
Wild Web [66] | 90 | 9,657 | ✓ | ✓ | ✓ | ✗ | ✗ |
NC2016 [17] | 560 | 564 | ✓ | ✓ | ✓ | ✗ | ✗ |
NC2017 [17] | 2,667 | 1,410 | ✓ | ✓ | ✓ | ✗ | ✗ |
MFC2018 [17] | 14,156 | 3,265 | ✓ | ✓ | ✓ | ✗ | ✗ |
MFC2019 [17] | 10,279 | 5,750 | ✓ | ✓ | ✓ | ✗ | ✗ |
PS-Battles [22] | 11,142 | 102,028 | ✓ | ✓ | ✓ | ✗ | ✗ |
DEFACTO [42] | 0 | 229,000 | ✓ | ✓ | ✓ | ✗ | ✗ |
IMD2020 [45] | 35,000 | 35,000 | ✓ | ✓ | ✓ | ✗ | ✗ |
SP COCO [36] | 0 | 200,000 | ✓ | ✗ | ✗ | ✗ | ✗ |
CM COCO [36] | 0 | 200,000 | ✓ | ✗ | ✗ | ✗ | ✗ |
CM RAISE [36] | 0 | 200,000 | ✓ | ✓ | ✗ | ✗ | ✗ |
HiFi-IFDL[20] | - | 1,000,00* | ✗ | ✓ | ✗ | ✓ | ✗ |
CoCoGlide [19] | 0 | 512 | ✗ | ✗ | ✗ | ✓ | ✗ |
AutoSplice [29] | 3,621 | 2,273 | ✗ | ✗ | ✗ | ✗ | ✓ |
GIM | 1,140,000 | 1,140,000 | ✗ | ✗ | ✗ | ✓ | ✓ |
7 Implementation Details
7.1 Architecture
The GIMFormer backbone has an encoder-decoder architecture. The encoder is a dual-branch, four-stage encoder. For the input RGB image, ShadowTracer extracts a learned manipulation trace map of the same resolution as the image. ShadowTracer adopts the architecture of [69] with 15 trainable layers, 3 input channels, and 1 output channel. Both the RGB image and the trace map are then fed into the parallel network, where the four-stage structure extracts pyramidal features. At each stage, the RGB branch is first processed by the Frequency-Spatial Block (FSB), and then both branches are processed by Transformer blocks [63] and the Multi-Window Anomalous Modelling (MWAM) module. The learnable parameters within the FSB have the same resolution as the input features. The window sizes within MWAM at the four stages are {3, 7, 9}, {7, 11, 15}, {9, 17, 25}, and {11, 21, 31}, respectively.
The Transformer blocks are based on the Mix Transformer encoder B2 (MiT-B2), proposed for semantic segmentation, and are pretrained on ImageNet. The Mix Transformer encoder uses self-attention and channel-wise operations, prioritizing spatial convolutions over positional encodings. To integrate information from the two branches, the cross-modal Feature Rectification Module (FRM) and Feature Fusion Module (FFM) [68] are used.
For location, we employ the All-MLP decoder proposed in [63], a lightweight architecture formed only by 1×1 convolution layers and bilinear up-samplers. For detection, we adopt the tail-end section of the lightweight backbone in [57], which applies convolutional, batch normalization, and activation layers to extract features from the input. These features are then pooled and passed through fully connected layers to generate the classification predictions.
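To give a feel for what per-stage window sizes mean in practice, the sketch below computes, for each window size, the deviation of every pixel from the mean of its surrounding window. This is a generic local-deviation feature written in NumPy under stated assumptions; it is not the MWAM module itself, whose exact computation is defined in the main text, and the function names are hypothetical.

```python
import numpy as np

# Window sizes per encoder stage, as listed in the text above.
STAGE_WINDOWS = [(3, 7, 9), (7, 11, 15), (9, 17, 25), (11, 21, 31)]


def box_mean(x, k):
    """Mean of a 2-D array over a k x k window centred on each pixel.

    k is assumed odd; borders use edge-replicated padding (an assumption).
    """
    r = k // 2
    p = np.pad(x.astype(np.float64), r, mode="edge")
    s = np.zeros((p.shape[0] + 1, p.shape[1] + 1))
    s[1:, 1:] = p.cumsum(axis=0).cumsum(axis=1)  # integral image
    h, w = x.shape
    total = s[k:k + h, k:k + w] - s[:h, k:k + w] - s[k:k + h, :w] + s[:h, :w]
    return total / (k * k)


def multi_window_deviation(x, windows):
    """Stack x - local_mean(x) for each window size; shape (len(windows), H, W)."""
    return np.stack([x - box_mean(x, k) for k in windows])


if __name__ == "__main__":
    feature = np.zeros((16, 16))
    feature[8, 8] = 1.0  # a single inconsistent pixel
    print(multi_window_deviation(feature, STAGE_WINDOWS[0])[:, 8, 8])
```

Small windows respond sharply to isolated pixel-level inconsistencies, while larger windows compare a pixel against broader context, which is the intuition behind using several sizes per stage.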
7.2 Training Process
We conduct our experiments with PyTorch 1.7.0. All models are trained on a node with 8 or 4 V100 GPUs.
To train the ShadowTracer network, we randomly select 20,000 authentic images from ImageNet and generate the corresponding manipulated images. For comprehensive performance verification, all three generators (Stable Diffusion [52], GLIDE [44], DDNM [59]) are used, whereas for generalization verification only Stable Diffusion is employed. Throughout training, the alpha blending parameter is sampled randomly between 0.5 and 1.0 to produce blended images, which are subjected to three degradation types (JPEG compression, downsampling, Gaussian blur). The network is trained on 40×40 pixel patches randomly sampled from the dataset. Training runs for roughly 300,000 iterations with a batch size of 64, using the Adam [31] optimizer with an initial learning rate of 0.001.
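The blending and patch-sampling steps described above might look roughly as follows. This is a sketch under assumptions: the blend formula (mask-weighted linear interpolation between the authentic and manipulated images) is a standard form of alpha blending and is not confirmed to be the exact one used, and `alpha_blend` and `random_patch` are hypothetical names.

```python
import numpy as np


def alpha_blend(authentic, manipulated, mask, alpha):
    """Blend manipulated pixels into the authentic image inside the mask.

    authentic, manipulated: float arrays of shape (H, W, C)
    mask: array of shape (H, W), 1 inside the tampered region and 0 elsewhere
    alpha: blending strength; the text samples it between 0.5 and 1.0
    """
    weight = alpha * mask[..., None].astype(np.float64)
    return weight * manipulated + (1.0 - weight) * authentic


def random_patch(image, size, rng):
    """Crop a random size x size patch (the text uses 40 x 40 patches)."""
    h, w = image.shape[:2]
    top = int(rng.integers(0, h - size + 1))
    left = int(rng.integers(0, w - size + 1))
    return image[top:top + size, left:left + size]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    real = rng.random((64, 64, 3))
    fake = rng.random((64, 64, 3))
    mask = np.zeros((64, 64))
    mask[16:48, 16:48] = 1
    blended = alpha_blend(real, fake, mask, rng.uniform(0.5, 1.0))
    print(random_patch(blended, 40, rng).shape)
```

With alpha below 1, part of the authentic content remains inside the tampered region, so the training traces vary in strength.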
To train the GIMFormer backbone network, the input image is cropped to 512 × 512. We train our models for 40 epochs on the GIM dataset with a batch size of 4 per GPU. The images are augmented by random resizing with a ratio of 0.5–2.0 and random horizontal flipping.
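The augmentation described above has to be applied identically to an image and its mask. The NumPy sketch below shows one way to do that; the order of operations, the nearest-neighbour resize, and the zero padding for images smaller than the crop are all assumptions, and `augment` is a hypothetical helper rather than the training code.

```python
import numpy as np


def resize_nearest(arr, scale):
    """Nearest-neighbour resize of an (H, W, ...) array by a scale factor."""
    h, w = arr.shape[:2]
    new_h = max(1, int(round(float(h * scale))))
    new_w = max(1, int(round(float(w * scale))))
    rows = np.minimum((np.arange(new_h) / scale).astype(int), h - 1)
    cols = np.minimum((np.arange(new_w) / scale).astype(int), w - 1)
    return arr[rows][:, cols]


def augment(image, mask, rng, crop=512):
    """Random resize (ratio 0.5-2.0), random horizontal flip, then a random crop.

    image: (H, W, C) array; mask: (H, W) array. Both get the same transform.
    """
    scale = float(rng.uniform(0.5, 2.0))
    image, mask = resize_nearest(image, scale), resize_nearest(mask, scale)
    if rng.random() < 0.5:
        image, mask = image[:, ::-1], mask[:, ::-1]
    # Zero-pad when the resized image is smaller than the crop (an assumption).
    pad_h = max(0, crop - image.shape[0])
    pad_w = max(0, crop - image.shape[1])
    image = np.pad(image, ((0, pad_h), (0, pad_w), (0, 0)))
    mask = np.pad(mask, ((0, pad_h), (0, pad_w)))
    top = int(rng.integers(0, image.shape[0] - crop + 1))
    left = int(rng.integers(0, image.shape[1] - crop + 1))
    return (image[top:top + crop, left:left + crop],
            mask[top:top + crop, left:left + crop])
```

Nearest-neighbour interpolation keeps the mask strictly binary after resizing, which is why it is used for both arrays in this sketch.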
8 Additional Ablation Analysis
The robustness of GIMFormer. To evaluate the robustness of GIMFormer, we sample 5000 images from the original test sets of the three generators, without degradations. The model is trained on the mixed-generator dataset. The experimental results are presented in Table 11. GIMFormer is robust against the various distortion techniques: its scores stay between 58.7 and 61.7 across all settings, indicating stable performance under image distortions.
GIMFormer | 61.7 | 60.3 | 60.1 | 59.9 | 58.7 | 58.7 | 61.0 | 60.8 |
The effect of the number of windows (N). We conduct a set of ablation experiments to study the performance of the MWAM module. To ensure fair comparison, the experiments differ from each other only in the window setting. Experiments are carried out on the GIM-SD training and test sets. As shown in Table 12, tampering location performance tends to increase with the number of windows, while the impact of window size remains relatively minor. For the sake of efficiency, we do not analyze larger numbers of windows. The results indicate that a favorable balance between accuracy and efficiency is achieved when N = 3.
MWAM Variants | Cls.Acc | F1 | AUC |
w/o windows | 72.13 | 58.12 | 88.81 |
{} | 74.33 | 58.34 | 89.65 |
{} | 76.61 | 60.03 | 89.92 |
{} | 78.96 | 61.75 | 90.61 |
{} | 77.33 | 61.17 | 90.11 |
{} | 77.95 | 60.77 | 90.03 |
{} | 77.83 | 59.13 | 89.53 |
{} | 78.11 | 61.15 | 90.71 |
9 More Visualizations on GIM
10 Societal Impact
Our research has a positive societal impact by addressing the challenge of detecting and locating generative-based manipulations. We introduce a reliable database, GIM, aimed at enhancing the security of AI-generated content (AIGC). Training image manipulation detection and localization (IMDL) methods on GIM allows them to generalize across various scenarios. Consequently, our algorithm GIMFormer can help foster public trust in media content. GIM and GIMFormer are beneficial for digital media forensics, especially for generative manipulation with real-world degradations.
11 Ethics Statement
Our GIM is based on ImageNet and VOC. No additional personally identifiable information or sensitive personally identifiable information is introduced during the production of fake images in the GIM dataset. During the dataset production, we do not introduce extra information containing exacerbated bias against people of a certain gender, race, sexuality, or who have other protected characteristics. The ethical issues in the ImageNet and VOC datasets have been discussed in previous works. Crawford et al. [8] discuss issues with ImageNet. The first issue is the political nature of all taxonomies or classification systems, where terms like "male" and "female" are considered "natural," while "hermaphrodite" is offensively placed within the branch of Person > Sensualist > Bisexual alongside "pseudohermaphrodite" and "switch hitter" categories. The second issue concerns offensive images of real people, while the third is the use of people’s photos without their consent by ImageNet creators.
![Refer to caption](x7.png)
![Refer to caption](x8.png)
![Refer to caption](x9.png)