Exploring Saliency Bias in Manipulation Detection

Abstract

The social media-fuelled explosion of fake news and misinformation supported by tampered images has led to growth in the development of models and datasets for image manipulation detection. However, existing detection methods mostly treat media objects in isolation, without considering the impact of specific manipulations on viewer perception. Forensic datasets are usually analyzed based on the manipulation operations and corresponding pixel-based masks, but not on the semantics of the manipulation, i.e., type of scene, objects, and viewers’ attention to scene content. The semantics of the manipulation play an important role in spreading misinformation through manipulated images. In an attempt to encourage further development of semantic-aware forensic approaches to understand visual misinformation, we propose a framework to analyze the trends of visual and semantic saliency in popular image manipulation datasets and their impact on detection.

Index Terms— Media Forensics, Image Manipulation, Dataset Analysis, Image Saliency, Semantic Understanding

1 Introduction

The increase in quality, quantity and diversity of manipulated media has led to an increased reliance on automated visual forensics, as human analysis has limited applicability. However, not all manipulated images are equally misinforming. For efficient detection at a large scale, forensic techniques should focus more on images that have the potential for more impact on viewer perception. In this work, we analyze image forensics datasets based on perceptual saliency, i.e., the amount of attention paid by human viewers to manipulated content within an image. Such an analysis can be used to reduce the scale and types of images that require urgent attention and dedicated detection resources. We hypothesize that image manipulations that catch human viewers’ attention are more valuable to detect and analyze from the perspective of misinformation. Gauging perceptual understanding in an image-centric manner that goes beyond just labeling images as real or fake is essential for fighting fake news and its effects. In this paper, we provide the following:

• A human study determining the impact of saliency on the human ability to detect and localize image manipulations, together with an analysis of the range of visually and semantically salient splicing-based manipulations present in widely used benchmark datasets for image manipulation detection.

• Experiments showing that synthetic manipulation of the saliency of image contents also leads to trends in detection performance, providing further evidence that more salient manipulated regions are easier to detect.

• An evaluation of bias in the performance of manipulation detection networks based on the visual saliency of the manipulated region and the semantic change offered by the manipulation. We propose a novel method of calculating the semantic relevance of the manipulation using CLIP [1], a vision-language foundation model.

Fig. 1: Saliency of the manipulation is an important factor in determining if a human or machine will consider an image to be manipulated.

2 Related Work

Human Perception and Image Manipulation Detection. While humans can efficiently associate semantic understanding with coherent visual scene elements [2, 3], early work analyzing human manipulation detection ability, such as [4], showed that humans consistently fail to detect physical impossibilities in an image, such as inconsistent shadows, perspectives, and reflections. More recently, Nightingale et al. [5] and Schetinger et al. [6] showed that humans struggle to distinguish real images from manipulated ones. In the respective studies, participants correctly classified only 66% [5] and 58% [6] of the presented images as real or fake. The former also showed that viewers could not locate the manipulation in the image 55% of the time.

However, existing literature has mostly focused on evaluating the manipulation detection and localization ability of humans without exploring the underlying explanations. These studies do not investigate how the relevance of the manipulation or viewer attention to specific regions impacts the ability to detect it.

Saliency is a property that characterizes where people choose to focus their mental processing power. Visual saliency plays a key role in biasing human attention [7] and processing visual stimuli [8]. Perception scientists have studied how salient aspects of an image impact human performance on a range of perceptual tasks, such as user engagement [9], category learning [10], and distinguishing between real and fake images [5, 11].

Rensink et al.’s seminal work, “To See or Not To See” [12], studied the human ability to detect changes in scenes that belonged to one of two saliency-related categories, central interest and marginal interest. In the study, participants would swap between two images, one original and one manipulated, until they found the difference between the two. The authors concluded that visual perception of change in an object occurs only when that object is given focused attention. In this paper, we perform a human study that investigates how the saliency of manipulations in real-world images affects the users’ ability to detect manipulations. Unlike the study in [12], to emulate a more realistic online scenario, we show manipulated images without an original reference image and ask the participants to annotate the regions they pay attention to and those they think were manipulated.

Learning-based Manipulation Detection Models. With the advent of deep learning, data-driven solutions for manipulation detection were developed. In recent years, Convolutional Neural Networks (CNNs) have been applied to the manipulation detection task [13, 14, 15, 16]. These networks learn patterns that distinguish pristine and manipulated images from human-labeled data. Most notably, the Local Anomaly Detection Network (LADN) architecture proposed in ManTra-Net [13] was designed to mimic the human decision-making process. Similarly, the authors of PSCC-Net [14] were inspired by how people solve tasks, moving from coarse to fine image analysis to detect manipulations. Thus, certain biases originating from data curation can influence learning-based models trained on this data. Other models have explored attention networks for context-aware manipulation detection [17] and fine-grained hierarchical manipulation classification [18]. This work investigates biases observed in image manipulation detection models and compares them with those observed in humans.

3 Human Saliency

To understand how the saliency of manipulations is related to people’s ability to accurately spot them, i.e., whether people are better at detecting manipulations when they lie within the salient region of the image, we set up a user study. Identifying saliency bias within human manipulation detection can support the development of tools that help humans spot less salient manipulations.

In the study, participants looked at 130 spliced images from the Korus’ Realistic Tampering (RT) dataset [19, 20] containing well-crafted manipulations proven to be challenging for manipulation detection and localization networks [21]. For each image, the participants completed 2 tasks. The first task was to place bounding boxes on objects or regions that looked the most attention-grabbing. This question identified the salient regions of the image. For the second task, the participants placed bounding boxes on objects or regions they believed to be manipulated, if any. The order of these questions is important. We wanted to first understand what a human focuses on (saliency), and then ask specifically if they find something manipulated in the image. Since saliency is a more general visual concept, the saliency question being presented first does not shift the participants’ opinion of manipulation because saliency can be assessed from any image regardless of whether it is manipulated. Additionally, we chose to ask these questions to the same participants instead of conducting two studies for collecting the two types of responses. There can be variations in the human detection ability and visual reasoning. Recording both in the same session maintains consistency between saliency and manipulation data.

The study was created by the authors (http://pnz.aca.mybluehost.me), and participants were recruited using Prolific (https://www.prolific.com), a crowdsourcing platform that provides highly qualified and vetted participants [22]. Each image was reviewed 5 times, and 650 responses were recorded in total from 65 individuals. The participants were compensated $2.00 per survey, each consisting of 10 images. The bounding boxes recorded from all the participants’ responses were combined to create (i) a human saliency map 1(c) and (ii) the corresponding manipulation prediction mask 1(d). The final masks for each image contain a higher weight, or confidence, for pixel locations where boxes provided by multiple participants overlapped. Both the human saliency and manipulation prediction masks are then compared to their respective ground-truth pixel-wise manipulation mask 1(b). This comparison reveals how salient a manipulation is to the participants and how well they could localize the manipulation (which is also used to determine detection performance, as correct localization implies correct detection). To obtain the saliency score of each image manipulation, the saliency mask was compared with the ground truth using pixel-wise Mean Recall, which considers both accuracy and group agreement.
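
The aggregation and scoring steps described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the box format `(x0, y0, x1, y1)`, the function names, and the reading of pixel-wise Mean Recall as the average participant coverage over ground-truth manipulated pixels are all assumptions made for the example.

```python
import numpy as np

def boxes_to_mask(boxes, height, width):
    """Rasterize one participant's boxes (x0, y0, x1, y1) into a binary mask."""
    mask = np.zeros((height, width), dtype=bool)
    for x0, y0, x1, y1 in boxes:
        mask[y0:y1, x0:x1] = True
    return mask

def aggregate_masks(per_participant_boxes, height, width):
    """Fraction of participants whose boxes cover each pixel, in [0, 1]."""
    masks = [boxes_to_mask(b, height, width) for b in per_participant_boxes]
    return np.mean(masks, axis=0)

def mean_recall(weight_map, gt_mask):
    """Average weight over ground-truth manipulated pixels.

    Under the assumed definition this equals the mean of each
    participant's pixel-wise recall, so it rewards both accuracy
    and agreement between participants.
    """
    gt = np.asarray(gt_mask, dtype=bool)
    if not gt.any():
        raise ValueError("ground-truth mask has no manipulated pixels")
    return float(weight_map[gt].mean())

# Toy 4x4 image: the manipulation occupies the top-left 2x2 block.
gt = np.zeros((4, 4), dtype=bool)
gt[0:2, 0:2] = True
weights = aggregate_masks([[(0, 0, 2, 2)], [(1, 1, 3, 3)]], 4, 4)
score = mean_recall(weights, gt)  # participant recalls are 1.0 and 0.25
```

With this reading, a manipulation that every participant boxes completely scores 1, while one that only some participants partially cover scores proportionally lower.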

The images used in the study were then divided into five groups depending on the saliency score of their manipulations. We obtained a fairly even distribution of images in the five groups, implying that the RT dataset contains splicing manipulations evenly spread across different levels of saliency. To understand whether human performance worsens when less salient image regions are manipulated, we computed the detection performance of human participants for the images in each group. For each image within a group, the manipulation prediction mask was compared to the ground truth manipulation mask, and the Area under the Receiver Operating Characteristic curve (AuROC) for pixel-wise manipulation classification was computed. The average AuROC for each level of saliency is reported in Fig. 3 and shows a correlation between the saliency of the manipulations and how well people can localize them. The detection performance increases consistently through the groups, from around $0.62$ in the first group to $0.89$ in the final group, showing that the saliency of the manipulations can bias people’s ability to properly localize manipulations.
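
The per-group evaluation can be sketched as below. The rank-based AuROC is the standard Mann-Whitney formulation; the equal-width grouping of saliency scores and all function names are assumptions for illustration, since the exact group boundaries used in the human study are not stated here.

```python
from collections import defaultdict

def auroc(scores, labels):
    """Rank-based AuROC (Mann-Whitney U); tied scores share an average rank."""
    pairs = sorted(zip(scores, labels))
    n_pos = sum(1 for _, y in pairs if y)
    n_neg = len(pairs) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need both manipulated and pristine pixels")
    rank_sum_pos = 0.0
    i = 0
    while i < len(pairs):
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        rank_sum_pos += avg_rank * sum(1 for k in range(i, j) if pairs[k][1])
        i = j
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)

def saliency_group(score, n_groups=5):
    """Assumed equal-width grouping of a saliency score in [0, 1]."""
    return min(int(score * n_groups), n_groups - 1)

def mean_auroc_per_group(images):
    """images: iterable of (saliency_score, pixel_scores, pixel_labels)."""
    per_group = defaultdict(list)
    for saliency, scores, labels in images:
        per_group[saliency_group(saliency)].append(auroc(scores, labels))
    return {g: sum(v) / len(v) for g, v in sorted(per_group.items())}

# Two toy "images": flattened prediction masks with ground-truth labels.
result = mean_auroc_per_group([
    (0.1, [0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]),
    (0.9, [0.2, 0.9], [0, 1]),
])
```

In practice a library routine such as scikit-learn's `roc_auc_score` would likely replace the hand-written `auroc`; it is spelled out here only to keep the sketch self-contained.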

4 Machine Experiments

Results from the human study lead us to formulate two questions: first, do manipulation detection networks reflect similar biases as seen in humans, and are the trends similar in other image manipulation datasets? And second, will increasing the saliency of these manipulations correlate to improved human detection performance? Answering the latter also further verifies if saliency was a contributor to the performance trends over other variables such as content in the image, quality, etc. This section describes the experiments conducted to answer the two research questions.

Datasets Used. To enable comparison between human performance and that of detection algorithms, the first dataset is the same subset of 130 images from the Korus’ RT dataset used in the human study (Sec. 3).

We use two other forgery datasets, which enable larger-scale automated analyses. The 2018 Media Forensics Challenge dataset (MFC18) [23] was released by the American National Institute of Standards and Technology (NIST) and contains more forgery examples than the RT dataset. It contains multiple manipulations on the same image, which is a typical technique used by forgers to improve realism and reduce detectability. Due to input resolution constraints, we select 1127 images from MFC18 with a sum of dimensions up to 4K pixels and at least one manipulation that can change the semantic meaning of the image (e.g., copy-move, splicing, and inpainting). The second dataset, IMD2020 [24], contains 2007 images sourced from the r/photoshopbattles subreddit. This crowdsourced database provides the benefit of massive variance in image acquisition conditions and content. However, as the main goal of the community is to create obvious, funny, and satirical manipulations, the forgeries may not be well disguised.

Experiment 1. Dataset Visual Saliency Bias Estimation. The process of collecting human annotations cannot be extrapolated for every dataset, as it is time-consuming and expensive. In order to get robust saliency predictions for the larger datasets, we utilize two object-based saliency prediction networks, U$^{2}$-Net [25] and R$^{3}$Net [26]. Both networks generate a saliency prediction within the range $[0,1]$ for each pixel in a given image. Averaging the two saliency maps helps build robustness and accommodate scenarios where a single network fails. Since saliency is subjective, it is important to consider multiple opinions when deciding the saliency, as with human responses. The combined saliency map from U$^{2}$-Net and R$^{3}$Net is compared to the ground truth manipulated mask using Mean Recall. Based on the mean recall score, images are placed into one of five saliency groups.
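
A minimal sketch of this pipeline is given below, assuming the two saliency maps have already been produced by the networks and resized to the image resolution. The function names are hypothetical, the 0.2-wide group edges follow the partitions listed in Table 1, and treating Mean Recall as the mean combined saliency over manipulated pixels is our assumption rather than a definition taken from the text.

```python
import numpy as np

def combined_saliency(map_a, map_b):
    """Pixel-wise average of two saliency maps with values in [0, 1]."""
    a = np.asarray(map_a, dtype=float)
    b = np.asarray(map_b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("saliency maps must share a shape")
    return (a + b) / 2.0

def manipulation_saliency(saliency_map, gt_mask):
    """Mean saliency over manipulated pixels (assumed reading of Mean Recall)."""
    gt = np.asarray(gt_mask, dtype=bool)
    if not gt.any():
        raise ValueError("ground-truth mask has no manipulated pixels")
    return float(np.asarray(saliency_map)[gt].mean())

def saliency_group_index(score):
    """Index 0..4 of the saliency group, using edges at 0.2, 0.4, 0.6, 0.8."""
    edges = [0.2, 0.4, 0.6, 0.8]
    return int(np.searchsorted(edges, score, side="right"))

# Toy 2x2 example: the stand-in network outputs disagree on some pixels.
map_u2net = [[1.0, 0.0], [0.5, 0.5]]
map_r3net = [[0.5, 0.0], [0.5, 1.0]]
gt_mask = [[1, 0], [0, 1]]

fused = combined_saliency(map_u2net, map_r3net)
score = manipulation_saliency(fused, gt_mask)
group = saliency_group_index(score)
```

Averaging before scoring means a region that only one network marks as salient still contributes half its weight, which mirrors the "multiple opinions" motivation given above.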

The distribution of saliency of manipulations for the RT [20], MFC18 [23] and IMD2020 [24] datasets is shown in Fig. 4. Compared to the saliency distribution generated by the human study, the RT dataset is less uniformly distributed. However, it is still more uniformly distributed than MFC18 and IMD2020. After splitting images into the proper saliency groups, we calculate the average AuROC (detection performance) for each group across popular manipulation detection and localization models: PSCC-Net [14], OSN [16], BusterNet [15], and ManTra-Net [13]. AuROC is commonly used to evaluate manipulation detection models and allows for comparison of model performance across datasets. Evaluating the detection performance for images in each saliency group individually can help determine how strongly visual saliency affects manipulation detection.

Experiment 2. Impact of Varying Saliency on Manipulation Detection. As observed so far, saliency of the manipulated region plays a role in a model’s ability to detect the manipulation. If saliency bias exists within manipulation detection networks, the average detection performance for each saliency group should increase as saliency of manipulations in images of the group increases. Essentially, when a manipulation becomes more salient, models should detect the manipulation more accurately. Saliency of difficult-to-detect manipulated regions in a given image can be increased using a saliency-guided image manipulation network [27, 28, 29]. Such networks attempt to modify color, contrast, and saturation, while avoiding changes that can alter semantic interpretation of the image. Evaluating average detection performance of saliency enhanced manipulations across multiple networks can provide additional evidence that saliency is a biasing factor for manipulation detection.

To understand whether the detectability of manipulated images improves upon artificially guiding attention towards relevant areas, we employ the GAN-based method proposed in [28]. Given two inputs, an RGB image and a binary attention mask, the Saliency-Guidance Image Manipulation (SaGIM) network [28] attempts to modify the RGB image such that the regions highlighted in the attention mask become more salient. Average detection performance for each manipulation detection network is calculated for the saliency-enhanced variants of the images in each manipulation saliency group. The performance before and after saliency-guided image manipulation is presented in Fig. 7.

Additionally, another human study was conducted with the saliency-enhanced images, following the same protocol as described in Sec. 3, to evaluate whether saliency adjustment helps people detect manipulations. Each of the 130 images received responses from 5 participants, and the overall detection performance is shown in Fig. 7.

Table 1: Average detection performance per saliency group for various manipulation detection networks. The results show a decrease in performance variation between low- and high-saliency manipulations after saliency enhancement.

			Original Saliency				Saliency Enhanced
Dataset	Partition	Count	BusterNet	PSCC-Net	ManTra-Net	OSN	BusterNet	PSCC-Net	ManTra-Net	OSN
IMD2020	< 0.2	179	0.53	0.57	0.65	0.71	0.55	0.72	0.90	0.79
	0.2-0.4	118	0.59	0.59	0.71	0.82	0.61	0.74	0.90	0.85
	0.4-0.6	270	0.65	0.59	0.72	0.86	0.67	0.73	0.91	0.89
	0.6-0.8	431	0.68	0.56	0.75	0.86	0.70	0.71	0.90	0.89
	> 0.8	1008	0.72	0.59	0.79	0.90	0.73	0.72	0.93	0.91
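
As a quick check of the caption's claim, the performance gap between the least and most salient IMD2020 groups can be computed directly from the values in Table 1. The snippet below only restates numbers from the table; the dictionary layout and the `gap` helper are illustrative.

```python
# (lowest-saliency AuROC, highest-saliency AuROC) per model, copied from Table 1.
table1 = {
    "BusterNet":  {"original": (0.53, 0.72), "enhanced": (0.55, 0.73)},
    "PSCC-Net":   {"original": (0.57, 0.59), "enhanced": (0.72, 0.72)},
    "ManTra-Net": {"original": (0.65, 0.79), "enhanced": (0.90, 0.93)},
    "OSN":        {"original": (0.71, 0.90), "enhanced": (0.79, 0.91)},
}

def gap(low_high):
    """Difference between the most and least salient groups, to 2 decimals."""
    low, high = low_high
    return round(high - low, 2)

gaps = {m: (gap(v["original"]), gap(v["enhanced"])) for m, v in table1.items()}
for model, (before, after) in gaps.items():
    print(f"{model:<11} gap before: {before:.2f}  after: {after:.2f}")
```

For every model the gap after enhancement is no larger than before; for ManTra-Net, for instance, it drops from 0.14 to 0.03.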

5 Results

Is there performance bias for manipulation detection algorithms based on the visual saliency of the manipulated region? In Experiment 1, discussed in Section 4, a similar trend of results across multiple networks provides tangible evidence of saliency bias. The results from the evaluation show a clear increase in performance as the manipulations get more salient across all evaluated datasets and for all models except BusterNet (see Fig. 5). BusterNet’s failure can be attributed to its inability to handle larger images. This clear performance gap between saliency groups, similar to the one seen in human performance (Fig. 3), indicates that saliency is a clear factor in the detectability of manipulations. The larger number of low-salient manipulations combined with detection performance bias may explain why RT is considered such a difficult dataset for many manipulation detection networks [21, 30].

Does varying the saliency of the manipulated region change the detection performance for machines and humans? Experiment 2 aimed to test whether average detection performance increases as we increase the saliency of the manipulated regions in images, resulting in a smaller performance gap between low-salient and high-salient images. This direct relation between manipulation saliency and detection performance (Table 1 and Fig. 7) reinforces evidence for our hypothesis that the more salient a manipulation, the more accurate its detection is by both humans and models.

A possible source of improvement in algorithmic detection performance is networks detecting the manipulation performed by the SaGIM network. However, if the networks were detecting the global changes made by SaGIM, the detection performance scores would be low by virtue of an increased false positive rate, i.e., the models would predict a pixel as manipulated when it was not indicated as manipulated in the original ground truth mask. Additionally, BusterNet, which previously failed to score well (Fig. 5), improves in detection performance because the resizing (downsampling to meet the input dimension requirements of SaGIM) improves its ability to handle the images; saliency enhancement itself led to only minor improvements in its performance. Conversely, the resizing caused PSCC-Net to perform worse, as also reported in the original paper [14], and while saliency enhancement has a positive impact, it is not enough to surpass its performance on the unprocessed images. Fig. 6 illustrates and summarizes the impact of using saliency re-attention over a manipulated image. Finally, a marginal decline in detection performance is observed for highly salient images following saliency enhancement. This phenomenon arises from the inherent challenge of augmenting the saliency of regions that are already highly salient. Enhancement attempts may inadvertently introduce changes that diminish the saliency of the region, consequently reducing the overall performance.

6 Semantic Relevance of Visual Saliency

The semantics within an image that a human focuses on can also represent saliency and can directly impact the perceived message. If an image region is not visually salient, it may not contribute to the interpretation of the image by a human viewer and can lead to misinformation. The semantic description of a scene can be obtained either by asking human participants (expensive and difficult to standardize) or by using a multimodal foundation model that embeds both images and semantic descriptions, originally provided by humans, into the same latent space.

To determine the effect of the saliency of a manipulation on the semantic interpretation of images, we use a pre-trained Contrastive Language-Image Pretraining (CLIP) [1] model to compare the important semantic elements within a scene before and after spliced edits or manipulation. CLIP has been used for evaluation of various high-level tasks such as image captioning [31] and image reconstruction [32]. In a similar vein, we employ it to evaluate the semantic change caused by manipulation. Given an image and a text corpus, in our case a dictionary of nouns (https://en.wiktionary.org), the model tries to relate the semantic content in images with text and returns a list of words relevant to the scene and their probabilities. By applying CLIP to both pristine images and their manipulated variants, we can investigate the correlation of the visual manipulation and its saliency to what machine algorithms find relevant in a scene. The exact metrics compare the words predicted with the highest relevance, i.e., probability, for both pristine and manipulated images and are explained with an example in Fig. 8.
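
To make the tagging step concrete, the sketch below shows how CLIP-style models typically turn an image embedding and a set of word embeddings into per-word probabilities: cosine similarity, scaling by a temperature, and a softmax over the vocabulary. The embeddings here are toy vectors, the function names are hypothetical, and the `logit_scale` of 100 is an assumed value, not a setting reported in this paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def tag_probabilities(image_emb, word_embs, logit_scale=100.0):
    """Softmax over scaled image-text similarities.

    word_embs maps each noun to its (toy) text embedding.
    """
    logits = {w: logit_scale * cosine(image_emb, e) for w, e in word_embs.items()}
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {w: math.exp(z - peak) for w, z in logits.items()}
    total = sum(exps.values())
    return {w: x / total for w, x in exps.items()}

def top_k(probs, k=5):
    """The k most probable (word, probability) pairs, best first."""
    return sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]

# Toy 2-D embeddings standing in for real CLIP image and text features.
vocabulary = {"dog": [1.0, 0.0], "cat": [0.0, 1.0], "car": [0.6, 0.8]}
probs = tag_probabilities([1.0, 0.0], vocabulary)
ranked = top_k(probs, k=3)
```

Running the same scoring on a pristine image and its manipulated variant yields the two tag lists that the metrics compare.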

Semantic change is analyzed using the aggregated change in the predicted tag lists and probabilities yielded by the pristine and tampered versions of an image, over 5 trials (since the model is stochastic in its predictions). Specifically, we use top1 overlap, top5 overlap, top5 IoU, and top5 probability change as metrics (reported in Fig. 9 for the RT and IMD2020 datasets for manipulations with varying saliency). If the pristine and tampered tag lists have a top1 overlap score of 0, the primary semantic meaning was changed by the manipulation. Similarly, if the top5 IoU and top5 overlap are low, the manipulated region greatly changes the semantic meaning. Top5 probability change calculates the sum of the change in probability of the top 5 tags for each image.
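
One plausible way to compute these four metrics for a single pristine/tampered pair is sketched below. The precise definitions are given by example in the paper's figure, so the formulas here are assumptions: overlap is the shared fraction of the top-k lists, IoU is intersection over union of the two tag sets, and probability change sums absolute differences over the pristine image's top-k tags. Aggregation over the 5 trials is omitted.

```python
def top_tags(probs, k):
    """Tags sorted by descending probability, truncated to k."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    return [tag for tag, _ in ranked[:k]]

def semantic_change(pristine, tampered, k=5):
    """Assumed definitions of the overlap, IoU and probability-change metrics."""
    p_top, t_top = top_tags(pristine, k), top_tags(tampered, k)
    shared = set(p_top) & set(t_top)
    union = set(p_top) | set(t_top)
    return {
        "top1_overlap": float(p_top[0] == t_top[0]),
        "top5_overlap": len(shared) / k,
        "top5_iou": len(shared) / len(union),
        # A tag missing from the tampered prediction counts as probability 0.
        "top5_prob_change": sum(
            abs(pristine[t] - tampered.get(t, 0.0)) for t in p_top
        ),
    }

# Hypothetical tag probabilities for one image before and after splicing.
pristine = {"dog": 0.50, "grass": 0.20, "ball": 0.12,
            "park": 0.09, "tree": 0.06, "sky": 0.03}
tampered = {"cat": 0.40, "grass": 0.22, "ball": 0.15,
            "park": 0.10, "tree": 0.08, "dog": 0.05}

metrics = semantic_change(pristine, tampered)
```

In this toy case the top tag changes from "dog" to "cat", so the top1 overlap is 0 even though four of the five top tags are shared.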

Do visual saliency biases also relate to semantic differences interpreted by general purpose vision models? Fig. 9 shows that on average, the higher the saliency of the manipulation, the lower the overlap metric scores. The top1 overlap is initially $0.93$, but decreases to $0.50$ by the final saliency group. The decrease in overlap metrics implies that there is higher perceivable semantic change as the manipulated region gets more salient. Similarly, the top5 probability change metric increases, from a 9% probability change for the lowest salient set to 20% for the highest. This increase shows that the higher the saliency of the manipulation, the greater the semantic change it causes.

7 Conclusion

This paper formally identifies the saliency of a manipulation as a factor in its detectability. Manipulations in two of the three investigated datasets are diverse with regard to saliency, while IMD2020 has generally highly salient manipulations. Our results indicate that the saliency of the manipulation is an important factor in changing the semantic meaning of the image and in the ability of both people and networks to localize it. Additionally, we show that increasing the saliency of the manipulated region with tools such as [28, 27, 29] results in increased detection performance for both humans and detection networks, reinforcing the hypothesis that saliency affects detection performance of manipulated images.

References

[1] Alec Radford et al., “Learning transferable visual models from natural language supervision,” in ICML, 2021, pp. 8748–8763.
[2] Walter Scheirer et al., “Perceptual annotation: Measuring human vision to improve computer vision,” IEEE TPAMI, vol. 36, pp. 1679–1686, 2014.
[3] Rishi Rajalingham et al., “Large-scale, high-resolution comparison of the core visual object recognition behavior of humans, monkeys, and state-of-the-art deep artificial neural networks,” Journal of Neuroscience, vol. 38, no. 33, pp. 7255–7269, 2018.
[4] Hany Farid and Mary Bravo, “Image forensic analyses that elude the human visual system,” in SPIE Media forensics and security II, 2010, vol. 7541, pp. 52–61.
[5] Sophie Nightingale, Kimberley Wade, and Derrick Watson, “Can people identify original and manipulated photos of real-world scenes?,” Springer Cognitive research: principles and implications, vol. 2, no. 1, pp. 1–21, 2017.
[6] Victor Schetinger et al., “Humans are easily fooled by digital images,” Elsevier Computers & Graphics, vol. 68, pp. 142–151, 2017.
[7] Laurent Itti, Christof Koch, and Ernst Niebur, “A model of saliency-based visual attention for rapid scene analysis,” IEEE TPAMI, vol. 20, pp. 1254–1259, 1998.
[8] Ali Borji, Dicky Sihite, and Laurent Itti, “What stands out in a scene? a study of human explicit saliency judgment,” Elsevier Vision Research, vol. 91, pp. 62–77, 2013.
[9] Lori McCay-Peet, Mounia Lalmas, and Vidhya Navalpakkam, “On saliency, affect and focused attention,” in ACM CHI, 2012, pp. 541–550.
[10] Rubi Hammer, “Impact of feature saliency on visual category learning,” Frontiers in Psychology, vol. 6, pp. 451, 2015.
[11] Matthew Groh et al., “Deepfake detection by human crowds, machines, and machine-informed crowds,” PNAS, vol. 119, no. 1, 2022.
[12] Ronald Rensink, Kevin O’Regan, and James Clark, “To see or not to see: The need for attention to perceive changes in scenes,” Psychological Science, vol. 8, no. 5, pp. 368–373, 1997.
[13] Yue Wu, Wael Abd-Almageed, and Prem Natarajan, “Mantra-net: Manipulation tracing network for detection and localization of image forgeries with anomalous features,” in IEEE/CVF CVPR, 2019.
[14] Xiaohong Liu et al., “Pscc-net: Progressive spatio-channel correlation network for image manipulation detection and localization,” IEEE TCSVT, vol. 32, pp. 7505–7517, 2022.
[15] Yue Wu, Wael Abd-Almageed, and Prem Natarajan, “Busternet: Detecting copy-move image forgery with source/target localization,” in ECCV, 2018, pp. 168–184.
[16] Haiwei Wu et al., “Robust image forgery detection against transmission over online social networks,” IEEE TIFS, vol. 17, no. 1, pp. 443–456, 2022.
[17] Ruyong Ren et al., “Multi-scale attention context-aware network for detection and localization of image splicing: Efficient and robust identification network,” Springer Applied Intelligence, pp. 1–20, 2023.
[18] Xiao Guo et al., “Hierarchical fine-grained image forgery detection and localization,” in IEEE/CVF CVPR, 2023, pp. 3155–3165.
[19] Paweł Korus and Jiwu Huang, “Multi-scale analysis strategies in prnu-based tampering localization,” IEEE TIFS, vol. 12, no. 4, pp. 809–824, 2016.
[20] Paweł Korus and Jiwu Huang, “Evaluation of random field models in multi-modal unsupervised tampering localization,” in IEEE WIFS, 2016, pp. 1–6.
[21] Owen Mayer and Matthew Stamm, “Exposing fake images with forensic similarity graphs,” IEEE JSTSP, vol. 14, no. 5, pp. 1049–1064, 2020.
[22] Benjamin Douglas, Patrick Ewell, and Markus Brauer, “Data quality in online human subjects research: Comparisons between mturk, prolific, cloudresearch, qualtrics, and sona,” Plos One, vol. 18, 2023.
[23] Haiying Guan et al., “Mfc datasets: Large-scale benchmark datasets for media forensic challenge evaluation,” in IEEE/CVF WACV, 2019, pp. 63–72.
[24] Adam Novozamsky, Babak Mahdian, and Stanislav Saic, “Imd2020: A large-scale annotated dataset tailored for detecting manipulated images,” in IEEE/CVF WACV, 2020, pp. 71–80.
[25] Xuebin Qin et al., “U2-net: Going deeper with nested u-structure for salient object detection,” Elsevier Pattern Recognition, vol. 106, pp. 107404, 2020.
[26] Zijun Deng et al., “R3net: Recurrent residual refinement network for saliency detection,” in AAAI IJCAI, 2018, pp. 684–690.
[27] Youssef Mejjati et al., “Look here! A parametric learning based approach to redirect visual attention,” in Springer ECCV, 2020, pp. 343–361.
[28] Yen-Chung Chen et al., “Guide your eyes: Learning image manipulation under saliency guidance,” in BMVC, 2019, vol. 2, p. 3.
[29] Mahdi Miangoleh et al., “Realistic saliency guided image enhancement,” in IEEE/CVF CVPR, 2023, pp. 186–194.
[30] Susmit Agrawal et al., “Sisl: self-supervised image signature learning for splicing detection & localization,” in IEEE/CVF CVPR, 2022, pp. 22–32.
[31] Jack Hessel et al., “CLIPScore: A reference-free evaluation metric for image captioning,” in ACL EMNLP, 2021, pp. 7514–7528.
[32] Yu Takagi and Shinji Nishimoto, “High-resolution image reconstruction with latent diffusion models from human brain activity,” in IEEE/CVF CVPR, 2023.