Identifying Important Group of Pixels using Interactions
Abstract
To better understand the behavior of image classifiers, it is useful to visualize the contribution of individual pixels to the model prediction. In this study, we propose a method, MoXI (Model eXplanation by Interactions), that efficiently and accurately identifies a group of pixels with high prediction confidence. The proposed method employs game-theoretic concepts, Shapley values and interactions, taking into account the effects of individual pixels and the cooperative influence of pixels on model confidence. Theoretical analysis and experiments demonstrate that our method better identifies the pixels that are highly contributing to the model outputs than widely-used visualization by Grad-CAM, Attention rollout, and Shapley value. While prior studies have suffered from the exponential computational cost in the computation of Shapley value and interactions, we show that this can be reduced to quadratic cost for our task. The code is available at https://github.com/KosukeSumiyasu/MoXI.
1 Introduction
Visualization of important image pixels has been widely used to understand machine learning models in computer vision tasks such as image classification [31, 23, 1, 20, 18, 3]. To this end, visualization methods compute the contribution of each pixel to model decisions. For example, Grad-CAM [23] measures the contribution using a weighted sum of the feature maps of convolutional layers, where weights are determined by the gradient of confidence score for any target class with respect to the feature map entries. Attention rollout [1] measures it based on the attention weight of encoders of a Vision Transformer.
Several recent studies revealed that a game-theoretic concept, the Shapley value [24], is a powerful indicator of pixel contribution [18, 16, 8]. In multi-player games, the Shapley value measures the contribution of each player as the average change in the total game reward with his/her presence versus absence. When applied to an image classifier, the pixels of an image are the players, which work cooperatively for the model output (e.g., confidence score). Unlike Grad-CAM and Attention rollout, Shapley values compute the contribution of pixels to the model output more directly. The former methods use feature maps or attention weights, whose entry magnitudes are not necessarily well aligned with their contributions to the model output, whereas the latter uses logits or confidence scores. Indeed, Fig. 1 shows that the pixels with high Shapley values have a significantly larger impact on confidence scores than those determined by Grad-CAM or Attention rollout in both (a) the insertion case and (b) the deletion case.
A crucial caveat of the aforementioned methods is that they identify a group of important pixels by the individual contribution of each pixel and overlook the collective contribution of multiple pixels. For example, Fig. 1(a) shows that the three methods only highlight the class object (i.e., duck) and do not indicate the background (i.e., sea) as an informative factor. However, the set of pixels with the highest individual contributions (e.g., highest Shapley values) is not necessarily the most informative pixel set as a whole, because the information overlap among pixels is not considered. Indeed, the bottom row of Fig. 1(a) shows that the class object and background act in synergy and greatly impact the confidence score.
In this paper, we propose an efficient game-theoretic visualization method of image pixels with a high impact on the prediction of an image classifier. Besides Shapley values, we exploit interactions, a game-theoretic concept that reflects the average effect of the cooperation of pixels. Namely, unlike prior methods, including Grad-CAM, Attention rollout, and Shapley values, the proposed method takes into account the cooperative contribution of pixels and identifies the important image pixels as a whole. In Fig. 1(a), the proposed method identifies a pixel set on which the classifier puts high classification confidence. Similarly, in Fig. 1(b), it identifies a minimal pixel set without which the classification fails. Notably, we define self-context variants of Shapley values and interactions and reduce the number of forward passes from exponential to quadratic in the number of pixels, which removes the main obstacle to using game-theoretic approaches as practical tools for model explanation.
In the experiments, we consider the insertion curve and deletion curve on a subset of ImageNet images that are correctly classified by a pretrained classifier. Starting from fully masked images, an insertion curve plots the increase of classification accuracy as unmasking image patches from highly contributing ones determined by each method. Similarly, a deletion curve plots the accuracy decrease from the clean images to fully masked ones. The results show that the proposed method gives sharp insertion/deletion curves. For example, the classification accuracy reached with images with 4% unmasked patches if selected by the proposed method, significantly outperforming the results of Grad-CAM (accuracy of ), Attention rollout (accuracy of ), and Shapley values (accuracy of ). Similar results are observed for the deletion curves and also when we use common corruptions [15] instead of masking. Qualitatively, the heatmaps using the patches selected in the early stage of the insertion curve show that our method highlights both a class object and background, while the other methods mostly highlight the class object only. Meanwhile, in the heatmaps from the deletion curves, our method particularly highlights the class-discriminative region of the object, while the others do not.
Our contributions are summarized as follows:
- We propose an efficient game-theoretic visualization method, named MoXI (Model eXplanation by Interactions), for a group of pixels that significantly influences the classification.
- Our analysis supports a simple greedy strategy from a game-theoretic perspective, leading us to use self-context variants of Shapley values and interactions, which can be computed exponentially faster than computing the original ones.
- Extensive experiments show that our method more accurately identifies the pixels that are highly contributing to the model outputs than standard visualization methods.
2 Related Work
Visual explanation of model decision.
Various methods have been proposed to understand deep learning models for vision tasks by quantifying and visualizing the contribution of image pixels to the model output [31, 23, 1, 20, 18, 3, 5, 27, 6]. The contribution of pixels has been typically measured using feature maps in models. For example, Grad-CAM [23] determines the contribution by applying weights to the feature maps of the convolutional layers of a CNN using gradients. Attention rollout [1], commonly used for Vision Transformers, calculates the contributions using attention maps. Several methods instead calculate the contribution of each pixel by analyzing the sensitivity of the confidence score with respect to each pixel [20, 18, 16, 8]. For example, RISE (Randomized Input Sampling for Explanation; [20]) calculates the contributions empirically by probing the model with randomly masked images of the input image and obtaining the corresponding confidence scores. SHAP (SHapley Additive exPlanations; [18]) distributes confidence scores fairly to contributions by leveraging Shapley values from game theory. Importantly, the aforementioned methods all measure the contribution of each pixel independently; the collection of important pixels consists of the pixels with high contributions. In contrast, this study identifies the important pixels by further taking into account the collective contributions of pixels.
Game-theoretic approach of model explanation.
Several recent studies have utilized a game-theoretic concept, interactions, to analyze various phenomena of deep learning models and quantify an effect of pixel cooperation on the model inference [7, 9, 21, 28, 29, 25]. Wang et al. [28] showed that the transferability of adversarial images has a negative correlation to the interactions. Zhang et al. [29] showed the similarity between the computation of interactions and dropout regularization. Deng et al. [9] discussed the difference in information obtained between humans and machine learning models using interactions. Sumiyasu et al. [25] investigated misclassification by models using interactions and discovered that the distribution of interactions varies with the type of misclassified images. Thus, interactions are helpful for understanding the model from the perspective of cooperative relationships between pixels. A critical issue of interaction-based analysis is its computational cost; the computation of interaction requires an exponential number of forward passes with respect to the number of pixels. In this paper, we propose an efficient approach to explain a model using variants of interactions (and also Shapley values), achieving the identification of important pixels with only a quadratic number of forward passes.
3 Preliminaries
Shapley values.
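The Shapley value was proposed in game theory to measure the contribution of each player to the total reward obtained by multiple players working cooperatively [24]. Let $N = \{1, \ldots, n\}$ be the index set of players, and let $2^N$ be its power set. Given a reward function $f: 2^N \to \mathbb{R}$, the Shapley value of player $i$ with a context $N$ is defined as follows.
$$\phi(i \mid N) := \sum_{S \subseteq N \setminus \{i\}} P(S \mid N \setminus \{i\}) \left[ f(S \cup \{i\}) - f(S) \right], \qquad (1)$$
where $P(A \mid B) = \frac{(|B| - |A|)!\,|A|!}{(|B| + 1)!}$. Here, $|\cdot|$ denotes the cardinality of a set. Namely, the Shapley value averages, over all $S \subseteq N \setminus \{i\}$, the reward increase caused by the participation of player $i$ in player set $S$.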
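As a small illustration of the definition, the following sketch computes exact Shapley values by enumerating every subset of the other players, using the standard weight $|S|!\,(n-1-|S|)!/n!$. The function name `shapley_value` and the three-player `toy_reward` are hypothetical and chosen only for this example; enumeration like this is feasible only for a handful of players.

```python
from itertools import combinations
from math import factorial


def shapley_value(i, players, f):
    """Exact Shapley value of player i within the context `players`."""
    others = [p for p in players if p != i]
    m = len(others)  # number of players other than i
    total = 0.0
    for size in range(m + 1):
        # Standard Shapley weight for a subset S of the others with |S| = size.
        weight = factorial(m - size) * factorial(size) / factorial(m + 1)
        for subset in combinations(others, size):
            s = frozenset(subset)
            total += weight * (f(s | {i}) - f(s))  # marginal gain of i on top of S
    return total


def toy_reward(s):
    """Hypothetical reward: players 0 and 1 only help together, player 2 helps alone."""
    return (2.0 if {0, 1} <= s else 0.0) + (1.0 if 2 in s else 0.0)


players = [0, 1, 2]
values = [shapley_value(i, players, toy_reward) for i in players]
print(values)  # the values sum to toy_reward(all players) - toy_reward(no players)
```

The loop over all subsets is what makes the exact computation exponential in the number of players.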
Interactions.
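Interactions measure the contribution of the cooperation between two players to the total reward [13]. The interaction between players $i$ and $j$ is defined as follows.
$$I(i, j) := \phi(S_{ij} \mid N') - \phi(i \mid N \setminus \{j\}) - \phi(j \mid N \setminus \{i\}), \qquad (2)$$
where $\phi(\cdot \mid \cdot)$ is the Shapley value of Eq. (1) with the indicated context, the two players are regarded as a single player $S_{ij} = \{i, j\}$, and $N' = (N \setminus \{i, j\}) \cup \{S_{ij}\}$ (i.e., $|N'| = n - 1$). In Eq. (2), the first term corresponds to the joint contribution from players $i$ and $j$, and the second and the third terms correspond to the individual contributions of players $i$ and $j$, respectively. Namely, interactions quantify the average effect on the reward of the two players joining simultaneously. Importantly, we have $I(i, j) = I(j, i)$.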
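The interaction can be sketched in the same brute-force style: merge the two players into one, take the Shapley value of the merged player, and subtract the Shapley value of each player computed with the other one absent. The helper names and the toy reward below are hypothetical, and the code is an illustration for tiny games rather than the authors' implementation.

```python
from itertools import combinations
from math import factorial


def shapley_value(i, players, f):
    """Exact Shapley value of player i within the context `players`."""
    others = [p for p in players if p != i]
    m = len(others)
    total = 0.0
    for size in range(m + 1):
        weight = factorial(m - size) * factorial(size) / factorial(m + 1)
        for subset in combinations(others, size):
            s = frozenset(subset)
            total += weight * (f(s | {i}) - f(s))
    return total


def interaction(i, j, players, f):
    """Joint contribution of {i, j} as one player minus the two individual ones."""
    merged = ("merged", i, j)  # stands for the single player formed by i and j
    reduced = [p for p in players if p not in (i, j)] + [merged]

    def f_merged(s):
        # Expand the merged player back into i and j before calling f.
        expanded = set()
        for p in s:
            expanded |= {i, j} if p == merged else {p}
        return f(frozenset(expanded))

    joint = shapley_value(merged, reduced, f_merged)
    alone_i = shapley_value(i, [p for p in players if p != j], f)
    alone_j = shapley_value(j, [p for p in players if p != i], f)
    return joint - alone_i - alone_j


def toy_reward(s):
    """Hypothetical reward: players 0 and 1 only help together, player 2 helps alone."""
    return (2.0 if {0, 1} <= s else 0.0) + (1.0 if 2 in s else 0.0)


players = [0, 1, 2]
synergy_01 = interaction(0, 1, players, toy_reward)  # positive: 0 and 1 cooperate
synergy_02 = interaction(0, 2, players, toy_reward)  # zero: 0 and 2 are independent
```

In this toy game the cooperating pair has a positive interaction and the independent pair has none, which is the behavior the definition is meant to capture.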
Application to image classifiers.
In the application of Shapley values and interactions to image classifiers, an image $x$ with $n$ pixels is regarded as the index set $N$ of players. Typically, the reward function is defined by $f(x) = \log \frac{P(y \mid x)}{1 - P(y \mid x)}$ [9], where $y$ represents the class of $x$, and $P(y \mid x)$ denotes the classifier's confidence score on class $y$ with input $x$. The reward $f(S)$ of a subset $S \subseteq N$ of pixels of image $x$ is similarly computed by feeding a partially masked $x$ to the classifier (i.e., the pixels in $N \setminus S$ are masked).
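A minimal sketch of how the reward of a pixel (or patch) subset might be evaluated, assuming the log-odds form of the confidence score and base-value masking. The names `log_odds`, `subset_reward`, and `toy_confidence` are hypothetical; `toy_confidence` is a stand-in for a real classifier, and attention masking for Vision Transformers is not shown.

```python
import math


def log_odds(confidence, eps=1e-12):
    """Log-odds of a confidence score, clipped to avoid log(0) and division by zero."""
    p = min(max(confidence, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def subset_reward(patches, subset, confidence_fn, base_value=0.0):
    """Reward of `subset`: patches outside the subset are replaced by a base value."""
    masked = [
        patch if idx in subset else [base_value] * len(patch)
        for idx, patch in enumerate(patches)
    ]
    return log_odds(confidence_fn(masked))


def toy_confidence(patches):
    """Hypothetical stand-in for a classifier: confidence is the mean intensity."""
    flat = [v for patch in patches for v in patch]
    return sum(flat) / len(flat)


patches = [[0.9, 0.9], [0.1, 0.1]]  # two tiny "patches" of two pixels each
reward_all = subset_reward(patches, {0, 1}, toy_confidence)
reward_first = subset_reward(patches, {0}, toy_confidence)
reward_none = subset_reward(patches, set(), toy_confidence)
```

With a real model, `confidence_fn` would run one forward pass per call, which is why the number of evaluated subsets is the relevant cost measure.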
If the classifier is a convolutional neural network (CNN), the masked region is conventionally filled with some base value, such as 0 or the average pixel value [2, 30]. Such a replacement drops the original information of an image but may also inject a new feature. Thus, the choice of base value affects the Shapley values and interactions. In contrast, when a Vision Transformer is used, one can realize masking in a rigorous manner by applying a mask to the attention. Most prior studies exploited Shapley values and interactions on CNNs with base value replacement, which might not unleash the full potential of these quantities. To our knowledge, the only exception is [8], which demonstrated that Shapley values can be calculated more accurately using attention masking. We follow this strategy in the computation of Shapley values and interactions for Vision Transformers.
4 Method
We address the problem of identifying in a given image a set of pixels that significantly influence the confidence score of a classifier. While prior studies solve this by explicitly or implicitly measuring the independent contribution of each pixel to the confidence score, the proposed method takes into account the collective contribution of pixels using interactions. We refer to the proposed method as MoXI (Model eXplanation by Interactions).
We consider two approaches to measuring the contribution of pixels to the confidence score: (i) pixel insertion and (ii) pixel deletion. The former measures the contribution of a pixel by the confidence gain when it is unmasked as in Eqs. (1) and (2), while the latter measures it by the confidence drop when it is masked.
4.1 Pixel Insertion
Problem 1
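Let $N$ be the index set of all pixels of image $x$. Let $f: 2^N \to \mathbb{R}$ be a function that gives the confidence score on the class of $x$ for a given index set, with the convention that pixels not included in the index set are masked. Find a subset $S_k \subseteq N$ such that
$$S_k = \arg\max_{S \subseteq N,\ |S| = k} f(S) \qquad (3)$$
for $k = 1, \ldots, n$.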
By its formulation, this problem is NP-hard in general. In particular, $f$ is here a CNN or Vision Transformer, a highly nonlinear function. Thus, we resort to a greedy strategy to solve it approximately. (Footnote 1: With this assumption, we use a slight abuse of notation and assume, e.g., because in either case of or , we input the image with $n$ pixels to the model.)
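For $k = 1$, the index $b_1$ of the pixel with the highest Shapley value $\phi(i \mid \{i\})$ gives the optimal set $S_1 = \{b_1\}$ by its definition. For $k = 2$, we select as the next pixel $b_2$ the one maximizing $f(S_1 \cup \{i\})$. Importantly, this is equivalent to maximizing the sum of the Shapley value and interaction, not the Shapley value alone.
$$b_2 = \arg\max_{i \in N \setminus S_1} f(S_1 \cup \{i\}) = \arg\max_{i \in N \setminus S_1} \left[ \phi^{\mathrm{self}}(i) + I^{\mathrm{self}}(i, b_1) \right], \qquad (4)$$
where
$$\phi^{\mathrm{self}}(i) := \phi(i \mid \{i\}) = f(\{i\}) - f(\emptyset), \qquad (5)$$
$$I^{\mathrm{self}}(i, j) := \phi(S_{ij} \mid \{S_{ij}\}) - \phi(i \mid \{i\}) - \phi(j \mid \{j\}) = f(\{i, j\}) - f(\{i\}) - f(\{j\}) + f(\emptyset). \qquad (6)$$
Indeed, $\phi^{\mathrm{self}}(i) + I^{\mathrm{self}}(i, b_1) = f(S_1 \cup \{i\}) - f(S_1)$, and $f(S_1)$ does not depend on $i$. We refer to such a particular form of Shapley values and interactions as self-context in the pixel insertion approach, and they play an essential role in our framework. For $k \geq 2$, with $S_k = S_{k-1} \cup \{b_k\}$, we can similarly show that maximizing $f(S_k \cup \{i\})$ with respect to $i$ is equivalent to
$$b_{k+1} = \arg\max_{i \in N \setminus S_k} \left[ \phi^{\mathrm{self}}(i) + I^{\mathrm{self}}(i, S_k) \right], \qquad (7)$$
where $S_k$ is regarded as a single player.
Equation (7) shows that for identifying the index $b_{k+1}$, it is crucial to consider the interaction between $i$ and $S_k$. Even when a pixel indexed $i$ has a large Shapley value (the first term), it may have a large negative interaction (the second term) if its pixel information overlaps with that of $S_k$. Namely, collecting pixels with large Shapley values does not necessarily give the most informative pixel set.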
To summarize, our analysis justifies a very simple greedy algorithm (Algorithm 1) from a game-theoretic perspective. The algorithm seems trivial in hindsight, but prior studies visualize highly contributing pixels using Shapley values only [18, 16, 8].
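The greedy insertion procedure can be sketched as follows. This is an illustration under assumptions, not a reproduction of Algorithm 1: the reward is any callable on index sets, the self-context quantities are written as $f(\{i\}) - f(\emptyset)$ and $f(S \cup \{i\}) - f(\{i\}) - f(S) + f(\emptyset)$, and the names `greedy_insertion` and `coverage_reward` are hypothetical.

```python
def greedy_insertion(n, f):
    """Order patches by greedy insertion; return the order and the number of f calls."""
    cache = {}

    def reward(s):
        s = frozenset(s)
        if s not in cache:  # each distinct subset costs one forward pass
            cache[s] = f(s)
        return cache[s]

    def self_shapley(i):
        return reward({i}) - reward(())

    def self_interaction(i, group):
        return reward(group | {i}) - reward({i}) - reward(group) + reward(())

    selected, remaining = [], list(range(n))
    while remaining:
        group = frozenset(selected)
        # The score equals reward(group | {i}) - reward(group), the marginal gain.
        best = max(remaining, key=lambda i: self_shapley(i) + self_interaction(i, group))
        selected.append(best)
        remaining.remove(best)
    return selected, len(cache)


# Hypothetical toy reward: the number of distinct concepts visible in the subset.
patch_concepts = [{"duck"}, {"duck"}, {"sea"}]


def coverage_reward(s):
    return len(set().union(*(patch_concepts[i] for i in s)))


order, num_calls = greedy_insertion(len(patch_concepts), coverage_reward)
print(order, num_calls)
```

In the toy example all three patches have the same self-context Shapley value, yet the second pick is the "sea" patch, because the remaining "duck" patch has a negative interaction with the one already selected.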
Computational cost.
The identification of important pixels (or patches, in practice) using Shapley values requires $\mathcal{O}(2^n)$ forward passes because of the average over all $S \subseteq N \setminus \{i\}$ for all $i \in N$ (cf. Eq. (1)). In contrast, our approach only requires $\mathcal{O}(n^2)$ forward passes in the worst case (see Appendix C for details of the algorithm complexity and runtime).
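To make the gap concrete, the following back-of-the-envelope count assumes one forward pass per remaining candidate at each greedy step and no early stopping, and compares it with the number of distinct subsets that an exact Shapley computation has to evaluate. The function names are hypothetical, and 196 is used only as an example size (a $14 \times 14$ patch grid).

```python
def greedy_forward_passes(n):
    """Candidates evaluated by a greedy pass: n at step 1, n - 1 at step 2, and so on."""
    return sum(n - step for step in range(n))  # equals n * (n + 1) / 2


def exact_subset_count(n):
    """Distinct subsets whose reward is needed for exact Shapley values."""
    return 2 ** n


n_patches = 196  # example: a 14 x 14 patch grid
print(greedy_forward_passes(n_patches))  # quadratic in n
print(exact_subset_count(n_patches))     # exponential in n
```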
Set-Sum task.
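We now give an intuitive example showing the necessity of interactions, using the Set-Sum task. The Set-Sum task is a variant of Problem 1 with a collection of integers $\{a_i\}_{i \in N}$ and reward function $f(S) = \mathrm{SetSum}(\{a_i\}_{i \in S})$ for $S \subseteq N$, where $\mathrm{SetSum}(\cdot)$ denotes the sum of all types of (i.e., distinct) integers in the collection. For example, $f(S) = 1 + 2 = 3$ for $\{a_i\}_{i \in S} = \{1, 1, 2\}$. Note that for any $i \neq j$, the self-context interaction of Eq. (6) satisfies $I^{\mathrm{self}}(i, j) = -a_i$ if $a_i = a_j$ and otherwise $I^{\mathrm{self}}(i, j) = 0$. In this way, when the features already possessed are equal to the newly added features, the model does not gain new information. This shows the role of interaction in considering information redundancy.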
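The Set-Sum example can be checked numerically. The sketch below assumes the reward is the sum of distinct integers in the selected subset and uses an arbitrary collection `[3, 3, 2, 1]` chosen for illustration; all function names are hypothetical.

```python
def set_sum_reward(values, subset):
    """Sum of the distinct integers selected by `subset`."""
    return sum({values[i] for i in subset})


def self_shapley(values, i):
    return set_sum_reward(values, {i}) - set_sum_reward(values, set())


def self_interaction(values, i, j):
    return (set_sum_reward(values, {i, j}) - set_sum_reward(values, {i})
            - set_sum_reward(values, {j}) + set_sum_reward(values, set()))


values = [3, 3, 2, 1]  # illustrative collection with one duplicated integer
indices = range(len(values))

# Baseline: take the two indices with the largest individual Shapley values.
top2_by_shapley = sorted(indices, key=lambda i: -self_shapley(values, i))[:2]

# Greedy with interactions: the second pick accounts for redundancy with the first.
first = max(indices, key=lambda i: self_shapley(values, i))
second = max(
    (i for i in indices if i != first),
    key=lambda i: self_shapley(values, i) + self_interaction(values, i, first),
)

print(set_sum_reward(values, top2_by_shapley))    # duplicate 3 is counted once
print(set_sum_reward(values, {first, second}))    # 3 and 2 are both counted
```

Ranking by Shapley value alone picks the duplicated integer twice, whereas adding the interaction term steers the second pick to a new integer.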
Visual Set-Sum task.
We empirically confirm the advantage of using interactions in the visual Set-Sum task on a synthetic MNIST dataset. The task is to predict the sum of all types of numbers in an image using a model. We constructed composite images, each of which consists of four randomly selected MNIST images arranged in a 2x2 grid (cf. Fig. 2(a)). The label of a composite image is the sum of all types of numbers in the image, as in the Set-Sum task. The evaluation uses the insertion curve, as detailed in Sec. 5. For the model and dataset details, refer to Appendix A. The insertion curves in Fig. 2(b) show that MoXI achieves higher accuracy than MoXI(-), which uses only self-context Shapley values, and than the Shapley value method when 50% and 75% of the image area are unmasked, i.e., when the second and the third numbers are appended. This demonstrates that MoXI acquires non-redundant information more effectively.
4.2 Pixel Deletion
To address Problem 1, we considered the problem of identifying groups of pixels with high confidence scores through pixel insertion. Here, we aim at decreasing the confidence scores via pixel deletion.
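Problem 2
Let $N$ and $f$ be as in Problem 1. Find a subset $S_k \subseteq N$ such that
$$S_k = \arg\min_{S \subseteq N,\ |S| = k} f(N \setminus S) \qquad (8)$$
for $k = 1, \ldots, n$.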
We again resort to a greedy approach. The key difference is that now we define and utilize a variant of Shapley value that measures the contribution of a player by its absence.
(9) |
where . This Shapley value quantifies the average impact attributable to the removal of player . In Problem 1, we addressed the issue by defining self-context Shapley values and interactions, as it involves the case of incrementally adding pixels from the entire image. In contrast, Problem 2 involves the sequential deletion of pixels from an image, necessitating the formulation of full-context Shapley values and interactions as follows:
(10) | |||
(11) |
With these quantities, the greedy algorithm for pixel deletion is as follows. For , the index of the pixel with the highest (deletion-based) Shapley value gives the optimal set by its definition. For , we select the next pixel that minimizes . This choice is again explained as a sum of Shapley value and interaction,
(12) |
For , we can similarly show that minimizing with respect to is equivalent to
(13) |
Again, the greedy algorithm is described from a game-theoretic viewpoint. The only difference from the insertion case is that the interaction term is now weighted. Algorithm 2 summarizes the procedure. The computational cost of the pixel deletion approach is the same as that of the pixel insertion approach, requiring only $\mathcal{O}(n^2)$ forward passes in the worst case.
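A minimal sketch of greedy deletion, written directly from the selection rule stated above (at each step, delete the patch whose removal leaves the lowest reward on the remaining patches). It does not reproduce the weighted Shapley-plus-interaction form of the equations, and the names `greedy_deletion` and `concept_reward` are hypothetical.

```python
def greedy_deletion(n, f):
    """Order patches by greedy deletion: each step removes the most damaging patch."""
    full = frozenset(range(n))
    deleted, remaining = [], list(range(n))
    while remaining:
        kept = full - frozenset(deleted)
        # Pick the patch whose deletion minimizes the reward of what is left.
        best = min(remaining, key=lambda i: f(kept - {i}))
        deleted.append(best)
        remaining.remove(best)
    return deleted


# Hypothetical toy reward: the number of distinct concepts still visible.
patch_concepts = ["body", "body", "beak"]


def concept_reward(s):
    return len({patch_concepts[i] for i in s})


deletion_order = greedy_deletion(len(patch_concepts), concept_reward)
print(deletion_order)
```

In the toy example the unique "beak" patch is deleted first, since either "body" patch can be removed without loss while the other is still visible.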
5 Experiments
In this section, we evaluate the characteristics of identified patches through comparative experiments with existing methods and demonstrate the effectiveness of our method.
Setup.
Our experiments utilize the ImageNet dataset [10] and focus on analyzing a Vision Transformer [11] pre-trained for the classification task. For baseline methods, we use Grad-CAM [23] (the target layer is set to the one before the layer normalization in the final attention block of the network; this choice is common, see https://github.com/jacobgil/pytorch-grad-cam), Grad-CAM++ [5], Attention rollout [1], Shapley values, and MoXI(-), which does not utilize the interactions present in MoXI. For insertion curve experiments, we use the Pixel Insertion approach, while for deletion curves, we utilize the Pixel Deletion approach. Following the previous studies [21, 29], we consider image patches instead of pixels to reduce computational costs. All methods calculate the contributions for $14 \times 14$ patches with a patch size of $16 \times 16$, which is equal to the patch size and the number of tokens in standard ViT models. We used a pre-trained ViT-T (https://huggingface.co/WinKawaks/vit-tiny-patch16-224) [11], DeiT-T (https://huggingface.co/facebook/deit-tiny-patch16-224) [26], and ResNet-18 (https://huggingface.co/microsoft/resnet-18) [14]. We selected 1000 images, one corresponding to each label, all of which were successfully classified in the test set. To reduce the computational burden, we computed Shapley values approximately by random sampling of $S$ in Eq. (1) as in other studies [4, 22, 29, 25]. The sampling size is set to 200. Moreover, we have adopted feature patch deletion as the masking method for Shapley values and interactions. In the following, we focus on ViT-T. See Appendix B for more results.
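The sampling-based approximation of Shapley values can be sketched as below. The exact sampling scheme used in the experiments is not specified here, so this shows one standard unbiased scheme as an assumption: draw the subset size uniformly, then draw a subset of that size uniformly, and average the marginal gains. The function name and the toy rewards are hypothetical; only the sample count of 200 comes from the setup above.

```python
import random


def sampled_shapley(i, n, f, num_samples=200, seed=0):
    """Monte Carlo estimate of the Shapley value of player i among n players."""
    rng = random.Random(seed)
    others = [p for p in range(n) if p != i]
    total = 0.0
    for _ in range(num_samples):
        size = rng.randint(0, len(others))           # uniform over subset sizes
        subset = frozenset(rng.sample(others, size))  # uniform subset of that size
        total += f(subset | {i}) - f(subset)          # marginal gain of player i
    return total / num_samples


# Hypothetical additive reward: every marginal gain of player i equals weights[i].
weights = [2, 5, 1]


def additive_reward(s):
    return sum(weights[p] for p in s)


# Hypothetical reward with synergy between players 0 and 1.
def synergy_reward(s):
    return 2.0 if {0, 1} <= s else 0.0


estimate_additive = sampled_shapley(1, len(weights), additive_reward)
estimate_synergy = sampled_shapley(0, 3, synergy_reward)
```

Each sample costs two reward evaluations, so the total cost is controlled by `num_samples` rather than by the number of subsets.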
5.1 Evaluating the importance of identified patches
We evaluate the importance of the image patches as determined by the above methods, using insertion/deletion curve metrics. The insertion curve identifies information-rich patches, while the deletion curve helps identify patches important for the model’s decision-making process. In our insertion/deletion curve experiments, we utilized the masking method for patch deletion. For Grad-CAM, Attention rollout, and Shapley value, image patches are inserted and deleted in the same order.
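The two curves can be computed from a per-image patch ordering as in the sketch below. The interface is an assumption made for illustration: `orderings` holds one patch order per image, and `is_correct(image_index, visible_patches)` is a hypothetical callback that reports whether the classifier is correct when only the given patches are visible.

```python
def accuracy_curve(orderings, is_correct, mode="insertion"):
    """Accuracy after inserting (or deleting) the first k patches of each ordering."""
    num_patches = len(orderings[0])
    curve = []
    for k in range(num_patches + 1):
        hits = 0
        for image_index, order in enumerate(orderings):
            chosen = frozenset(order[:k])
            if mode == "insertion":
                visible = chosen                                   # start fully masked
            else:
                visible = frozenset(range(num_patches)) - chosen   # start from clean image
            hits += bool(is_correct(image_index, visible))
        curve.append(hits / len(orderings))
    return curve


# Hypothetical toy classifier: image m is classified correctly iff its key patch is visible.
key_patch = [2, 0]


def toy_is_correct(image_index, visible):
    return key_patch[image_index] in visible


orderings = [[2, 0, 1], [1, 0, 2]]
insertion = accuracy_curve(orderings, toy_is_correct, mode="insertion")
deletion = accuracy_curve(orderings, toy_is_correct, mode="deletion")
```

A good ordering makes the insertion curve rise early and the deletion curve fall early; in the toy data the first ordering does both for its image, while the second does not.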
The insertion curves in Fig. 3(a) show that MoXI exhibits a sharper increase in classification accuracy compared to the other methods. In particular, even with images where only is visible, MoXI achieves an accuracy of , whereas Grad-CAM, Attention rollout, and Shapley value achieve , , and , respectively. This result indicates that MoXI can efficiently identify important patches for classification. Then, both the self-context and original Shapley values, which are based on confidence scores, achieve a sharper increase in classification accuracy. However, these two methods calculate the importance of individual patches and often select patches with similar information. Consequently, MoXI can identify features contributing to a higher classification accuracy than these methods.
The deletion curves in Fig. 3(b) show that MoXI exhibits a sharp decrease in classification accuracy compared to the other methods. When concealing just of an image, MoXI significantly decreases the model’s accuracy to . In contrast, Grad-CAM and Attention rollout only decrease the accuracy to approximately under the same conditions. This result indicates that MoXI, which accounts for interactions between patches, effectively identifies the image patches important for classification. We observed analogous results for DeiT-T [26] and ResNet-18 [14] models, as detailed in Appendix B. Additionally, we discuss the application of masks using our method in Appendix D.
5.2 Confidence score-based visualization
We introduce two heatmap-based visualization methods tailored for analyzing insertion and deletion patches. The first method visualizes insertion patches, highlighting those important for accurate classification. The second focuses on deletion patches, specifically identifying those whose deletion significantly impacts the classification. The heatmap shows higher values, indicated by shades closer to red, for patches that were inserted or deleted earlier. The insertion or deletion stops when the model reaches a successful classification or misclassification.
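One possible way to turn an insertion or deletion order into such a heatmap is sketched below. The linear mapping from rank to heat value and the function name `order_to_heatmap` are assumptions made for illustration; only the idea that earlier patches receive higher values, and that patches beyond the stopping point stay unhighlighted, follows the description above.

```python
def order_to_heatmap(order, num_used, grid_size):
    """Map the first `num_used` patches of `order` to heat values on a square grid."""
    heat = [[0.0] * grid_size for _ in range(grid_size)]
    for rank, patch in enumerate(order[:num_used]):
        row, col = divmod(patch, grid_size)       # row-major patch index to grid cell
        heat[row][col] = 1.0 - rank / num_used    # earliest patch gets the highest value
    return heat


# Example: a 2 x 2 grid where the procedure stopped after two patches.
heatmap = order_to_heatmap(order=[3, 0, 1, 2], num_used=2, grid_size=2)
for row in heatmap:
    print(row)
```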
Heatmap visualization.
Figure 4(a) displays a heatmap for patch insertion. Compared to the existing methods, MoXI's heatmap highlights fewer regions while still identifying the class object. Interestingly, MoXI selects patches on the background as well as on the class object. This visualization indicates that both the object and the background are required for classification and demonstrates the usefulness of interactions.
Figure 4(b) displays a heatmap for patch deletion. The heatmaps generated by MoXI(-) and Grad-CAM display extensive highlights across the image, while MoXI, Attention rollout, and Shapley value show more concentrated highlights on the class object. This finding indicates that these latter methods accurately capture important information from the object. Notably, MoXI places less emphasis on the background than Attention rollout and Shapley value. This result suggests that MoXI effectively narrows down information by selectively deleting the class object, which could be advantageous for precise object localization.
Class-discriminative localization.
Localizing the regions relevant to a specific class helps in understanding the model's prediction process and improves interpretability. We have extended MoXI to analyze a target class that differs from the model's prediction. For the detailed visualization, see Appendix F. Figures 5(b) and 5(c) visualize important regions for two classes: the bull mastiff, as predicted by the model, and the tiger cat, the target class. The heatmaps reveal that MoXI highlights the bull mastiff's facial area and the tiger cat's face and body. These observations demonstrate that MoXI can identify important groups of image patches relevant to the predicted class and class-specific features important for decision-making.
5.3 Common corruption effect on patch deletion
We investigate the risk of model misclassification when image patches important for model accuracy are disrupted by adding noise. In the deletion curve experiment of Sec. 5.1, we used patch masking to simulate feature absence. Instead of patch masking, we consider common corruption [15]: fog and Gaussian noise at level 5 (for the other corruptions such as brightness and motion blur, see Appendix G.1). We apply these corruptions to image patches in the order selected for patch deletion in Sec. 5.1.
Figure 6(a) shows the effect of Gaussian noise on the deletion curve results. MoXI exhibits a significantly larger decrease in accuracy than the other methods, indicating that the model is vulnerable to Gaussian noise on the patches selected by MoXI. This result implies that MoXI efficiently identifies important patches. Figure 6(b) shows the fog corruption results, which are similar to those observed for Gaussian noise. Furthermore, as detailed in Appendix G.1, MoXI similarly affects accuracy with the other common corruptions. Additionally, we evaluate the effect of adversarial perturbations. Interestingly, adversarial perturbations yield distinct results due to their deceptive effect on the model's internal features (see Appendix G.2).
5.4 Consistent explainability
We examine the consistent explainability of visualization methods, regardless of the internal feature representation, which is a key aspect of explainable artificial intelligence. Specifically, we examine whether the models, trained with varying numbers of classification classes, consistently select important image patches. We evaluate the consistency using insertion and deletion curves for the models trained with datasets containing 10, 20, 100, and 1000 classes. For training the 10-class model, we select images from ImageNet that share labels with CIFAR10. For the models with 20, 100, and 1000 classes, we extend the 10-class dataset by adding images with randomly selected classes from ImageNet. We draw the insertion and deletion curves using the 10-class test images that are correctly classified.
Figures 7(a) and 7(b) show the insertion curve results for Attention rollout and MoXI, respectively. With Attention rollout, accuracy decreases as the number of classes increases. In contrast, with MoXI, accuracy does not decrease. Therefore, MoXI consistently selects important image patches for accurate classification. In addition, the results from other methods and deletion experiments are shown in Appendix H. We confirmed that MoXI provides consistent explainability in the deletion curve experiments.
6 Conclusion
This study addressed the problem of identifying a group of pixels that largely and collectively impact confidence scores in image classification models. We justify simple greedy algorithms from a game-theoretic view using Shapley values and interactions. This analysis naturally suggests the use of self-context and full-context variants of Shapley values and interactions. Their computation only requires a quadratic number of forward passes, whereas prior studies compute Shapley values and/or interactions with an exponential number of forward passes or heavy sampling-based approximation. The experimental results show that our method is more accurate in identifying the important image patches for models than popular methods.
Acknowledgments
This work was supported by JSPS KAKENHI Grant Numbers JP22H03658 and JP22K17962.
References
- Abnar and Zuidema [2020] Samira Abnar and Willem Zuidema. Quantifying attention flow in transformers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4190–4197, Online, 2020. Association for Computational Linguistics.
- Ancona et al. [2019] Marco Ancona, Cengiz Oztireli, and Markus Gross. Explaining deep neural networks with a polynomial time algorithm for shapley value approximation. In Proceedings of the 36th International Conference on Machine Learning, pages 272–281, Long Beach, California, USA, 2019. PMLR.
- Binder et al. [2016] Alexander Binder, Grégoire Montavon, Sebastian Lapuschkin, Klaus-Robert Müller, and Wojciech Samek. Layer-Wise Relevance Propagation for Neural Networks with Local Renormalization Layers, pages 63–71. Springer International Publishing, Cham, 2016.
- Castro et al. [2009] Javier Castro, Daniel Gómez, and Juan Tejada. Polynomial calculation of the shapley value based on sampling. Computers & Operations Research, 36(5):1726–1730, 2009. Selected papers presented at the Tenth International Symposium on Locational Decisions (ISOLDE X).
- Chattopadhay et al. [2018] Aditya Chattopadhay, Anirban Sarkar, Prantik Howlader, and Vineeth N Balasubramanian. Grad-CAM++: Generalized gradient-based visual explanations for deep convolutional networks. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision. IEEE, 2018.
- Chefer et al. [2021] Hila Chefer, Shir Gur, and Lior Wolf. Transformer interpretability beyond attention visualization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 782–791, 2021.
- Cheng et al. [2021] Xu Cheng, Chuntung Chu, Yi Zheng, Jie Ren, and Quanshi Zhang. A game-theoretic taxonomy of visual concepts in DNNs. arXiv preprint arXiv:2106.10938, 2021.
- Covert et al. [2023] Ian Connick Covert, Chanwoo Kim, and Su-In Lee. Learning to estimate shapley values with vision transformers. In The Eleventh International Conference on Learning Representations, 2023.
- Deng et al. [2022] Huiqi Deng, Qihan Ren, Hao Zhang, and Quanshi Zhang. Discovering and explaining the representation bottleneck of DNNs. In Proceedings of the International Conference on Learning Representations, 2022.
- Deng et al. [2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009.
- Dosovitskiy et al. [2021] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In Proceedings of the International Conference on Learning Representations, 2021.
- Goodfellow et al. [2015] Ian Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In Proceedings of the International Conference on Learning Representations, 2015.
- Grabisch and Roubens [1999] Michel Grabisch and Marc Roubens. An axiomatic approach to the concept of interaction among players in cooperative games. International Journal of Game Theory, 28:547–565, 1999.
- He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
- Hendrycks and Dietterich [2019] Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. In Proceedings of the International Conference on Learning Representations, 2019.
- Jethani et al. [2022] Neil Jethani, Mukund Sudarshan, Ian Connick Covert, Su-In Lee, and Rajesh Ranganath. FastSHAP: Real-time Shapley value estimation. In Proceedings of the International Conference on Learning Representations, 2022.
- Kurakin et al. [2017] Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial examples in the physical world. Proceedings of the International Conference on Learning Representations Workshop, 2017.
- Lundberg and Lee [2017] Scott M. Lundberg and Su-In Lee. A unified approach to interpreting model predictions. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 4768–4777, Red Hook, NY, USA, 2017. Curran Associates Inc.
- Madry et al. [2018] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In Proceedings of the International Conference on Learning Representations, 2018.
- Petsiuk et al. [2018] Vitali Petsiuk, Abir Das, and Kate Saenko. RISE: Randomized input sampling for explanation of black-box models. In Proceedings of the British Machine Vision Conference, 2018.
- Ren et al. [2021] Jie Ren, Die Zhang, Yisen Wang, Lu Chen, Zhanpeng Zhou, Yiting Chen, Xu Cheng, Xin Wang, Meng Zhou, Jie Shi, and Quanshi Zhang. Towards a unified game-theoretic view of adversarial perturbations and robustness. In Proceedings of the Advances in Neural Information Processing Systems, pages 3797–3810, 2021.
- Ren et al. [2022] Jie Ren, Zhanpeng Zhou, Qirui Chen, and Quanshi Zhang. Towards a game-theoretic view of baseline values in the Shapley value, 2022.
- Selvaraju et al. [2017] Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 618–626, 2017.
- Shapley [1953] Lloyd S. Shapley. A value for n-person games. In Contributions to the Theory of Games, pages 307–317, 1953.
- Sumiyasu et al. [2022] Kosuke Sumiyasu, Kazuhiko Kawamoto, and Hiroshi Kera. Game-theoretic understanding of misclassification, 2022.
- Touvron et al. [2021] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Herve Jegou. Training data-efficient image transformers & distillation through attention. In Proceedings of the 38th International Conference on Machine Learning, pages 10347–10357, 2021.
- Wang et al. [2020] Haofan Wang, Zifan Wang, Mengnan Du, Fan Yang, Zijian Zhang, Sirui Ding, Piotr Mardziel, and Xia Hu. Score-CAM: Score-weighted visual explanations for convolutional neural networks, 2020.
- Wang et al. [2021] Xin Wang, Jie Ren, Shuyun Lin, Xiangming Zhu, Yisen Wang, and Quanshi Zhang. A unified approach to interpreting and boosting adversarial transferability. In Proceedings of the International Conference on Learning Representations, 2021.
- Zhang et al. [2021a] Hao Zhang, Sen Li, YinChao Ma, Mingjie Li, Yichen Xie, and Quanshi Zhang. Interpreting and boosting dropout from a game-theoretic view. In Proceedings of the International Conference on Learning Representations, 2021a.
- Zhang et al. [2021b] Hao Zhang, Yichen Xie, Longjie Zheng, Die Zhang, and Quanshi Zhang. Interpreting multivariate Shapley interactions in DNNs. In The AAAI Conference on Artificial Intelligence, 2021b.
- Zhou et al. [2016] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2921–2929, 2016.
Supplementary Material
A Visual Set-Sum Task
We here describe the details of the experiment for the Visual Set-Sum task. The dataset consists of composite images, each composed of four MNIST images. Each composite image is labeled with the sum of the distinct digits it contains (see examples in Fig. 2(a)). The size of a composite image is 56x56, and the patch size is 28x28. As we sample the digits (i.e., MNIST images) uniformly, a composite image contains duplicate digits with a probability of roughly 0.5 ($1 - \frac{10 \cdot 9 \cdot 8 \cdot 7}{10^4} = 0.496$). In the test set, each composite image was designed to have its largest digit in two patches, which is the case where using interactions is most advantageous. We trained a ResNet-18 [14] model and evaluated it on a test set of size 10,000. The training loss combines two terms: the first is the cross-entropy loss, and the second is the mean-squared error between the model prediction and the true class. The second term adds a regression flavor and takes a lower value when the model prediction (i.e., the predicted set-sum) is closer to the label (i.e., the true set-sum). For computing Shapley values and interactions and for the accuracy evaluation, we masked patches by filling them with zeros.
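As a small illustration of the labeling rule, the following sketch computes a set-sum label and enumerates the chance that four digits contain a duplicate. It assumes that the four digit classes are drawn independently and uniformly from 0 to 9; the function name `set_sum_label` is hypothetical and not taken from the official code.

```python
from itertools import product


def set_sum_label(digits):
    """Label of a composite image: the sum of the distinct digits it contains."""
    return sum(set(digits))


# Assumption: the four digit classes are drawn independently and uniformly from 0..9.
# Enumerate all 10^4 combinations and count those with at least one repeated digit.
combos = list(product(range(10), repeat=4))
with_duplicate = sum(1 for d in combos if len(set(d)) < 4)
duplicate_probability = with_duplicate / len(combos)

example_label = set_sum_label([3, 3, 5, 1])  # the repeated 3 is counted once
print(example_label, duplicate_probability)
```

Under this assumption, about half of the composite images contain a repeated digit, which is the case where a per-patch attribution and the set-sum label can disagree.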
B Results of the Insertion/Deletion curve with additional models
In Sec. 5.1, we evaluated the proposed and baseline methods using ViT-T [11]. The insertion and deletion curves show that the proposed method provides the most efficient visual explanation. To demonstrate that this result generalizes across models and architectures, we provide results using both DeiT-T [26], a ViT architecture, and ResNet-18 [14], a widely used CNN model. For details of the experiment with DeiT-T, refer to Sec. 5.1. The insertion curve in Fig. 8(a) again shows that MoXI exhibits a sharper increase than the other methods. The deletion curve in Fig. 8(b) also shows that MoXI exhibits a sharper decrease than the other methods. Similarly, Fig. 9 shows that the results for ResNet-18 follow the same trend. These results indicate that our method can efficiently and accurately identify the critical patches in the model’s decision-making process.
C Comparison of algorithm complexity and runtime
In this section, we compare the algorithm complexity and runtime for each method.
First, we explain the complexity of each algorithm in terms of the number of forward passes required per image. Let $n$ be the number of patches in an image (typically, ). Grad-CAM needs $O(1)$ forward passes (and backward passes), Attention rollout needs $O(1)$ forward passes, and Shapley value needs $O(n 2^n)$ forward passes. MoXI needs $O(n^2)$ forward passes in the worst case. The numbers of passes for Shapley value and MoXI are given in Sec. 4; we elaborate on them here. As defined in Eq. (1), computing the Shapley value for the $i$-th pixel requires $O(2^n)$ passes due to the possible choices of the subset $S$ of the other pixels, leading to $O(n 2^n)$ passes for an entire image. On the other hand, MoXI needs $O(n^2)$ passes. For example, at the $k$-th step of the greedy insertion, it recruits a new patch from the remaining $n-k+1$ patches to maximize the confidence score (i.e., $n-k+1$ passes). For $n$ steps, it needs $\sum_{k=1}^{n} (n-k+1) = n(n+1)/2 = O(n^2)$ passes in total. Note that this is the worst-case scenario; the algorithm stops when the classification becomes correct, and Fig. 3(a) indicates that more than 90% of evaluation images require less than steps. In the runtime experiment, the median number of steps was 6 (with std 7.6) for insertion and 10 (with std 11.3) for deletion. A similar discussion holds for the deletion case.
Furthermore, our method can leverage parallel processing with mini-batches, leading to a linear number of forward passes at the cost of additional memory usage. Specifically, the $k$-th step of MoXI can be done by a single forward pass over a mini-batch of the $n-k+1$ insertion patterns, one for each remaining patch. Our implementation is based on this parallelization.
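The pass counting above can be illustrated with a simplified greedy loop. This is only a sketch of the counting argument, not the official MoXI implementation: `greedy_insertion`, `confidence`, and `is_correct` are hypothetical names, and the toy confidence function (a sum of per-patch weights) stands in for a model forward pass.

```python
def greedy_insertion(n_patches, confidence, is_correct=None):
    """Greedily insert the patch that maximizes `confidence`, counting evaluations.

    `confidence(patches)` stands in for one forward pass on an image in which
    only `patches` are visible (hypothetical interface).
    """
    selected, remaining, n_passes = [], list(range(n_patches)), 0
    while remaining:
        scores = []
        for p in remaining:  # one pass per remaining candidate patch
            scores.append(confidence(selected + [p]))
            n_passes += 1
        best = remaining[max(range(len(remaining)), key=scores.__getitem__)]
        selected.append(best)
        remaining.remove(best)
        if is_correct is not None and is_correct(selected):
            break  # early stop once the classification is correct
    return selected, n_passes


# Toy stand-in for a model: confidence is the sum of fixed per-patch weights.
weights = [0.1, 0.5, 0.2, 0.9, 0.3]
toy_confidence = lambda patches: sum(weights[p] for p in patches)

order_full, passes_full = greedy_insertion(5, toy_confidence)
order_early, passes_early = greedy_insertion(
    5, toy_confidence, is_correct=lambda s: toy_confidence(s) >= 1.0
)
print(order_full, passes_full, order_early, passes_early)
```

Without early stopping, the loop evaluates 5 + 4 + 3 + 2 + 1 = 15 candidates for five patches, matching the quadratic worst case; with the stopping rule it ends after two steps. Batching the inner loop into one mini-batch per step would reduce the number of model calls to one per step.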
Next, we compare the runtime required for measuring the importance in each method. The comparison is based on the average runtime across 100 images, following the experimental setup described in Sec. 5. For Grad-CAM, Attention rollout, and Shapley value, the runtime represents the duration required to compute the importance of the entire image. In contrast, in the case of MoXI, we separately measure the runtime for pixel insertion until successful classification and for pixel deletion until classification failure. Our experiments were conducted using a machine equipped with a 12-core processor, 64GB RAM, and an NVIDIA RTX 3090.
The runtime for each method is shown in Table 1. Based on these values, MoXI is roughly 30 times faster than Shapley value for insertion and roughly 13 times faster for deletion. Recalling the results from Fig. 3, MoXI also captures an important group of patches more accurately than Shapley value does. Therefore, MoXI surpasses Shapley value in both accuracy and runtime. Although MoXI is not as fast as Grad-CAM and Attention rollout, we consider its runtime acceptable for most visualization use cases, and our extensive experiments show that its explanations are of higher quality.
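Table 1: Runtime of each method, averaged over 100 images. For MoXI, insertion and deletion are reported separately (Ins/Del).
| Grad-CAM | Attention rollout | Shapley value | MoXI (Ins/Del) |
| 0.15 | 0.02 | 17.9 | 0.60/1.34 |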
D Analysis of effective layers to remove patches
In Sec. 5.1, we considered the absence of players (i.e., pixels/patches) in the input space when calculating Shapley values and interactions. Specifically, the patches are removed after the input embedding layer. Here, we examine the case where patches are instead masked in several self-attention layers. To this end, we utilize a variant of the attention-masking approach used in [8]. Specifically, let the $l$-th layer be our target layer. Then, a large negative value is added to the product of the query and key matrices in every self-attention layer from the $l$-th to the last. Figure 10 displays the insertion curves when MoXI is applied to various target layers. The experimental setup is the same as in Sec. 5.1. The results demonstrate that MoXI works better when patches are masked from earlier layers, where it better pinpoints the important features of images.
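The masking operation can be sketched for a single attention head as follows. This is a minimal NumPy illustration under our own assumptions (single head, no class token, the hypothetical function name `masked_attention`), not the implementation used in the experiments; in a full model, the same mask would be applied in every self-attention layer from the target layer onward.

```python
import numpy as np


def masked_attention(q, k, v, removed, neg=-1e9):
    """Single-head self-attention in which the `removed` tokens are hidden as keys."""
    logits = q @ k.T / np.sqrt(q.shape[-1])
    logits[:, removed] += neg  # large negative value -> (numerically) zero weight
    logits -= logits.max(axis=-1, keepdims=True)  # stabilize the softmax
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(3, 4, 8))  # 4 tokens, feature dimension 8
removed_tokens = [1, 3]
out, attn = masked_attention(q, k, v, removed_tokens)
print(attn.round(3))
```

After the softmax, no token attends to the removed patches, so their content no longer influences the outputs of the masked layers.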
E Additional results of visualization
We provide additional visualization results in Figs. 11 and 12. As in Sec. 5.2, the results demonstrate that the patches highlighted by MoXI are smaller than those highlighted by other methods.
We observed that MoXI is slightly unstable in the insertion case. Recall that in this case, MoXI appends important patches to an empty set one by one and terminates when the model gives the correct classification. Empirically, the termination can happen at a very early stage, where the confidence score of the correct class is the largest but still very low. If we continue to insert patches, the model prediction can fluctuate among several classes. Note that this does not cause a big problem in most cases; all the insertion curves in this paper consistently show a monotonic increase of classification accuracy as the insertion rate increases. If needed, one can introduce a minimum confidence score and terminate the insertion only when the classification is correct and the confidence score exceeds this threshold. We include this hyperparameter in our official implementation of MoXI.
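The thresholded stopping rule can be written as a short check. The sketch below is only illustrative: the function name `should_stop_insertion` and the parameter name `min_confidence` are hypothetical and may differ from the names used in the official implementation.

```python
def should_stop_insertion(probs, label, min_confidence=0.0):
    """Stop when the prediction is correct AND its confidence exceeds the threshold."""
    predicted = max(range(len(probs)), key=probs.__getitem__)
    return predicted == label and probs[label] > min_confidence


# Correct class has the largest, but still low, confidence.
early_probs = [0.30, 0.25, 0.25, 0.20]
stop_without_threshold = should_stop_insertion(early_probs, label=0)
stop_with_threshold = should_stop_insertion(early_probs, label=0, min_confidence=0.5)
print(stop_without_threshold, stop_with_threshold)
```

With the default threshold the insertion would terminate at this early, low-confidence stage, whereas a threshold of 0.5 keeps it running until the prediction is more decisive.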
F Class-discriminative localization
The proposed method was originally designed to identify the pixels that are important for explaining the model prediction. Here, we generalize MoXI (for pixel deletion) to visualize such pixels for a given target class, which is used in Fig. 5. To this end, we switch the reward function as follows. Let $x$, $y$, and $\hat{y}$ be the input image, the target label, and the predicted label, respectively. If $y = \hat{y}$, we simply use the reward function defined on the confidence score of class $y$. Otherwise, we use a reward function that contrasts the two classes, which helps us identify patches with a positive effect on the confidence score of one class and a negative effect on that of the other. The image patches removed in the former case are collected as the important patches for class $y$.
G Patch perturbations
In Sec. 5.3, we evaluated the effectiveness of each method by measuring the classification accuracy when Gaussian and fog noise were applied to the image patches identified as important. The deletion curves here are plotted by perturbing patches rather than removing them. We present experimental results on common corruptions and adversarial perturbations.
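Perturbing, rather than removing, the selected patches can be sketched as follows. This is a minimal NumPy illustration with Gaussian noise, assuming an image with values in [0, 1] and patches indexed in row-major order; the function name `perturb_patches` and the noise level are hypothetical choices, not the settings used in the experiments.

```python
import numpy as np


def perturb_patches(image, patch_ids, patch_size, sigma, rng):
    """Add Gaussian noise to the listed patches of an (H, W, C) image in [0, 1]."""
    out = image.copy()
    n_cols = image.shape[1] // patch_size
    for pid in patch_ids:
        r, c = divmod(pid, n_cols)  # assumption: row-major patch indexing
        rows = slice(r * patch_size, (r + 1) * patch_size)
        cols = slice(c * patch_size, (c + 1) * patch_size)
        noise = rng.normal(0.0, sigma, size=out[rows, cols].shape)
        out[rows, cols] = np.clip(out[rows, cols] + noise, 0.0, 1.0)
    return out


rng = np.random.default_rng(0)
image = np.full((64, 64, 3), 0.5)
# Perturb patch 0 (top-left) and patch 5 (second row, second column) of a 4x4 grid.
perturbed = perturb_patches(image, [0, 5], patch_size=16, sigma=0.1, rng=rng)
print(np.abs(perturbed - image).max())
```

A deletion curve in this setting is obtained by perturbing a growing prefix of the importance-ordered patches and recording the classification accuracy at each step.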
G.1 Common corruptions
We implemented 19 types of common corruptions using the imagecorruptions module with severity 5 (https://github.com/hendrycks/robustness). Figures 13 and 14 show the deletion curves with different corruptions for ViT-T and DeiT-T, respectively. The results demonstrate that our method gives a sharper decrease at the early stage of the deletion curves than the other methods, as in Sec. 5.3.
G.2 Adversarial perturbations
Besides common corruptions, we also investigated the case with adversarial perturbations [12, 17, 19], which are small but malicious perturbations that can largely change the model’s output. We conducted the same experiment as in Sec. G.1 but with adversarial perturbations instead of common corruptions. To obtain adversarial perturbations, we adopted -untargeted PGD with and stepsize . Figures 15(a) and 16(a) present the deletion curves for ViT-T and DeiT-T, respectively. The results show that the attention rollout method gives a slightly sharper decrease than MoXI. This differs from the results for common corruptions. We suspect that adversarial perturbations mostly lie in the patches that attention rollout suggests as important. To confirm this, we measured the magnitude of the adversarial perturbation on each image patch, using the L2 norm. Figure 16(b) shows the magnitude of the perturbation on each patch. The patches are ordered as in the deletion curves in Fig. 16(a). The results indicate that the importance of image patches identified by attention rollout is well aligned with the magnitude of the perturbation on them. On the other hand, the image patches identified by MoXI carry larger perturbations at the early and late stages than at the middle stage. This may be because attention rollout directly reflects the internal computation of the features when measuring the contributions of image patches, while adversarial perturbations are designed to exploit this process. In contrast, MoXI treats a Vision Transformer as a black-box model and is unaware of the internal process.
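The per-patch perturbation magnitude can be computed as in the following sketch. It assumes that the perturbation is available as an (H, W, C) array and that patches are indexed in row-major order; the function name `patch_l2_norms` is hypothetical.

```python
import numpy as np


def patch_l2_norms(delta, patch_size):
    """L2 norm of a perturbation `delta` (H, W, C) on each non-overlapping patch."""
    h, w, c = delta.shape
    gh, gw = h // patch_size, w // patch_size
    blocks = delta[: gh * patch_size, : gw * patch_size].reshape(
        gh, patch_size, gw, patch_size, c
    )
    # Sum squared entries within each patch, then flatten the grid in row-major order.
    return np.sqrt((blocks ** 2).sum(axis=(1, 3, 4))).reshape(-1)


delta = np.zeros((32, 32, 3))
delta[0:16, 16:32] = 1.0  # perturb only the top-right patch of a 2x2 grid
norms = patch_l2_norms(delta, patch_size=16)
print(norms)
```

Reordering `norms` by the importance ranking of each method gives a per-method profile of how much perturbation falls on the patches it considers important.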
H More results on the stability of explanations
In Sec. 5.4, we evaluated the stability of the explanations of MoXI and attention rollout with respect to the number of classes. Here, we consider both insertion and deletion metrics for Grad-CAM, attention rollout, Shapley value, and MoXI. Figure 17 shows the insertion and deletion curves. The results again show that MoXI maintains relatively stable accuracy when the model is trained on more classes. In contrast, the other methods show a significant decrease in classification accuracy in such scenarios. Therefore, MoXI identifies important image patches more consistently than the other methods.