Identifying Important Group of Pixels using Interactions

Kosuke Sumiyasu, Kazuhiko Kawamoto, Hiroshi Kera

Graduate School of Science and Engineering, Chiba University (email: [email protected])
Graduate School of Engineering, Chiba University (email: [email protected], [email protected])
Corresponding Author.

Abstract

To better understand the behavior of image classifiers, it is useful to visualize the contribution of individual pixels to the model prediction. In this study, we propose a method, MoXI (Model eXplanation by Interactions), that efficiently and accurately identifies a group of pixels with high prediction confidence. The proposed method employs the game-theoretic concepts of Shapley values and interactions, taking into account both the effects of individual pixels and the cooperative influence of pixels on model confidence. Theoretical analysis and experiments demonstrate that our method identifies the pixels that contribute most to the model outputs more accurately than the widely used visualizations by Grad-CAM, Attention rollout, and Shapley values. While prior studies have suffered from the exponential cost of computing Shapley values and interactions, we show that this can be reduced to a quadratic cost for our task. The code is available at https://github.com/KosukeSumiyasu/MoXI.

1 Introduction

Visualization of important image pixels has been widely used to understand machine learning models in computer vision tasks such as image classification [31, 23, 1, 20, 18, 3]. To this end, visualization methods compute the contribution of each pixel to model decisions. For example, Grad-CAM [23] measures the contribution using a weighted sum of the feature maps of convolutional layers, where weights are determined by the gradient of confidence score for any target class with respect to the feature map entries. Attention rollout [1] measures it based on the attention weight of encoders of a Vision Transformer.

Several recent studies revealed that a game-theoretic concept, Shapley values [24], is a powerful indicator of pixel contribution [18, 16, 8]. In multi-player games, Shapley values measure the contribution of each player as the average change in the total game reward between the player's presence and absence. When applied to an image classifier, the pixels of an image are the players, which work cooperatively for the model output (e.g., confidence score). Unlike Grad-CAM and Attention rollout, Shapley values compute the contribution of pixels to the model output more directly. The former methods use feature maps or attention weights, the magnitudes of whose entries are not necessarily well aligned with their contributions to the model output, whereas the latter uses logits or confidence scores. Indeed, Fig. 1 shows that the pixels with high Shapley values have a significantly larger impact on confidence scores than those determined by Grad-CAM or Attention rollout in both (a) the insertion case and (b) the deletion case.

A crucial caveat of the aforementioned methods is that they identify a group of important pixels by the individual contribution of each pixel and overlook the collective contribution of multiple pixels. For example, Fig. 1(a) shows that the three methods only highlight the class object (i.e., duck) and do not indicate the background (i.e., sea) as an informative factor. However, the set of pixels with the highest contributions (e.g., highest Shapley values) is not necessarily the most informative pixel set as a whole because the information overlap among pixels is not considered. Indeed, the bottom row of Fig. 1(a) shows that the class object and background act in synergy and greatly impact the confidence score.

Figure 1: Examples of image patches with high contributions to the output of ViT-T. (a) Starting from an empty image, image patches are inserted according to their contribution measured by each method. (b) Starting from an original image, image patches are removed according to their contribution measured by each method. The heatmaps highlight the image patches inserted (deleted) to obtain the correct (incorrect) classification. The selected patches are colored according to the timing of insertion/deletion. For insertion, only the proposed method selects patches from the background. For deletion, the proposed method highlights the class object only. For both cases, the proposed method highlights the least number of patches while achieving the highest/lowest confidence score.

In this paper, we propose an efficient game-theoretic visualization method of image pixels with a high impact on the prediction of an image classifier. Besides Shapley values, we exploit interactions, a game-theoretic concept that reflects the average effect of the cooperation of pixels. Namely, unlike prior methods, including Grad-CAM, Attention rollout, and Shapley values, the proposed method takes into account the cooperative contribution of pixels and identifies the image pixels as a whole. In Fig. 1(a), the proposed method identifies a pixel set on which the classifier puts high classification confidence. Similarly, in Fig. 1(b), it identifies a minimal pixel set without which the classification fails. Notably, we define self-context variants of Shapley values and interactions and reduce the number of forward passes from exponential to quadratic, which resolves a fundamental obstacle to using game-theoretic approaches as handy tools for model explanation.

In the experiments, we consider the insertion curve and deletion curve on a subset of ImageNet images that are correctly classified by a pretrained classifier. Starting from fully masked images, an insertion curve plots the increase in classification accuracy as image patches are unmasked in descending order of the contribution determined by each method. Similarly, a deletion curve plots the accuracy decrease from the clean images to fully masked ones. The results show that the proposed method gives sharp insertion/deletion curves. For example, the classification accuracy reached 90% on images with only 4% of patches unmasked when the patches were selected by the proposed method, significantly outperforming Grad-CAM (accuracy of 2%), Attention rollout (accuracy of 4%), and Shapley values (accuracy of 25%). Similar results are observed for the deletion curves and also when we use common corruptions [15] instead of masking. Qualitatively, the heatmaps using the patches selected in the early stage of the insertion curve show that our method highlights both a class object and background, while the other methods mostly highlight the class object only. Meanwhile, in the heatmaps from the deletion curves, our method particularly highlights the class-discriminative region of the object, while the others do not.

Our contributions are summarized as follows:

  • We propose an efficient game-theoretic visualization method, named MoXI (Model eXplanation by Interactions), for a group of pixels that significantly influences the classification.

  • Our analysis supports a simple greedy strategy from a game-theoretic perspective, leading us to use self-context variants of Shapley values and interactions, which can be computed exponentially faster than computing the original ones.

  • Extensive experiments show that our method more accurately identifies the pixels that are highly contributing to the model outputs than standard visualization methods.

2 Related Work

Visual explanation of model decision.

Various methods have been proposed to understand deep learning models for vision tasks by quantifying and visualizing the contribution of image pixels to the model output [31, 23, 1, 20, 18, 3, 5, 27, 6]. The contribution of pixels has been typically measured using feature maps in models. For example, Grad-CAM [23] determines the contribution by applying weights to the feature maps of the convolutional layers of a CNN using gradients. Attention rollout [1], commonly used for Vision Transformers, calculates the contributions using attention maps. Several methods instead calculate the contribution of each pixel by analyzing the sensitivity of the confidence score with respect to each pixel [20, 18, 16, 8]. For example, RISE (Randomized Input Sampling for Explanation; [20]) calculates the contributions empirically by probing the model with randomly masked images of the input image and obtaining the corresponding confidence scores. SHAP (SHapley Additive exPlanations; [18]) distributes confidence scores fairly to contributions by leveraging Shapley values from game theory. Importantly, the aforementioned methods all measure the contribution of each pixel independently; the collection of important pixels consists of the pixels with high contributions. In contrast, this study identifies the important pixels by further taking into account the collective contributions of pixels.

Game-theoretic approach of model explanation.

Several recent studies have utilized a game-theoretic concept, interactions, to analyze various phenomena of deep learning models and quantify an effect of pixel cooperation on the model inference [7, 9, 21, 28, 29, 25]. Wang et al. [28] showed that the transferability of adversarial images has a negative correlation to the interactions. Zhang et al. [29] showed the similarity between the computation of interactions and dropout regularization. Deng et al. [9] discussed the difference in information obtained between humans and machine learning models using interactions. Sumiyasu et al. [25] investigated misclassification by models using interactions and discovered that the distribution of interactions varies with the type of misclassified images. Thus, interactions are helpful for understanding the model from the perspective of cooperative relationships between pixels. A critical issue of interaction-based analysis is its computational cost; the computation of interaction requires an exponential number of forward passes with respect to the number of pixels. In this paper, we propose an efficient approach to explain a model using variants of interactions (and also Shapley values), achieving the identification of important pixels with only a quadratic number of forward passes.

3 Preliminaries

Shapley values.

Shapley values were proposed in game theory to measure the contribution of each player to the total reward that is obtained from multiple players working cooperatively [24]. Let $N=\{1,\ldots,n\}$ be the index set of players, and let $2^{N}\stackrel{\rm def}{=}\{S \mid S\subseteq N\}$ be its power set. Given a reward function $f:2^{N}\to\mathbb{R}$, the Shapley value $\phi(i\,|\,N)$ of player $i$ with a context $N$ is defined as follows.

$$\phi(i\,|\,N)\stackrel{\rm def}{=}\sum_{S\subseteq N\setminus\{i\}}P(S\,|\,N\setminus\{i\})\,[f(S\cup\{i\})-f(S)], \tag{1}$$

where $P(A\,|\,B)=\frac{(|B|-|A|)!\,|A|!}{(|B|+1)!}$. Here, $|\cdot|$ denotes the cardinality of a set. Namely, the Shapley value $\phi(i\,|\,N)$ averages, over all $S\subseteq N\setminus\{i\}$, the reward increase caused by the participation of player $i$ in the player set $S$.
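As a concrete illustration of Eq. (1), the following minimal sketch enumerates every coalition for a tiny game. The reward `toy_reward` and the three-player setup are hypothetical and chosen only so that the values can be checked by hand; the enumeration is exponential in the number of players and is not meant for real images.

```python
from itertools import combinations
from math import factorial


def shapley_value(f, i, players):
    """Exact Shapley value phi(i | N) of Eq. (1), enumerating all S that exclude i."""
    others = [p for p in players if p != i]
    m = len(others)  # |B| with B = N without i
    total = 0.0
    for size in range(m + 1):
        # P(A | B) = (|B| - |A|)! |A|! / (|B| + 1)!  with |A| = size
        weight = factorial(m - size) * factorial(size) / factorial(m + 1)
        for subset in combinations(others, size):
            s = frozenset(subset)
            total += weight * (f(s | {i}) - f(s))
    return total


# Hypothetical toy reward: players 0 and 1 carry the same information
# (worth 1 in total), while player 2 carries independent information (worth 2).
def toy_reward(s):
    return (1.0 if (0 in s or 1 in s) else 0.0) + (2.0 if 2 in s else 0.0)


players = [0, 1, 2]
values = [shapley_value(toy_reward, i, players) for i in players]
```

In this toy game the two redundant players each receive 0.5 and the independent player receives 2.0, so the values sum to the reward gap between the full and the empty coalition.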

Interactions.

Interactions measure the contribution of the cooperation between two players to the total reward [13]. The interaction $I(i,j\,|\,N)$ between players $i$ and $j$ is defined as follows.

$$I(i,j\,|\,N)\stackrel{\rm def}{=}\phi(S_{ij}\,|\,N^{\prime})-\phi(i\,|\,N\setminus\{j\})-\phi(j\,|\,N\setminus\{i\}), \tag{2}$$

where two players $i,j\in N$ are regarded as a single player $S_{ij}=\{i,j\}$ and $N^{\prime}=N\setminus\{i,j\}\cup\{S_{ij}\}$ (i.e., $|N^{\prime}|=n-1$). In Eq. (2), the first term corresponds to the joint contribution from players $(i,j)$, and the second and the third terms correspond to the individual contributions of players $i$ and $j$, respectively. Namely, interactions quantify the average cooperative effect on the reward when two players join simultaneously. Importantly, we have $I(i,i\,|\,N)=-\phi(i\,|\,N)$.
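The three Shapley values in Eq. (2) all average over the same coalitions $S\subseteq N\setminus\{i,j\}$ with the same weight $P(S\,|\,N\setminus\{i,j\})$, so for distinct players they can be evaluated in a single loop. The sketch below does this by brute force; the function name `interaction` and the reward `toy_reward` are illustrative assumptions, not part of the original method's code.

```python
from itertools import combinations
from math import factorial


def interaction(f, i, j, players):
    """Interaction I(i, j | N) of Eq. (2) for two distinct players i and j."""
    rest = [p for p in players if p not in (i, j)]
    m = len(rest)
    total = 0.0
    for size in range(m + 1):
        # Same weight P(S | N without {i, j}) for all three terms of Eq. (2).
        weight = factorial(m - size) * factorial(size) / factorial(m + 1)
        for subset in combinations(rest, size):
            s = frozenset(subset)
            joint = f(s | {i, j}) - f(s)   # {i, j} joins as one merged player
            alone_i = f(s | {i}) - f(s)    # i joins without j
            alone_j = f(s | {j}) - f(s)    # j joins without i
            total += weight * (joint - alone_i - alone_j)
    return total


# Hypothetical toy reward: players 0 and 1 are redundant, player 2 is independent.
def toy_reward(s):
    return (1.0 if (0 in s or 1 in s) else 0.0) + (2.0 if 2 in s else 0.0)


players = [0, 1, 2]
i_redundant = interaction(toy_reward, 0, 1, players)
i_independent = interaction(toy_reward, 0, 2, players)
```

Here the redundant pair gets an interaction of -1.0, while the pair with non-overlapping information gets 0.0, which matches the reading of interactions as a measure of information overlap.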

Application to image classifiers.

In the application of Shapley values and interactions to image classifiers, an image $x$ with $n$ pixels is regarded as the index set $N=\{1,\ldots,n\}$ of players. Typically, the reward function $f$ is defined by $f(x)=\log\frac{P(y\,|\,x)}{1-P(y\,|\,x)}$ [9], where $y$ represents the class of $x$, and $P(y\,|\,x)$ denotes the classifier's confidence score on class $y$ with input $x$. The reward $f(S)$ of a subset of pixels $S\in 2^{N}$ of image $x$ is similarly computed by feeding a partially masked $x$ to the classifier (i.e., the pixels in $N\setminus S$ are masked).
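The log-odds reward can be written in a few lines. This is a minimal sketch; the clipping constant `eps` is an assumption added here for numerical safety and is not specified in the text.

```python
import math


def log_odds_reward(confidence, eps=1e-7):
    """Reward f = log(P / (1 - P)) for a confidence score P in [0, 1]."""
    # Clipping is an assumption of this sketch: it avoids log(0) and
    # division by zero when the classifier outputs exactly 0 or 1.
    p = min(max(confidence, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))
```

A confidence of 0.5 maps to a reward of zero, and the reward grows without an upper bound as the confidence approaches 1, unlike the raw confidence score.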

If the classifier is a convolutional neural network (CNN), the masked region is conventionally filled with some base value, such as 0 or the average pixel value [2, 30]. Such a replacement may not only drop the original information of an image but also inject a new feature. Thus, the choice of base value affects the Shapley values and interactions. In contrast, when a Vision Transformer is used, one can realize masking strictly by applying a mask to the attention. To our knowledge, most prior studies exploited Shapley values and interactions on CNNs with the base value replacement, which might not unleash the full potential of these quantities. The only exception we are aware of is [8], which demonstrated that Shapley values can be calculated more accurately using attention masking. We follow this strategy in the computation of Shapley values and interactions for Vision Transformers.

4 Method

We address the problem of identifying in a given image a set of pixels that significantly influence the confidence score of a classifier. While prior studies solve this by explicitly or implicitly measuring the independent contribution of each pixel to the confidence score, the proposed method takes into account the collective contribution of pixels using interactions. We refer to the proposed method as MoXI (Model eXplanation by Interactions).

We consider two approaches to measuring the contribution of pixels to the confidence score: (i) pixel insertion and (ii) pixel deletion. The former measures the contribution of a pixel by the confidence gain when it is unmasked as in Eqs. (1) and (2), while the latter measures it by the confidence drop when it is masked.

4.1 Pixel Insertion

Problem 1

Let $N$ be the index set of all pixels of image $x$. Let $f:2^{N}\to[0,1]$ be a function that gives the confidence score on the class for a given index set, with the convention that pixels not included in the index set are masked. Find a subset $S_{k}\subset N$ such that

$$S_{k}=\underset{S\subseteq N,\,|S|=k}{\mathrm{arg\,max}}\ f(S), \tag{3}$$

for $k=1,2,\ldots,|N|$.

By its formulation, this problem is NP-hard in general. In particular, $f$ here is a CNN or Vision Transformer, a highly nonlinear function (see Footnote 1). Thus, we resort to a greedy strategy to solve it approximately.

Footnote 1: With this assumption, we use a slight abuse of notation and assume, e.g., $f(\{a,\{b,c\}\})=f(\{a,b,c\})$ because in either case of $\{a,\{b,c\}\}$ or $\{a,b,c\}$, we input the image with pixels $a,b,c$ to the model.

For $k=1$, the index $b_{1}\in N$ of the pixel with the highest Shapley value $\phi(b_{1}\,|\,\{b_{1}\})$ gives the optimal set $S_{1}=\{b_{1}\}$ by definition. For $k=2$, we select as the next pixel $b_{2}$ the one maximizing $f(\{b_{1},b_{2}\})$. Importantly, this is equivalent to maximizing the sum of the Shapley value and interaction, not the Shapley value alone.

$$\begin{aligned}
b_{2}&=\underset{b\in N\setminus\{b_{1}\}}{\mathrm{arg\,max}}\ f(\{b_{1},b\})-f(\varnothing)\\
&=\underset{b\in N\setminus\{b_{1}\}}{\mathrm{arg\,max}}\ \phi(\{b_{1},b\}\,|\,\{\{b_{1},b\}\})\\
&=\underset{b\in N\setminus\{b_{1}\}}{\mathrm{arg\,max}}\ \phi(b\,|\,\{b\})+I(b_{1},b\,|\,\{b_{1},b\})\\
&=\underset{b\in N\setminus\{b_{1}\}}{\mathrm{arg\,max}}\ \phi^{(0)}(b)+I^{(0)}(b_{1},b),
\end{aligned} \tag{4}$$

where

$$\phi^{(0)}(a)\stackrel{\rm def}{=}\phi(a\,|\,\{a\})=f(a)-f(\varnothing), \tag{5}$$

$$\begin{aligned}
I^{(0)}(a_{1},a_{2})&\stackrel{\rm def}{=}I(a_{1},a_{2}\,|\,\{a_{1},a_{2}\})\\
&=f(a_{1}\cup a_{2})-f(a_{1})-f(a_{2})+f(\varnothing).
\end{aligned} \tag{6}$$
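The self-context quantities in Eqs. (5) and (6) need only a handful of reward evaluations each. The sketch below implements them and applies the selection rule of Eq. (4) to a tiny example; the `FEATURES` table and `toy_reward` are hypothetical stand-ins for a classifier and count how many distinct features the unmasked pixels cover.

```python
def phi0(f, a):
    """Self-context Shapley value, Eq. (5): f(a) - f(empty set)."""
    return f(frozenset(a)) - f(frozenset())


def interaction0(f, a1, a2):
    """Self-context interaction, Eq. (6)."""
    a1, a2 = frozenset(a1), frozenset(a2)
    return f(a1 | a2) - f(a1) - f(a2) + f(frozenset())


# Hypothetical feature table: pixel 1 repeats information already in pixel 0,
# while pixel 2 contributes a new feature.
FEATURES = {0: {"head", "wing"}, 1: {"head"}, 2: {"sea"}}


def toy_reward(pixels):
    covered = set()
    for p in pixels:
        covered |= FEATURES[p]
    return float(len(covered))


b1 = 0  # pixel with the highest self-context Shapley value
scores = {b: phi0(toy_reward, {b}) + interaction0(toy_reward, {b1}, {b})
          for b in (1, 2)}
b2 = max(scores, key=scores.get)
```

Pixels 1 and 2 have the same self-context Shapley value, but pixel 1 has a negative interaction with pixel 0, so the rule of Eq. (4) selects pixel 2.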

We refer to such a particular form of Shapley values and interactions as self-context in the pixel insertion approach, and they play an essential role in our framework. For $k\geq 3$, we can similarly show that maximizing $f(S_{k-1}\cup\{b_{k}\})$ with respect to $b_{k}$ is equivalent to

$$b_{k}=\underset{b\in N\setminus S_{k-1}}{\mathrm{arg\,max}}\ \phi^{(0)}(b)+I^{(0)}(S_{k-1},b). \tag{7}$$

Equation (7) shows that, for identifying the index $b_{k}$ for $S_{k}$, it is crucial to consider the interaction between $S_{k-1}$ and $b_{k}$. Even when a pixel indexed $b$ has a large Shapley value (the first term), it may have a large negative interaction (the second term) if its pixel information overlaps with that of $S_{k-1}$. Namely, collecting pixels with large Shapley values does not necessarily give the most informative pixel set.
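The equivalence behind Eq. (7) follows from substituting Eqs. (5) and (6), with $S_{k-1}$ treated as a single player as in Footnote 1. The short derivation below spells out this step:

```latex
\begin{align*}
\phi^{(0)}(b) + I^{(0)}(S_{k-1}, b)
  &= \bigl[f(b) - f(\varnothing)\bigr]
   + \bigl[f(S_{k-1} \cup \{b\}) - f(S_{k-1}) - f(b) + f(\varnothing)\bigr] \\
  &= f(S_{k-1} \cup \{b\}) - f(S_{k-1}).
\end{align*}
```

Since $f(S_{k-1})$ does not depend on $b$, maximizing the left-hand side over $b$ is the same as maximizing $f(S_{k-1}\cup\{b\})$.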

To summarize, our analysis justifies a very simple greedy algorithm (Algorithm 1) from a game-theoretic perspective. The algorithm seems trivial in hindsight, but prior studies visualize highly contributing pixels using only Shapley values [18, 16, 8].

Algorithm 1 Identification of a group of pixels in the pixel insertion approach
Require: reward function $f$, index set $N$ of image pixels.
Ensure: sequence of subsets $S_{1},\ldots,S_{|N|}\subset N$
  $S_{k}\leftarrow\{\}$ for all $k=0,\ldots,|N|$
  for $k=1,\ldots,|N|$ do
    $b_{k}\leftarrow\underset{b\in N\setminus S_{k-1}}{\mathrm{arg\,max}}\ f(S_{k-1}\cup\{b\})$
    $S_{k}\leftarrow S_{k-1}\cup\{b_{k}\}$
  end for
  return $S_{1},\ldots,S_{|N|}$
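Algorithm 1 can be sketched in a few lines of Python. This is an illustrative version that assumes the reward `f` accepts a `frozenset` of pixel (or patch) indices; the `FEATURES` table and `toy_reward` are hypothetical stand-ins for a masked forward pass of a classifier.

```python
def greedy_insertion(f, pixels):
    """Greedy pixel insertion: grow S_k by the pixel b maximizing f(S_{k-1} with b)."""
    selected = frozenset()
    remaining = list(pixels)
    subsets = []
    while remaining:
        # One reward evaluation (forward pass) per remaining candidate.
        best = max(remaining, key=lambda b: f(selected | {b}))
        selected = selected | {best}
        remaining.remove(best)
        subsets.append(selected)
    return subsets  # [S_1, ..., S_|N|]


# Hypothetical reward: number of distinct features covered by unmasked pixels.
FEATURES = {0: {"head", "wing"}, 1: {"head"}, 2: {"sea"}}


def toy_reward(pixels):
    covered = set()
    for p in pixels:
        covered |= FEATURES[p]
    return float(len(covered))


subsets = greedy_insertion(toy_reward, [0, 1, 2])
```

On this toy reward the procedure first picks pixel 0, then pixel 2 (new information), and only then pixel 1, whose information is already covered.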

Computational cost.

The identification of important pixels (or patches, in practice) using Shapley values requires $\mathcal{O}(|N|2^{|N|})$ forward passes because of the average over all $S\subseteq N\setminus\{i\}$ for all $i\in N$ (cf. Eq. (1)). In contrast, our approach only requires $\mathcal{O}(|N|^{2})$ forward passes in the worst case (see Appendix C for details of the algorithm complexity and runtime).
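The gap between the two costs can be made concrete by counting reward evaluations. The sketch below counts candidate evaluations of the greedy procedure and of a naive, uncached evaluation of Eq. (1); the patch count of 196 is an assumption (a 14 x 14 grid) used only for illustration and is not stated in this section.

```python
def greedy_forward_passes(n):
    """Candidate evaluations of the greedy procedure: n + (n - 1) + ... + 1."""
    return sum(n - k + 1 for k in range(1, n + 1))


def naive_shapley_forward_passes(n):
    """Naive evaluation of Eq. (1) for every player.

    Each of the n players needs 2^(n-1) coalitions, and each coalition needs
    two reward evaluations (with and without the player), without caching.
    """
    return n * 2 * 2 ** (n - 1)


n_patches = 196  # assumption: 14 x 14 patches
greedy_cost = greedy_forward_passes(n_patches)
naive_cost = naive_shapley_forward_passes(n_patches)
```

The greedy count equals $|N|(|N|+1)/2$, which is 19306 for 196 patches, whereas the naive Shapley count is $196\cdot 2^{196}$. Caching repeated coalitions would lower the naive count but would not remove the exponential dependence.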

Set-Sum task.

We now give an intuitive example showing the necessity of interactions using the Set-Sum task. The Set-Sum task is a variant of Problem 1 with a collection of integers $N\subset\mathbb{Z}$ and reward function $f(S)=s$ for $S\subseteq N$, where $s$ denotes the sum of all types of integers (i.e., of the distinct integers) in $S$. For example, $s=3$ for $S=\{2,2,1\}$. Note that for any $i\in N$, we have $f(S_{k-1}\cup\{i\})=f(S_{k-1})+i$ if $i\notin S_{k-1}$ and otherwise $f(S_{k-1}\cup\{i\})=f(S_{k-1})$. In this way, when the newly added features are equal to the features already possessed, the model does not gain new information. This shows the role of interactions in accounting for information redundancy.
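The example $S=\{2,2,1\}$ can be worked through numerically. In the sketch below, the items are referred to by their positions 0, 1, 2 so that the two copies of the integer 2 remain distinct players; this indexing and the variable names are assumptions made for illustration.

```python
def set_sum_reward(values, chosen):
    """Set-Sum reward: sum of the distinct integers among the chosen items."""
    return sum({values[i] for i in chosen})


values = [2, 2, 1]  # the collection from the example, indexed 0, 1, 2


def f(chosen):
    return set_sum_reward(values, chosen)


empty = frozenset()
# Self-context Shapley value of the second "2" (Eq. (5)).
phi0_second_two = f({1}) - f(empty)
# Self-context interaction of the second "2" with the first "2" (Eq. (6)).
i0_with_first_two = f({0, 1}) - f({0}) - f({1}) + f(empty)
# Self-context interaction of the second "2" with the "1" (Eq. (6)).
i0_with_one = f({2, 1}) - f({2}) - f({1}) + f(empty)
```

The second copy of 2 has a self-context Shapley value of 2, but its interaction with the first copy is -2, so adding it yields no gain. Its interaction with the integer 1 is 0, so there the full value of 2 is gained.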

Visual Set-Sum task.

We empirically confirm the advantage of using interactions in the visual Set-Sum task on a synthetic MNIST dataset. In this task, a model must accurately predict the sum of all types of numbers in an image. We constructed composite images, each of which consists of four randomly selected MNIST images arranged in a 2x2 grid (cf. Fig. 2(a)). The label of a composite image is the sum of all types of numbers in the image, as in the Set-Sum problem. The evaluation metric is the insertion curve, as detailed in Sec. 5. For the model and dataset details, refer to Appendix A. The insertion curves in Fig. 2(b) show that MoXI achieves higher accuracy than MoXI(-), which uses only self-context Shapley values, and than the Shapley value method when 50% and 75% of the image area are unmasked, i.e., when the second and the third numbers are appended. This demonstrates that MoXI acquires non-redundant information more effectively.

Figure 2: (a) Example of a synthetic MNIST image in the visual Set-Sum task, labeled 17 by the sum of all types of numbers in the image. (b) Insertion curves. The curves illustrate the change of accuracy when adding image patches gradually with high contributions identified by different methods at various unmasked image rates, ranging from 0 to 100%. These curves use a masking method that fills in zeros for game-theoretic calculations and model input during classification accuracy measurement. MoXI(-) only employs self-context Shapley values, whereas MoXI additionally uses interactions across highly contributing patches.

4.2 Pixel Deletion

In Problem 1, we considered identifying groups of pixels with high confidence scores through pixel insertion. Here, we aim to decrease the confidence scores via pixel deletion.

Problem 2

With the same conditions as outlined in Problem 1, find a subset $S_k \subset N$ such that

$$S_k = \underset{S\subseteq N,\,|S|=k}{\mathrm{arg\,min}}\ \ f(N\setminus S), \tag{8}$$

for $k=1,2,\ldots,|N|$.
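To make the objective in Eq. (8) concrete, the following sketch solves Problem 2 by exhaustive search on a tiny game. The reward `f`, the tables `INFO` and `VALUE`, and the helper `best_deletion_set` are hypothetical stand-ins for a real classifier and are assumptions of this example; the search enumerates all size-$k$ subsets, which is only feasible for very small $|N|$.

```python
from itertools import combinations

# Hypothetical toy reward: pixels 0 and 1 carry the same information "A".
INFO = {0: "A", 1: "A", 2: "B", 3: "C"}
VALUE = {"A": 5, "B": 3, "C": 2}

def f(kept):
    """Toy score when only the pixels in `kept` are visible."""
    return sum(VALUE[g] for g in {INFO[i] for i in kept})

def best_deletion_set(f, N, k):
    """Exhaustively solve Eq. (8): a size-k set S minimizing f(N minus S)."""
    N = frozenset(N)
    return min(combinations(sorted(N), k), key=lambda S: f(N - set(S)))

N = range(4)
for k in range(1, 5):
    S = best_deletion_set(f, N, k)
    print(k, S, f(set(N) - set(S)))
```

Because the number of candidate subsets grows combinatorially with $|N|$, this brute-force search only serves as a reference point for the greedy approach discussed next.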

We again resort to a greedy approach. The key difference is that now we define and utilize a variant of Shapley value that measures the contribution of a player by its absence.

$$\phi_{\mathrm{d}}(i\,|\,N) \stackrel{\mathrm{def}}{=} \sum_{S\subseteq N,\, i\in S} P_{\mathrm{d}}(S\setminus\{i\}\,|\,N)\,\bigl[f(S)-f(S\setminus\{i\})\bigr], \tag{9}$$

where $P_{\mathrm{d}}(A\,|\,B)=\frac{(|B|-|A|-1)!\,|A|!}{|B|!}$. This Shapley value quantifies the average impact attributable to the removal of player $i$. In Problem 1, we defined self-context Shapley values and interactions because that problem incrementally adds pixels to an initially empty (fully masked) image. In contrast, Problem 2 involves the sequential deletion of pixels from the full image, which calls for full-context Shapley values and interactions, defined as follows:

\begin{align}
\phi_{\mathrm{d}}^{(|N|)}(a) &\stackrel{\mathrm{def}}{=} P_{\mathrm{d}}(N\setminus\{a\}\,|\,N)\,[f(N)-f(N\setminus\{a\})] \nonumber\\
&= \frac{1}{|N|}\,[f(N)-f(N\setminus\{a\})] \tag{10}\\
I_{\mathrm{d}}^{(|N|-1)}(a_1,a_2) &\stackrel{\mathrm{def}}{=} \phi_{\mathrm{d}}^{(|N|-1)}(\{a_1,a_2\}\,|\,N\setminus\{a_1,a_2\}\cup\{\{a_1,a_2\}\}) \nonumber\\
&\quad - \phi_{\mathrm{d}}^{(|N|-1)}(a_1\,|\,N\setminus\{a_2\}) - \phi_{\mathrm{d}}^{(|N|-1)}(a_2\,|\,N\setminus\{a_1\}) \nonumber\\
&= \frac{1}{|N|-1}\,[f(N)-f(N\setminus\{a_1\})-f(N\setminus\{a_2\})+f(N\setminus\{a_1,a_2\})]. \tag{11}
\end{align}
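The contrast between Eq. (9) and its full-context counterparts in Eqs. (10) and (11) can be sketched as follows. The toy reward `f` and all function names are assumptions made for illustration; `shapley_d` enumerates every subset containing the player, whereas the full-context quantities need only two and four evaluations of `f`, respectively.

```python
from itertools import combinations
from math import factorial

# Hypothetical toy reward: pixels 0 and 1 carry the same information "A".
INFO = {0: "A", 1: "A", 2: "B", 3: "C"}
VALUE = {"A": 5, "B": 3, "C": 2}

def f(kept):
    """Toy score when only the pixels in `kept` are visible."""
    return sum(VALUE[g] for g in {INFO[i] for i in kept})

def p_d(size_a, size_b):
    """Weight P_d(A | B) = (|B| - |A| - 1)! |A|! / |B|! used in Eq. (9)."""
    return factorial(size_b - size_a - 1) * factorial(size_a) / factorial(size_b)

def shapley_d(f, N, i):
    """Eq. (9) by enumeration: one term per subset S of N that contains i."""
    N = frozenset(N)
    others = sorted(N - {i})
    total = 0.0
    for r in range(len(others) + 1):
        for rest in combinations(others, r):
            S = frozenset(rest) | {i}
            total += p_d(len(rest), len(N)) * (f(S) - f(S - {i}))
    return total

def full_context_shapley(f, N, a):
    """Eq. (10): needs only f(N) and f(N minus {a})."""
    N = frozenset(N)
    return (f(N) - f(N - {a})) / len(N)

def full_context_interaction(f, N, a1, a2):
    """Eq. (11): needs four evaluations of f."""
    N = frozenset(N)
    return (f(N) - f(N - {a1}) - f(N - {a2}) + f(N - {a1, a2})) / (len(N) - 1)

N = range(4)
print([round(shapley_d(f, N, i), 3) for i in N])    # Eq. (9), exponential cost
print([full_context_shapley(f, N, i) for i in N])   # Eq. (10)
print(full_context_interaction(f, N, 0, 1))         # Eq. (11), redundant pair
```

In this toy game, each of the redundant pixels 0 and 1 has a full-context Shapley value of zero because the other one still supplies the same information, while their interaction is nonzero; this is the kind of joint effect that individual scores miss.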

With these quantities, the greedy algorithm for pixel deletion is as follows. For $k=1$, the index $b_1\in N$ of the pixel with the highest (deletion-based) Shapley value $\phi_{\mathrm{d}}^{(|N|)}(b_1)=-\frac{1}{|N|}\,[f(N\setminus\{b_1\})-f(N)]$ gives the optimal set $S_1=\{b_1\}$ by definition. For $k=2$, we select the next pixel $b_2$ that minimizes $f(N\setminus\{b_1,b_2\})$. This choice is again explained as a sum of a Shapley value and an interaction,

\begin{align}
b_2 &= \underset{b\in N\setminus\{b_1\}}{\mathrm{arg\,min}}\ f(N\setminus\{b_1,b\})-f(N) \nonumber\\
&= \underset{b\in N\setminus\{b_1\}}{\mathrm{arg\,max}}\ \phi_{\mathrm{d}}^{(|N|-1)}(\{b_1,b\}\,|\,N\setminus\{b_1,b\}\cup\{\{b_1,b\}\}) \nonumber\\
&= \underset{b\in N\setminus\{b_1\}}{\mathrm{arg\,max}}\ \phi_{\mathrm{d}}^{(|N|)}(b)+(|N|-1)\,I_{\mathrm{d}}^{(|N|-1)}(b_1,b). \tag{12}
\end{align}

For $k\geq 3$, we can similarly show that minimizing $f(N\setminus(S_{k-1}\cup\{b\}))$ with respect to $b$ is equivalent to

\begin{align}
b_k &= \underset{b\in N\setminus S_{k-1}}{\operatorname{argmax}}\ \bigl[\phi_{\mathrm{d}}^{(|N|)}(b) \nonumber\\
&\quad +(|N|-|S_{k-1}|)\,I_{\mathrm{d}}^{(|N|-|S_{k-1}|)}(S_{k-1},b)\bigr]. \tag{13}
\end{align}
Algorithm 2 Identification of a group of pixels in the pixel deletion approach
Require: reward function $f$, index set of all pixels $N$.
Ensure: sequence of subsets $S_1,\ldots,S_{|N|}\subset N$
  $S_k \leftarrow \{\}$ for all $k=0,\ldots,|N|$
  for $k=1,\ldots,|N|$ do
     $b_k \leftarrow \underset{b\in N\setminus S_{k-1}}{\operatorname{argmin}}\ f(N\setminus(S_{k-1}\cup\{b\}))$
     $S_k \leftarrow S_{k-1}\cup\{b_k\}$
  end for
  return $S_1,\ldots,S_{|N|}$
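The greedy deletion loop of Algorithm 2 can be sketched in a few lines. This is an illustrative version under stated assumptions: `f` is a hypothetical toy reward rather than a classifier's confidence score, ties are broken by the smallest pixel index, and the function name `greedy_deletion` is not taken from the authors' repository.

```python
# Hypothetical toy reward: pixels 0 and 1 carry the same information "A".
INFO = {0: "A", 1: "A", 2: "B", 3: "C"}
VALUE = {"A": 5, "B": 3, "C": 2}

def f(kept):
    """Toy score when only the pixels in `kept` are visible."""
    return sum(VALUE[g] for g in {INFO[i] for i in kept})

def greedy_deletion(f, N):
    """Return [S_1, ..., S_|N|], adding at each step the pixel whose
    removal (together with S_{k-1}) minimizes the reward."""
    N = frozenset(N)
    S = frozenset()
    subsets = []
    for _ in range(len(N)):
        # One evaluation of f (one forward pass) per remaining candidate.
        b = min(sorted(N - S), key=lambda c: f(N - (S | {c})))
        S = S | {b}
        subsets.append(S)
    return subsets

for k, S in enumerate(greedy_deletion(f, range(4)), start=1):
    print(k, sorted(S), f(frozenset(range(4)) - S))
```

Step $k$ evaluates `f` once for each of the $|N|-k+1$ remaining pixels, which is where the quadratic number of forward passes comes from.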

Again, the greedy algorithm is described from a game-theoretic viewpoint. The only difference from the insertion case is that the interaction term is now weighted. Algorithm 2 summarises the procedure. The computational cost of the pixel deletion approach is the same as that of the pixel insertion approach, which requires only $\mathcal{O}(|N|^2)$ forward passes in the worst case.

5 Experiments

In this section, we evaluate the characteristics of identified patches through comparative experiments with existing methods and demonstrate the effectiveness of our method.

Setup.

Our experiments use the ImageNet dataset [10] and focus on analyzing a Vision Transformer [11] pre-trained for the classification task. For baseline methods, we use Grad-CAM [23] (the target layer of Grad-CAM is set to the one before the layer normalization in the final attention block of the network; this choice is common, see https://github.com/jacobgil/pytorch-grad-cam), Grad-CAM++ [5], Attention rollout [1], Shapley values, and MoXI(-), which does not use the interactions present in MoXI. For insertion curve experiments, we use the Pixel Insertion approach, while for deletion curves, we use the Pixel Deletion approach. Following previous studies [21, 29], we consider image patches instead of pixels to reduce computational costs. All methods calculate the contributions for $14\times 14$ patches with a patch size of $16\times 16$, which matches the patch size and the number of tokens in standard ViT models. We used pre-trained ViT-T [11] (https://huggingface.co/WinKawaks/vit-tiny-patch16-224), DeiT-T [26] (https://huggingface.co/facebook/deit-tiny-patch16-224), and ResNet-18 [14] (https://huggingface.co/microsoft/resnet-18). We selected 1000 images from the test set, one for each label, all of which were correctly classified. To reduce the computational burden, we computed Shapley values approximately by random sampling of $S$ in Eq. (1), as in other studies [4, 22, 29, 25]. The sampling size is set to 200. Moreover, we adopted feature patch deletion as the masking method for Shapley values and interactions. In the following, we focus on ViT-T. See Appendix B for more results.
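A sampling-based approximation of the Shapley value can be sketched as below. The setup above does not spell out the exact sampler, so this example assumes one standard unbiased scheme: draw a coalition size uniformly, then a uniform coalition of that size, and average the marginal contributions. The toy reward `f` and the name `sampled_shapley` are hypothetical.

```python
import random

# Hypothetical toy reward: pixels 0 and 1 carry the same information "A".
INFO = {0: "A", 1: "A", 2: "B", 3: "C"}
VALUE = {"A": 5, "B": 3, "C": 2}

def f(kept):
    """Toy score when only the pixels in `kept` are visible."""
    return sum(VALUE[g] for g in {INFO[i] for i in kept})

def sampled_shapley(f, N, i, num_samples=200, seed=0):
    """Monte Carlo estimate of the Shapley value of player i."""
    rng = random.Random(seed)
    others = sorted(set(N) - {i})
    total = 0.0
    for _ in range(num_samples):
        size = rng.randrange(len(others) + 1)     # |S| uniform in {0, ..., |N|-1}
        S = frozenset(rng.sample(others, size))   # uniform coalition of that size
        total += f(S | {i}) - f(S)                # marginal contribution of i
    return total / num_samples

print([sampled_shapley(f, range(4), i) for i in range(4)])
```

Each sample costs two evaluations of `f`, so with 200 samples per patch the cost scales linearly in the number of patches, at the price of estimation noise.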

5.1 Evaluating the importance of identified patches

We evaluate the importance of the image patches as determined by the above methods, using insertion/deletion curve metrics. The insertion curve identifies information-rich patches, while the deletion curve helps identify patches important for the model’s decision-making process. In our insertion/deletion curve experiments, we utilized the masking method for patch deletion. For Grad-CAM, Attention rollout, and Shapley value, image patches are inserted and deleted in the same order.
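As a rough sketch of how such curves are traced for a single input, the functions below record the reward after each insertion or deletion step given a patch ordering. Everything here is an assumption for illustration: `f` is a toy reward, the orderings are made up, and the curves in the paper report classification accuracy averaged over images rather than a per-image score.

```python
# Hypothetical toy reward: patches 0 and 1 carry the same information "A".
INFO = {0: "A", 1: "A", 2: "B", 3: "C"}
VALUE = {"A": 5, "B": 3, "C": 2}

def f(kept):
    """Toy score when only the patches in `kept` are visible."""
    return sum(VALUE[g] for g in {INFO[i] for i in kept})

def insertion_curve(f, order):
    """Scores after revealing the first k patches of `order`, k = 0..len(order)."""
    return [f(frozenset(order[:k])) for k in range(len(order) + 1)]

def deletion_curve(f, N, order):
    """Scores after masking the first k patches of `order`, k = 0..len(order)."""
    N = frozenset(N)
    return [f(N - frozenset(order[:k])) for k in range(len(order) + 1)]

print(insertion_curve(f, [0, 2, 3, 1]))            # non-redundant patches first
print(insertion_curve(f, [0, 1, 2, 3]))            # wastes a step on a redundant patch
print(deletion_curve(f, range(4), [2, 3, 0, 1]))
```

A steeper rise of the insertion curve, or a steeper drop of the deletion curve, indicates that the ordering places the more influential patches earlier.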

Figure 3: (a) Insertion curves. (b) Deletion curves. The curves illustrate the change of accuracy when appending (removing) image patches gradually with high contributions identified by different methods at various unmasked (masked) image rates, ranging from 0% to 100%.
Figure 4: Visualization of important image patches by each method. The highlighted image patches are selected based on their contributions calculated by each method. (a) Highlighting the patches incrementally added to a fully masked image until classification success. (b) Highlighting the patches sequentially removed from a full image until classification failure.

The insertion curves in Fig. 3(a) show that MoXI exhibits a sharper increase in classification accuracy than the other methods. In particular, even when only 4% of an image is visible, MoXI achieves an accuracy of 90%, whereas Grad-CAM, Attention rollout, and Shapley value achieve 2%, 4%, and 25%, respectively. This result indicates that MoXI can efficiently identify important patches for classification. Both the self-context and original Shapley values, which are based on confidence scores, also achieve a sharp increase in classification accuracy. However, these two methods calculate the importance of individual patches and often select patches with similar information. Consequently, MoXI identifies patches that yield higher classification accuracy than these methods.

The deletion curves in Fig. 3(b) show that MoXI exhibits a sharper decrease in classification accuracy than the other methods. When just 10% of an image is concealed, MoXI decreases the model’s accuracy to 16%. In contrast, Grad-CAM and Attention rollout only decrease the accuracy to approximately 79% under the same conditions. This result indicates that MoXI, which accounts for interactions between patches, effectively identifies the image patches important for classification. We observed analogous results for DeiT-T [26] and ResNet-18 [14] models, as detailed in Appendix B. Additionally, we discuss the application of masks using our method in Appendix D.

5.2 Confidence score-based visualization

We introduce two heatmap-based visualization methods tailored for analyzing insertion and deletion patches. The first method visualizes insertion patches, highlighting those important for accurate classification. The second focuses on deletion patches, specifically identifying those whose deletion significantly impacts the classification. The heatmap shows higher values, indicated by shades closer to red, for patches that were inserted or deleted earlier. The insertion or deletion stops when the model reaches a successful classification or misclassification.
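One way to turn an insertion or deletion order into such a heatmap is sketched below. The linear scaling, the row-major patch indexing, and the function name `order_to_heatmap` are assumptions of this example; the text only states that patches selected earlier receive higher values and that selection stops once the prediction becomes correct or incorrect.

```python
def order_to_heatmap(selected, grid=14):
    """Map an ordered list of selected patch indices to a grid x grid heatmap.

    `selected` holds only the patches chosen before the stopping point;
    unselected patches keep the value 0.
    """
    heat = [[0.0] * grid for _ in range(grid)]
    m = len(selected)
    for rank, p in enumerate(selected):
        # Earlier selection -> larger value (1.0 for the first patch).
        heat[p // grid][p % grid] = (m - rank) / m
    return heat

# Hypothetical example on a 4 x 4 grid: patches 5, 0, and 15 were selected in turn.
for row in order_to_heatmap([5, 0, 15], grid=4):
    print([round(v, 2) for v in row])
```

The resulting grid can then be upsampled to the image resolution and overlaid with any colormap in which larger values map to red.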

Heatmap visualization.

Figure 4(a) displays a heatmap for patch insertion. Compared to the existing methods, MoXI’s heatmap highlights fewer regions and identifies the class object. Interestingly, MoXI selects patches on the background as well as on the class object. This visualization suggests that both the object and the background are required for classification and demonstrates the usefulness of the interaction.

Figure 4(b) displays a heatmap for patch deletion. The heatmaps generated by MoXI(-) and Grad-CAM display extensive highlights across the image, while MoXI, Attention rollout, and Shapley value show more concentrated highlights on the class object. This finding indicates that these latter methods accurately capture important information from the object. Notably, MoXI places less emphasis on the background than Attention rollout and Shapley value. This result suggests that MoXI effectively narrows down information by selectively deleting the class object, which could be advantageous for precise object localization.

Class-discriminative localization.

Localizing the regions relevant to a specific class improves interpretability and helps in understanding the model’s prediction process. We have extended MoXI to analyze a target class that differs from the model’s prediction. For the detailed visualization, see Appendix F. Figures 5(b) and 5(c) visualize important regions for two classes: the bull mastiff, as predicted by the model, and the tiger cat, the target class. The heatmaps reveal that MoXI highlights the bull mastiff’s facial area and the tiger cat’s face and body. These observations demonstrate that MoXI can identify important groups of image patches relevant to the predicted class and class-specific features important for decision-making.

Figure 5: Visualization of important regions for a targeted class using the proposed method. (a) Original image. (b) Targeting the bull mastiff class, which is predicted by the model. The highlighted patches are those sequentially removed from the full image until the model no longer predicts the bull mastiff class. (c) Targeting the tiger cat class. We first removed the patches that have a positive contribution to the bull mastiff class and a negative contribution to the tiger cat class. Once tiger cat becomes the predicted class of the model, the patches highly contributing to tiger cat are removed sequentially until the prediction changes; these are the highlighted patches.

5.3 Common corruption effect on patch deletion

We investigate the risk of model misclassification when image patches important for model accuracy are disrupted by adding noise. In the deletion curve experiment of Sec. 5.1, we used patch masking to simulate feature absence. Instead of patch masking, we here consider common corruptions [15]: fog and Gaussian noise at level 5 (for other corruptions such as brightness and motion blur, see Appendix G.1). We apply these corruptions to image patches in the order selected for patch deletion in Sec. 5.1.
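The idea of corrupting, rather than masking, the selected patches can be sketched as follows. This is not the corruption implementation of [15]: the noise scale `sigma`, the clipping to $[0,1]$, the single-channel nested-list image, and the function name `corrupt_patches` are all assumptions chosen to keep the example self-contained.

```python
import random

def corrupt_patches(image, patch_ids, patch_size, sigma, seed=0):
    """Add Gaussian noise to the listed patches (row-major indices) only."""
    rng = random.Random(seed)
    out = [row[:] for row in image]              # leave the input untouched
    per_row = len(image[0]) // patch_size
    for p in patch_ids:
        r0 = (p // per_row) * patch_size
        c0 = (p % per_row) * patch_size
        for r in range(r0, r0 + patch_size):
            for c in range(c0, c0 + patch_size):
                noisy = out[r][c] + rng.gauss(0.0, sigma)
                out[r][c] = min(1.0, max(0.0, noisy))   # clip to the valid range
    return out

# Hypothetical 8 x 8 gray image split into four 4 x 4 patches.
image = [[0.5] * 8 for _ in range(8)]
noisy = corrupt_patches(image, patch_ids=[0, 3], patch_size=4, sigma=0.1)
print(noisy[0][:8])
```

Passing the first $k$ patches of a deletion order as `patch_ids` and re-evaluating the classifier for increasing $k$ would trace a corruption-based deletion curve.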

Figure 6(a) shows the effect of Gaussian noise on the deletion curve results. MoXI exhibits a larger decrease in accuracy than the other methods, indicating that the model is vulnerable to Gaussian noise applied to the patches selected by MoXI. This result implies that MoXI efficiently identifies important patches. Figure 6(b) shows the fog corruption results, which are similar to those observed for Gaussian noise. Furthermore, as detailed in Appendix G.1, MoXI similarly affects accuracy with the other common corruptions. Additionally, we evaluate the effect of adversarial perturbations. Interestingly, adversarial perturbations yield distinct results due to their deceptive effect on the model’s internal features (see Appendix G.2).

Figure 6: Deletion curves by image corruptions instead of masking: (a) Gaussian noise and (b) fog. The curves illustrate the change in accuracy along with the increase in the number of corrupted image patches. The patches are corrupted from the highly contributing ones determined by each method.

5.4 Consistent explainability

We examine the consistent explainability of visualization methods, regardless of the internal feature representation, which is a key aspect of explainable artificial intelligence. Specifically, we examine whether the models, trained with varying numbers of classification classes, consistently select important image patches. We evaluate the consistency using insertion and deletion curves for the models trained with datasets containing 10, 20, 100, and 1000 classes. For training the 10-class model, we select images from ImageNet that share labels with CIFAR10. For the models with 20, 100, and 1000 classes, we extend the 10-class dataset by adding images with randomly selected classes from ImageNet. We draw the insertion and deletion curves using the 10-class test images that are correctly classified.

Figures 7(a) and 7(b) show the insertion curve results for Attention rollout and MoXI, respectively. With Attention rollout, accuracy decreases as the number of classes increases. In contrast, MoXI shows only a minor decrease in accuracy. Therefore, MoXI consistently selects important image patches for accurate classification. The results from other methods and deletion experiments are shown in Appendix H. We confirmed that MoXI also provides consistent explainability in the deletion curve experiments.

Figure 7: Insertion curves. (a) Attention rollout, (b) MoXI. The curves illustrate the change in accuracy along with the increase in the number of unmasked image patches. Each curve represents the results from the pretrained models with 10, 20, 100, and 1000 classes, respectively. As the number of classes the model learns increases, the accuracy of Attention rollout significantly decreases, whereas MoXI experiences only a minor decrease in accuracy.

6 Conclusion

This study addressed the problem of identifying a group of pixels that largely and collectively impact confidence scores in image classification models. We justify simple greedy algorithms from a game-theoretic view using Shapley values and interactions. This analysis naturally suggests the use of self-context and full-context variants of Shapley values and interactions. Their computation only requires a quadratic number of forward passes, whereas prior studies compute Shapley values and/or interactions with an exponential number of forward passes or heavy sampling-based approximation. The experimental results show that our method is more accurate in identifying the important image patches for models than popular methods.

Acknowledgments

This work was supported by JSPS KAKENHI Grant Numbers JP22H03658 and JP22K17962.

References

  • Abnar and Zuidema [2020] Samira Abnar and Willem Zuidema. Quantifying attention flow in transformers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4190–4197, Online, 2020. Association for Computational Linguistics.
  • Ancona et al. [2019] Marco Ancona, Cengiz Oztireli, and Markus Gross. Explaining deep neural networks with a polynomial time algorithm for shapley value approximation. In Proceedings of the 36th International Conference on Machine Learning, pages 272–281, Long Beach, California, USA, 2019. PMLR.
  • Binder et al. [2016] Alexander Binder, Grégoire Montavon, Sebastian Lapuschkin, Klaus-Robert Müller, and Wojciech Samek. Layer-Wise Relevance Propagation for Neural Networks with Local Renormalization Layers, pages 63–71. Springer International Publishing, Cham, 2016.
  • Castro et al. [2009] Javier Castro, Daniel Gómez, and Juan Tejada. Polynomial calculation of the shapley value based on sampling. Computers & Operations Research, 36(5):1726–1730, 2009. Selected papers presented at the Tenth International Symposium on Locational Decisions (ISOLDE X).
  • Chattopadhay et al. [2018] Aditya Chattopadhay, Anirban Sarkar, Prantik Howlader, and Vineeth N Balasubramanian. Grad-CAM++: Generalized gradient-based visual explanations for deep convolutional networks. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision. IEEE, 2018.
  • Chefer et al. [2021] Hila Chefer, Shir Gur, and Lior Wolf. Transformer interpretability beyond attention visualization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 782–791, 2021.
  • Cheng et al. [2021] Xu Cheng, Chuntung Chu, Yi Zheng, Jie Ren, and Quanshi Zhang. A game-theoretic taxonomy of visual concepts in DNNs. arXiv preprint arXiv:2106.10938, 2021.
  • Covert et al. [2023] Ian Connick Covert, Chanwoo Kim, and Su-In Lee. Learning to estimate shapley values with vision transformers. In The Eleventh International Conference on Learning Representations, 2023.
  • Deng et al. [2022] Huiqi Deng, Qihan Ren, Hao Zhang, and Quanshi Zhang. Discovering and explaining the representation bottleneck of DNNs. In Proceedings of the International Conference on Learning Representations, 2022.
  • Deng et al. [2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009.
  • Dosovitskiy et al. [2021] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In Proceedings of the International Conference on Learning Representations, 2021.
  • Goodfellow et al. [2015] Ian Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In Proceedings of the International Conference on Learning Representations, 2015.
  • Grabisch and Roubens [1999] Michel Grabisch and Marc Roubens. An axiomatic approach to the concept of interaction among players in cooperative games. International Journal of Game Theory, 28:547–565, 1999.
  • He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
  • Hendrycks and Dietterich [2019] Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. In Proceedings of the International Conference on Learning Representations, 2019.
  • Jethani et al. [2022] Neil Jethani, Mukund Sudarshan, Ian Connick Covert, Su-In Lee, and Rajesh Ranganath. FastSHAP: Real-time Shapley value estimation. In Proceedings of the International Conference on Learning Representations, 2022.
  • Kurakin et al. [2017] Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial examples in the physical world. Proceedings of the International Conference on Learning Representations Workshop, 2017.
  • Lundberg and Lee [2017] Scott M. Lundberg and Su-In Lee. A unified approach to interpreting model predictions. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 4768–4777, Red Hook, NY, USA, 2017. Curran Associates Inc.
  • Madry et al. [2018] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In Proceedings of the International Conference on Learning Representations, 2018.
  • Petsiuk et al. [2018] Vitali Petsiuk, Abir Das, and Kate Saenko. Rise: Randomized input sampling for explanation of black-box models. In Proceedings of the British Machine Vision Conference, 2018.
  • Ren et al. [2021] Jie Ren, Die Zhang, Yisen Wang, Lu Chen, Zhanpeng Zhou, Yiting Chen, Xu Cheng, Xin Wang, Meng Zhou, Jie Shi, and Quanshi Zhang. Towards a unified game-theoretic view of adversarial perturbations and robustness. In Proceedings of the Advances in Neural Information Processing Systems, pages 3797–3810, 2021.
  • Ren et al. [2022] Jie Ren, Zhanpeng Zhou, Qirui Chen, and Quanshi Zhang. Towards a game-theoretic view of baseline values in the shapley value, 2022.
  • Selvaraju et al. [2017] Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of IEEE/CVF International Conference on Computer Vision, pages 618–626, 2017.
  • Shapley [1953] Lloyd S. Shapley. A value for n-person games. In Contributions to the Theory of Games, pages 307–317, 1953.
  • Sumiyasu et al. [2022] Kosuke Sumiyasu, Kazuhiko Kawamoto, and Hiroshi Kera. Game-theoretic understanding of misclassification, 2022.
  • Touvron et al. [2021] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Herve Jegou. Training data-efficient image transformers & distillation through attention. In Proceedings of the 38th International Conference on Machine Learning, pages 10347–10357, 2021.
  • Wang et al. [2020] Haofan Wang, Zifan Wang, Mengnan Du, Fan Yang, Zijian Zhang, Sirui Ding, Piotr Mardziel, and Xia Hu. Score-cam: Score-weighted visual explanations for convolutional neural networks, 2020.
  • Wang et al. [2021] Xin Wang, Jie Ren, Shuyun Lin, Xiangming Zhu, Yisen Wang, and Quanshi Zhang. A unified approach to interpreting and boosting adversarial transferability. In Proceedings of the International Conference on Learning Representations, 2021.
  • Zhang et al. [2021a] Hao Zhang, Sen Li, YinChao Ma, Mingjie Li, Yichen Xie, and Quanshi Zhang. Interpreting and boosting dropout from a game-theoretic view. In Proceedings of the International Conference on Learning Representations, 2021a.
  • Zhang et al. [2021b] Hao Zhang, Yichen Xie, Longjie Zheng, Die Zhang, and Quanshi Zhang. Interpreting multivariate shapley interactions in dnns. In The AAAI Conference on Artificial Intelligence, 2021b.
  • Zhou et al. [2016] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2921–2929, 2016.

Supplementary Material

A Visual Set-Sum Task

We here describe the details of the experiment for the visual Set-Sum task. The dataset consists of composite images, each of which consists of four MNIST images. Each composite image is labeled by the sum of all types of digits in that image (see the example in Fig. 2(a)). The size of a composite image is $56\times 56$, and the patch size is $28\times 28$. As we sample the digits (i.e., MNIST images) uniformly, a composite image has duplicate digits with a probability of roughly 47%. In the test set, each composite image was designed to have its largest digit in two patches, which is the case where using interactions is most advantageous. We trained a ResNet-18 [14] model and evaluated it on a test set of size 10,000. The loss function used for training is $\mathcal{L}_{\text{MNIST}}=\mathcal{L}_{\text{CE}}+\mathcal{L}_{\text{MSE}}$, where the first term denotes the cross-entropy loss and the second term denotes the mean-squared error between the model prediction and the true class. The second term adds a regression flavor and takes a lower value when the model prediction (i.e., the predicted set sum) is closer to the label (i.e., the true set sum). We filled masked patches with zeros both for computing Shapley values and interactions and for the accuracy evaluation.
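The image layout and the zero-filling mask described above can be sketched as follows. Constant-intensity tiles stand in for real MNIST digits, and the helper names `tile`, `compose_2x2`, and `mask_patch`, as well as the row-major patch indexing, are assumptions of this example.

```python
def tile(value, size=28):
    """Stand-in for a size x size MNIST digit image (constant intensity)."""
    return [[value] * size for _ in range(size)]

def compose_2x2(top_left, top_right, bottom_left, bottom_right):
    """Arrange four equally sized tiles in a 2 x 2 grid."""
    top = [a + b for a, b in zip(top_left, top_right)]
    bottom = [a + b for a, b in zip(bottom_left, bottom_right)]
    return top + bottom

def mask_patch(image, patch_id, patch_size=28):
    """Return a copy with one patch (row-major index) filled with zeros."""
    out = [row[:] for row in image]
    per_row = len(image[0]) // patch_size
    r0 = (patch_id // per_row) * patch_size
    c0 = (patch_id % per_row) * patch_size
    for r in range(r0, r0 + patch_size):
        out[r][c0:c0 + patch_size] = [0.0] * patch_size
    return out

image = compose_2x2(tile(0.1), tile(0.2), tile(0.3), tile(0.4))
masked = mask_patch(image, patch_id=1)   # hide the top-right digit
print(len(image), len(image[0]))
print(masked[0][27], masked[0][28])
```

With this layout, $|N| = 4$, so every subset of patches can be evaluated exhaustively, which makes the task convenient for comparing selection strategies.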

B Results of the Insertion/Deletion curve with additional models

In Sec. 5.1, we evaluated the proposed and baseline methods using ViT-T [11]. The insertion and deletion curves show that the proposed method provides the most efficient visual explanation. To show that this result generalizes across models and architectures, we provide results using both DeiT-T [26], a ViT architecture, and ResNet-18 [14], a widely used CNN model. For details of the experiment with DeiT-T, refer to Sec. 5.1. The insertion curve in Fig. 8(a) again shows that MoXI exhibits a sharper increase than the other methods, and the deletion curve in Fig. 8(b) shows that MoXI exhibits a sharper decrease. Fig. 9 shows similar results for ResNet-18. These results indicate that our method can efficiently and accurately identify the patches that are critical to the model’s decision-making process.

Figure 8: Results for DeiT-T: (a) Insertion curves. (b) Deletion curves. The curves illustrate the change in accuracy when inserting (deleting) image patches in order of the contributions computed by each method.
Figure 9: Results for ResNet-18: (a) Insertion curves. (b) Deletion curves. The curves illustrate the change in accuracy when inserting (deleting) image patches in order of the contributions computed by each method.

C Comparison of algorithm complexity and runtime

In this section, we compare the algorithm complexity and runtime for each method.

First, we explain the complexity of each algorithm, focusing on the number of forward passes required per image. Let $|N|$ be the number of patches in an image (typically, $|N|=14^{2}$). Grad-CAM needs $1$ forward pass (and $1$ backward pass), Attention rollout needs $1$ forward pass, and Shapley value needs $\mathcal{O}(|N|\cdot 2^{|N|})$ forward passes. MoXI needs $\mathcal{O}(|N|^{2})$ forward passes in the worst case. The numbers of passes for Shapley value and MoXI are given in Sec. 4; we elaborate on them here. As defined in Eq. (1), computing the Shapley value for the $i$-th pixel requires $\mathcal{O}(2^{|N|})$ passes due to the $2^{|N|-1}$ possible choices of $S$, leading to $\mathcal{O}(|N|\cdot 2^{|N|})$ passes for an entire image. On the other hand, MoXI needs $\mathcal{O}(|N|^{2})$ passes. For example, at the $k$-th step of the greedy insertion, it recruits a new patch from the remaining $|N|-k+1$ patches to maximize the confidence score (i.e., $|N|-k+1$ passes). Over $|N|$ steps, it needs $\sum_{k=1}^{|N|}(|N|-k+1)=|N|(|N|+1)/2=\mathcal{O}(|N|^{2})$ passes in total. Note that this is the worst-case scenario; the algorithm stops when the classification becomes correct, and Fig. 3(a) indicates that more than 90% of evaluation images require fewer than $0.04|N|$ steps. In the runtime experiment, the median number of steps was 6 (with std 7.6) and 10 (with std 11.3) for insertion and deletion, respectively. A similar discussion holds for the deletion case.
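
The pass count of the greedy insertion can be illustrated with a small sketch. This is not the authors' implementation: `greedy_insertion`, `score`, and `is_correct` are hypothetical names, and a toy additive score stands in for the model. Each call to `score` represents one forward pass, and the stopping test is assumed to reuse an output that has already been computed.

```python
def greedy_insertion(n_patches, score, is_correct):
    """Greedily insert patches; return the insertion order and pass count.

    score(subset): stand-in for one forward pass that returns the
        confidence when only the patches in `subset` are visible.
    is_correct(subset): stand-in for the stopping test (assumed to
        reuse the output of an already counted pass).
    """
    selected = []
    remaining = set(range(n_patches))
    passes = 0
    while remaining:
        best_patch, best_score = None, float("-inf")
        for p in sorted(remaining):
            passes += 1  # one forward pass per candidate patch
            s = score(selected + [p])
            if s > best_score:
                best_patch, best_score = p, s
        selected.append(best_patch)
        remaining.remove(best_patch)
        if is_correct(selected):
            break  # early termination once the prediction is correct
    return selected, passes


weights = [0.1, 0.5, 0.2, 0.9]  # toy per-patch contributions (made up)


def toy_score(subset):
    return sum(weights[i] for i in subset)


# Worst case: never stop early, so the passes are 4 + 3 + 2 + 1.
order, passes = greedy_insertion(4, toy_score, lambda s: False)
print(order, passes)  # [3, 1, 2, 0] 10
```

With $|N|=196$ patches, the same worst-case count is $196\cdot 197/2=19{,}306$ passes, and early termination usually stops far before that.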

Furthermore, our method can leverage parallel processing with mini-batches, reducing the number of forward passes to linear in $|N|$ at the cost of additional memory usage. Specifically, the $k$-th step of MoXI can be performed with a single batched forward pass over the $|N|-k+1$ candidate insertions from the remaining patches. Our implementation is based on this parallelization.
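
The following sketch shows how the candidates of one step could be assembled into a mini-batch. It only builds the visibility masks; `candidate_masks` is a hypothetical helper, and a real implementation would apply each mask to the image and pass the stacked inputs through the model in one call.

```python
def candidate_masks(selected, n_patches):
    """Build one visibility mask per remaining patch.

    Every mask keeps the already selected patches visible and reveals
    exactly one additional candidate, so one step needs as many masks
    as there are remaining patches.
    """
    chosen = set(selected)
    candidates = [p for p in range(n_patches) if p not in chosen]
    masks = [
        [(q in chosen) or (q == p) for q in range(n_patches)]
        for p in candidates
    ]
    return candidates, masks


# Second step (k = 2) with 4 patches: patch 2 is already selected,
# so the batch holds 4 - 2 + 1 = 3 candidate masks.
candidates, masks = candidate_masks([2], 4)
print(candidates)  # [0, 1, 3]
print(masks[0])    # [True, False, True, False]
```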

Next, we compare the runtime required for measuring the importance in each method. The comparison is based on the average runtime across 100 images, following the experimental setup described in Sec. 5. For Grad-CAM, Attention rollout, and Shapley value, the runtime represents the duration required to compute the importance of the entire image. In contrast, in the case of MoXI, we separately measure the runtime for pixel insertion until successful classification and for pixel deletion until classification failure. Our experiments were conducted using a machine equipped with a 12-core processor, 64GB RAM, and an NVIDIA RTX 3090.

The runtime for each method is shown in Table 1. MoXI is approximately $30$ times faster than Shapley value for insertion (17.9 s vs. 0.60 s) and approximately $13$ times faster for deletion (17.9 s vs. 1.34 s). Recalling the results from Fig. 3, MoXI also captures an important group of patches more accurately than the Shapley value method does. Therefore, MoXI surpasses Shapley value in both accuracy and runtime. While MoXI is not as fast as Grad-CAM and Attention rollout, we consider it fast enough for most visualization use cases, and its explanations are of higher quality, as our extensive experiments show.

Table 1: Average runtime [sec] over 100 ImageNet images with ViT-T.
Method | Grad-CAM | Attention rollout | Shapley value | MoXI (Ins/Del)
Runtime [sec] | 0.15 | 0.02 | 17.9 | 0.60/1.34

D Analysis of effective layers to remove patches

In Sec. 5.1, we consider the absence of players (i.e., pixels/patches) for calculating Shapley values and interactions in the input space. Specifically, the patches are removed after the input embedding layer. Here, we examine the case where several self-attention layers are masked instead. To this end, we utilize a variant of the attention-masking approach used in [8]. Specifically, let the $k$-th layer be our target layer. Then, a large negative value is added to the product of the query and key matrices in every self-attention layer from the $k$-th to the last. Figure 10 displays the insertion curves when MoXI is applied to various target layers. The experimental setup is the same as in Sec. 5.1. The results show that MoXI performs better with earlier target layers, where it better pinpoints the important features of images.
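
The effect of adding a large negative value to the query-key product can be sketched for a single attention row. This is an illustration under stated assumptions, not the code of [8] or of MoXI: the function name, the constant `-1e9`, and the choice to mask whole key positions are hypothetical.

```python
import math


def masked_attention_weights(logits, masked_keys, neg=-1e9):
    """Softmax over one row of query-key logits with some keys masked.

    Adding a large negative value to a logit drives its softmax weight
    to zero, so the query no longer attends to that patch.
    """
    shifted = [
        v + (neg if j in masked_keys else 0.0)
        for j, v in enumerate(logits)
    ]
    m = max(shifted)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in shifted]
    total = sum(exps)
    return [e / total for e in exps]


# The third key is masked: its weight vanishes and the remaining
# weights are renormalized over the first two keys.
weights = masked_attention_weights([1.0, 2.0, 3.0], masked_keys={2})
print(weights)
```

Applying such a mask only from the $k$-th layer onward leaves the earlier layers free to mix information from the masked patches, which is one way to read the dependence on the target layer.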

Figure 10: Insertion curves. The curves illustrate the accuracy growth when inserting image patches according to the contributions computed by each method. The horizontal axis presents the insertion rate. Shapley values and interactions are computed with attention masking, whereas the inputs used to measure accuracy for the insertion curves are masked by patch deletion.

E Additional results of visualization

We provide additional visualization results in Figs. 11 and 12. As in Sec. 5.2, the results demonstrate that the set of patches highlighted by MoXI is smaller than those highlighted by the other methods.

We observed that MoXI is slightly unstable in the insertion case. Recall that in this case, MoXI appends important patches one by one to an empty set and terminates when the model gives the correct classification. Empirically, the termination can happen at a very early stage, where the confidence score of the correct class is the largest but still very low. If we continue inserting patches, the model prediction can fluctuate among several classes. Note that this does not cause a big problem in most cases; all the insertion curves in this paper consistently show a monotonic increase in classification accuracy as the insertion rate increases. If needed, one can introduce a minimum confidence score $\tau\in[0,1]$ and terminate the insertion when the classification is correct and the confidence score exceeds this threshold. We include this hyperparameter in our official implementation of MoXI.
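
A thresholded stopping rule of this kind could look like the sketch below. The function and argument names are hypothetical and do not reflect the interface of the official implementation; only the rule itself (correct top-1 prediction and confidence above $\tau$) follows the text.

```python
def should_stop_insertion(probs, target, tau=0.0):
    """Return True when insertion may terminate.

    probs: class probabilities for the current partially revealed image.
    target: index of the correct class.
    tau: minimum confidence required in addition to a correct prediction.
    """
    predicted = max(range(len(probs)), key=lambda c: probs[c])
    return predicted == target and probs[target] > tau


# The correct class 0 is the top prediction but with low confidence:
# without a threshold insertion stops, with tau = 0.5 it continues.
probs = [0.30, 0.25, 0.25, 0.20]
print(should_stop_insertion(probs, target=0))           # True
print(should_stop_insertion(probs, target=0, tau=0.5))  # False
```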

Figure 11: Visualization of important image patches by each method. The highlighted image patches are selected based on their contributions calculated by each method. (a) Highlighting the patches incrementally added to an empty image until classification success. (b) Highlighting the patches sequentially removed from a full image until classification failure.
Figure 12: Visualization of important image patches by each method. The highlighted image patches are selected based on their contributions calculated by each method. (a) Highlighting the patches incrementally added to an empty image until classification success. (b) Highlighting the patches sequentially removed from a full image until classification failure.

F Class-discriminative localization

The proposed method was originally designed to identify important pixels to explain the model prediction. Here, we generalize MoXI (for pixel deletion) to visualize such pixels for a given target class, which is used in Fig. 5. To this end, we switch the reward function as follows. Let $x$, $y_{\mathrm{t}}$, and $y_{f(x)}$ be the input image, the target label, and the predicted label, respectively. If $y_{\mathrm{t}}=y_{f(x)}$, we simply use the reward function $f(x)=\log\frac{P(y_{\mathrm{t}}\,|\,x)}{1-P(y_{\mathrm{t}}\,|\,x)}$. Otherwise, we use $f(x)=\log\frac{P(y_{f(x)}\,|\,x)}{1-P(y_{f(x)}\,|\,x)}-\frac{P(y_{\mathrm{t}}\,|\,x)}{1-P(y_{\mathrm{t}}\,|\,x)}$, which helps us identify patches with a positive effect on the confidence score of class $y_{f(x)}$ and a negative effect on that of class $y_{\mathrm{t}}$. The image patches removed in the former case are collected as important patches for class $y_{\mathrm{t}}$.
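
The reward switching can be sketched as below. The names `log_odds` and `class_reward` are hypothetical. The second branch follows the expression exactly as printed in the text, where the target-class term is the odds without a logarithm; it is transcribed here rather than re-derived.

```python
import math


def log_odds(p):
    """Log-odds log(p / (1 - p)) of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))


def class_reward(probs, target, predicted):
    """Reward for target-class deletion, following the two cases in the text."""
    if target == predicted:
        # Target and predicted class agree: log-odds of the target.
        return log_odds(probs[target])
    # Classes differ. As printed, the second term is the odds of the
    # target class (no logarithm); this is an assumption of the sketch.
    p_t = probs[target]
    return log_odds(probs[predicted]) - p_t / (1.0 - p_t)


probs = [0.2, 0.5, 0.3]
print(class_reward(probs, target=1, predicted=1))  # log(0.5 / 0.5) = 0.0
print(class_reward(probs, target=0, predicted=1))  # 0.0 - 0.2 / 0.8, about -0.25
```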

G Patch perturbations

In Sec. 5.3, we evaluated the effectiveness of each method by measuring the classification accuracy when Gaussian and fog noise were applied to the important image patches identified by that method. The deletion curves here are obtained by perturbing patches rather than removing them. Below, we present additional experimental results on common corruptions and adversarial perturbations.

G.1 Common corruptions

We implemented 19 types of common corruptions using the imagecorruptions module with severity 5 (https://github.com/hendrycks/robustness). Figures 13 and 14 show the deletion curves with different corruptions for ViT-T and DeiT-T, respectively. The results demonstrate that our method gives a sharper decrease at the early stage of the deletion curves than the other methods, as in Sec. 5.3.

Figure 13: Deletion curves by image corruptions instead of masking with ViT-T. The curves illustrate the change in accuracy along with the increase in the number of corrupted image patches. The patches are corrupted from the highly contributing ones determined by each method.
Figure 14: Deletion curves by image corruptions instead of masking with DeiT-T. The curves illustrate the change in accuracy along with the increase in the number of corrupted image patches. The patches are corrupted from the highly contributing ones determined by each method.

G.2 Adversarial perturbations

Besides common corruptions, we also investigated the case with adversarial perturbations [12, 17, 19], which are small but malicious perturbations that can largely change the model’s output. We conducted the same experiment as in Sec. G.1 but with adversarial perturbations instead of common corruptions. To obtain adversarial perturbations, we adopted untargeted $L_{2}$ PGD with $\epsilon=1.0$ and step size $\alpha=0.2$. Figures 15(a) and 16(a) present the deletion curves for ViT-T and DeiT-T, respectively. The results show that the attention rollout method gives a slightly sharper decrease than MoXI. This differs from the results for common corruptions. We suspect that adversarial perturbations mostly lie in the patches that are suggested as important by attention rollout. To confirm this, we measured the magnitude of adversarial perturbations on each image patch. Specifically, the magnitude is measured by the $L_{2}$ norm. Figure 16(b) shows the magnitude of the perturbations on each patch. The patches are ordered as in the deletion curves in Fig. 16(a). The results indicate that the importance of image patches identified by attention rollout is well aligned with the magnitude of the perturbations on them. On the other hand, the image patches identified by MoXI contain larger perturbations at the early and late stages than at the middle stage. This may be because attention rollout directly reflects the internal computation of the features when measuring the contributions of image patches, while adversarial perturbations are designed to hack this process. In contrast, MoXI treats a Vision Transformer as a black-box model and is unaware of the internal process.
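
The per-patch magnitude measurement can be sketched as follows, assuming a single-channel perturbation stored as a list of rows and non-overlapping square patches. The helper `patch_l2_norms` is hypothetical; for a color image one would additionally sum the squares over channels.

```python
import math


def patch_l2_norms(delta, patch):
    """L2 norm of a 2D perturbation on each non-overlapping patch.

    delta: list of rows (height x width); both sides are assumed to be
        multiples of `patch`.
    Returns the norms in row-major patch order.
    """
    height, width = len(delta), len(delta[0])
    norms = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            squared = sum(
                delta[r][c] ** 2
                for r in range(top, top + patch)
                for c in range(left, left + patch)
            )
            norms.append(math.sqrt(squared))
    return norms


# A 4x4 perturbation split into four 2x2 patches.
delta = [
    [3.0, 4.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
print(patch_l2_norms(delta, patch=2))  # [5.0, 0.0, 0.0, 1.0]
```

Reordering these norms by the patch ranking of a given method yields a curve like the ones described for Fig. 16(b).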

Figure 15: (a) Deletion curves by adversarial perturbations instead of masking with ViT-T. The curves illustrate the change in accuracy along with the increase in the number of perturbed image patches. The patches are perturbed from the highly contributing ones determined by each method. (b) The amount of adversarial perturbations.
Figure 16: (a) Deletion curves by adversarial perturbations instead of masking with DeiT-T. The curves illustrate the change in accuracy along with the increase in the number of perturbed image patches. The patches are perturbed from the highly contributing ones determined by each method. (b) The amount of adversarial perturbations.

H More results on the stability of explanations

In Sec. 5.4, we evaluated the stability of the explanations of MoXI and attention rollout with respect to the number of classes. Here, we consider both insertion and deletion metrics, using Grad-CAM, attention rollout, Shapley value, and MoXI. Figure 17 shows the insertion and deletion curves. The results again show that MoXI maintains relatively stable accuracy when the model is trained on more classes. In contrast, the other methods show a significant decrease in classification accuracy in such scenarios. Therefore, MoXI acquires important image patches more consistently than the other methods.

Figure 17: (Top) Insertion curves. (Bottom) Deletion curves. The curves illustrate the change in accuracy along with the increase (decrease) in the number of unmasked (masked) image patches. Each curve represents the results from the pretrained models with $10$, $20$, $100$, and $1000$ classes, respectively. (a) Grad-CAM results, (b) Attention Rollout results, (c) Shapley Value results, (d) MoXI results.