Towards Understanding Sensitive and Decisive Patterns in Explainable AI: A Case Study of Model Interpretation in Geometric Deep Learning

Jiajun Zhu¹ (Zhejiang University, Hangzhou, China), Siqi Miao (Georgia Institute of Technology, Atlanta, US), Rex Ying (Yale University, New Haven, US), Pan Li² (Georgia Institute of Technology, Atlanta, US)

¹ Zhu was an intern at Georgia Institute of Technology when doing this project.
² Correspondence: [email protected]

Abstract

The interpretability of machine learning models has gained increasing attention, particularly in scientific domains where high precision and accountability are crucial. This research focuses on distinguishing between two critical data patterns—sensitive patterns (model-related) and decisive patterns (task-related)—which are commonly used as model interpretations but often lead to confusion. Specifically, this study compares the effectiveness of two main streams of interpretation methods: post-hoc methods and self-interpretable methods, in detecting these patterns. Recently, geometric deep learning (GDL) has shown superior predictive performance in various scientific applications, creating an urgent need for principled interpretation methods. Therefore, we conduct our study using several representative GDL applications as case studies. We evaluate thirteen interpretation methods applied to three major GDL backbone models, using four scientific datasets to assess how well these methods identify sensitive and decisive patterns. Our findings indicate that post-hoc methods tend to provide interpretations better aligned with sensitive patterns, whereas certain self-interpretable methods exhibit strong and stable performance in detecting decisive patterns. Additionally, our study offers valuable insights into improving the reliability of these interpretation methods. For example, ensembling post-hoc interpretations from multiple models trained on the same task can effectively uncover the task’s decisive patterns.

1 Introduction

Machine learning (ML) methods can make accurate predictions with a data-driven approach, exhibiting significant promise for various scientific applications [1, 2, 3, 4]. Among these methods, geometric deep learning (GDL) has emerged as a revolutionary approach, especially in the domains where data naturally form point clouds, such as particle clouds in high energy physics (HEP) [5, 6], proteins in biochemistry [7, 8], and molecules in material science [9, 10]. GDL models have shown remarkable predictive performance in many such applications since they excel at learning representations from point cloud data by preserving geometric equivariance and incorporating domain-specific inductive bias [11, 12, 13, 14]. However, the black-box nature hinders the understanding of these models’ decision-making processes. This highlights the urgent need for interpretable GDL models, especially for those employed in scientific applications, where both high precision and accountability are paramount [15].

In terms of model interpretation, two patterns in the data are relevant yet often confused by researchers, which we name sensitive patterns and decisive patterns (see Fig. 1B). Conceptually, sensitive patterns are those whose presence or absence greatly influences the model’s predictions. Sensitive patterns may vary among different models that tackle the same learning task. Decisive patterns, on the other hand, are intrinsic to the learning task and determine the labels of the prediction task, regardless of the specific model being used. Despite their conceptual distinction, existing studies rarely distinguish between them when evaluating different interpretation approaches. Most previous works focus solely on detecting sensitive patterns [16, 17, 18, 19, 20], while several other works hypothesize that the two patterns align and use them interchangeably [21, 22, 23, 24, 25]. This confusion risks misinterpreting evaluation outcomes. For instance, one might employ label-relevant data patterns (i.e., decisive patterns) to assess the quality of the sensitive patterns extracted for a given ML model, or leverage the sensitive patterns a model detects to gain insights into the underlying learning task [26]. Hence, a systematic exploration of the connections and disparities between these two patterns, particularly concerning the capabilities of current model interpretation methods to detect them, is imperative.

There are mainly two categories of methods designed to provide ML models with interpretability [27, 28, 23] (see Fig. 1A). The first category, known as post-hoc methods, operates on already-trained models and aims to interpret their predictive behaviors. Post-hoc methods may conceptually extract sensitive patterns, as the extracted patterns are specific to those already-trained models. The second category comprises self-interpretable methods, which often integrate interpretable modules into model architectures and optimize these modules during model training. These interpretable modules are likely to better uncover decisive patterns because they share the goal of accurately predicting the labels of the task. Nonetheless, the merits and drawbacks of these two categories of methods have sparked ongoing debates and controversies [28, 29]. In particular, there is a lack of systematic investigation and comparison of the ability of these methods to detect each of the two types of data patterns. Most previous studies limit their scope to only one category, either post-hoc methods [23, 30, 31, 32, 33, 34, 29, 35, 36, 37] or self-interpretable methods [38, 39, 40, 41, 42]. Some recent surveys [43, 44, 45, 46] have overviewed both post-hoc and self-interpretable approaches. However, these surveys primarily focus on establishing taxonomies of different interpretability methods and do not further compare and evaluate the two categories of approaches. As a result, the pros and cons of these approaches in detecting the two data patterns remain unresolved questions.

With the growing need for interpretability in scientific fields, this study uses GDL models, which are notably prevalent in these areas [5, 6, 7, 8, 9, 10], as the testbeds to address the above questions. To the best of our knowledge, there have been no studies on GDL model interpretation except [47]; our study therefore also contributes by extending many current methods originally proposed for other models to GDL models and evaluating them systematically in a highly modular software platform published together with this study (https://github.com/Graph-COM/xgdl). Given that geometric point cloud data can be represented as graphs by connecting close points in space, interpretability methods designed for models that encode graph data, e.g., graph neural networks (GNNs) [48, 26, 21], can be extended for comparison in our study. Specifically, we adapt 11 established interpretation methods for GNNs to the GDL setting by incorporating geometric features and application-specific principles such as symmetries of the underlying physical systems. In total, we benchmark 13 interpretation methods with 3 important GDL backbone models to evaluate their abilities to extract sensitive patterns and decisive patterns. The model interpretation pipeline and its evaluation with respect to each type of pattern are illustrated in Figure 1.
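The graph view of a point cloud mentioned above can be sketched as follows. This is a minimal illustration and not the construction used in the released platform: the choice of a k-nearest-neighbor rule (rather than, say, a distance cutoff), the function name `knn_edges`, and the edge layout are all assumptions made for the example.

```python
import numpy as np


def knn_edges(pos: np.ndarray, k: int) -> np.ndarray:
    """Connect each point to its k nearest neighbors (hypothetical helper).

    pos: (n, dim) array of spatial coordinates.
    Returns a (2, n * k) integer array; column j is a directed edge
    (source, target), where source is one of target's k nearest neighbors.
    """
    n = pos.shape[0]
    if not 0 < k < n:
        raise ValueError("k must satisfy 0 < k < number of points")
    # Pairwise Euclidean distances, shape (n, n).
    dist = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=-1)
    # A point is never its own neighbor.
    np.fill_diagonal(dist, np.inf)
    neighbors = np.argsort(dist, axis=1)[:, :k]  # (n, k)
    targets = np.repeat(np.arange(n), k)
    sources = neighbors.reshape(-1)
    return np.stack([sources, targets])


if __name__ == "__main__":
    points = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 0.0], [6.0, 0.0]])
    print(knn_edges(points, k=1))
```

Once the edges are available, a graph-based interpretation method can be applied to the point cloud, with the coordinates carried along as additional geometric features.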

Our study is based on 4 scientific datasets from applications in high-energy physics (HEP) [49, 50] and biochemistry [51, 52]. In these fields, GDL methods have proven extremely effective but urgently need reliable interpretability. These datasets are either collected from actual experiments or from reliable simulations that are extensively used in their corresponding domains. Contrary to previous studies that often use datasets lacking ground-truth labels for decisive patterns or generated by simple rules (e.g., motif-based) [48, 26, 53, 21, 24], the datasets we employed are more realistic and are annotated with decisive patterns according to the underlying scientific principles. This setup allows us to assess the methods’ capability to extract decisive patterns and to measure the alignment between decisive patterns and sensitive patterns.

Figure 1: Overview of GDL model interpretation and its evaluation: Interpretation in geometric deep learning (GDL) tasks involves identifying a subset of points $C_{s}$ from the input point cloud $C$ . Decisive patterns are a subset of points that inherently dictate the labels of the point cloud, specified by the learning task, and their identification accuracy is measured by the alignment between $C_{s}$ and the true decisive patterns (Interpretation ROC-AUC). Sensitive patterns, on the other hand, are the subset of most influential points affecting the model’s predictions, as specified by the model itself. The evaluation of the model’s sensitivity involves assessing the changes of its predictions when $C_{s}$ is either added to or removed from the input (Fidelity AUC).

Our extensive evaluation yields solid evidence that the interpretations given by post-hoc methods generally align well with sensitive patterns but not decisive patterns. In contrast, the interpretations given by some self-interpretable methods, such as LRI-induced methods [47], align well with decisive patterns. In addition, we also observe that some post-hoc methods may face instability issues, i.e., the same method may demonstrate inconsistent performance across different datasets. The performance of self-interpretable methods can be more stable but method-dependent: some self-interpretable methods can effectively identify both decisive patterns and sensitive patterns, whereas others may fail to discern either.

Besides the above high-level observations, our study provides further insights by answering the following three questions. Q1: Given that the interpretations yielded by post-hoc methods do not align well with the decisive patterns, what strategies can enhance the alignment to potentially enable post-hoc methods to detect decisive patterns for the learning tasks? Q2: Do the sensitive patterns of models trained with self-interpretable methods align well with the decisive patterns of the task? In other words, are self-interpretable models inherently sensitive to decisive patterns? Q3: Is the degree of alignment between the sensitive patterns and the decisive patterns influenced by the quality (e.g., prediction accuracy) of the models to be interpreted, and if so, how? The insights from our extensive evaluation on these questions, along with some broader implications and significance, are summarized as follows:

1. We observe that the interpretations given by post-hoc methods vary greatly among different models, even when the models are trained in the same setting, achieve high prediction accuracy, and differ only in their random seeds. This indicates a fundamental limitation of post-hoc methods in detecting decisive patterns for the learning tasks. Nonetheless, we address the problem by ensembling the interpretations yielded for multiple trained models. The ensembled interpretations align much better with decisive patterns, which enables post-hoc methods to more reliably uncover decisive patterns for the learning tasks.

2. The sensitive patterns of some self-interpretable models may align well with the decisive patterns, which indicates that such models can be robust to non-decisive artifacts in the datasets and are mostly sensitive to the decisive patterns. This implies that well-performing self-interpretable methods may produce more reliable and interpretable models and may be a more favorable choice compared to post-hoc methods.

3. Models with higher predictive accuracy tend to have better alignment between their sensitive patterns and the decisive patterns for the learning tasks, suggesting that as predictive performance improves, a model’s predictive behavior becomes increasingly influenced by the decisive patterns. Consequently, strong label prediction performance is a foundational prerequisite if both the sensitive patterns of the model and the decisive patterns for the learning task are desired.
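The ensembling idea in the first point can be sketched in a few lines. The aggregation rule below (min-max normalization per model followed by a plain average) is an assumption chosen for illustration; the text above does not specify how the interpretations are combined, and the helper names are hypothetical.

```python
import numpy as np


def minmax(scores) -> np.ndarray:
    """Rescale one model's importance scores to [0, 1] (assumed normalization)."""
    scores = np.asarray(scores, dtype=float)
    span = scores.max() - scores.min()
    if span == 0:
        return np.zeros_like(scores)
    return (scores - scores.min()) / span


def ensemble_scores(per_model_scores) -> np.ndarray:
    """Average normalized importance scores over models.

    per_model_scores: iterable of length-n score vectors, one per trained
    model (e.g., one per random seed), all for the same point cloud.
    """
    return np.mean([minmax(w) for w in per_model_scores], axis=0)


if __name__ == "__main__":
    seed_a = [0.0, 2.0, 1.0]
    seed_b = [10.0, 30.0, 30.0]
    print(ensemble_scores([seed_a, seed_b]))
```

Normalizing before averaging keeps a model with a large score range from dominating the ensemble; other rules, such as rank averaging, would be equally plausible.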

2 Results

2.1 Evaluation Framework

Model Interpretation Task. We focus on GDL tasks where each data sample is characterized as a point cloud, denoted by $C=(\mathcal{V},\mathbf{X},\mathbf{p})$ . Here, $\mathcal{V}=\{v_{1},v_{2},\ldots,v_{n}\}$ represents a collection of $n$ points, $\mathbf{X}\in\mathbb{R}^{n\times d}$ comprises $d$ -dimensional features for each point, and $\mathbf{p}\in\mathbb{R}^{n\times k}$ specifies 2D or 3D spatial coordinates for these points, depending on the specific application. Each sample $C\in\mathcal{C}$ is associated with a class label $Y\in\mathcal{Y}$ , and GDL models $f_{\theta}(\cdot):\mathcal{C}\rightarrow\mathcal{Y}$ are trained to make a prediction $\hat{Y}$ for each test data instance. Model interpretation is presented as a meaningful subset of points $C_{s}=(\mathcal{V}_{s},\mathbf{X}_{s},\mathbf{p}_{s})$ from $C$ output by an interpretation method. In practice, $C_{s}$ is typically obtained via the following process: An interpretation method will assign a list of importance scores $\mathcal{W}=\{w_{1},w_{2},\ldots,w_{n}\}$ where $w_{i}\in\mathbb{R}$ is for every individual point $v_{i}\in C$ , and output the top-ranked points given a selection ratio $\rho$ as the subset $C_{s}$ , e.g., for $\rho=0.2$ , the top-ranked 20% critical points in $C$ will form $C_{s}$ .
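The selection step described above, from importance scores $\mathcal{W}$ to the subset $C_{s}$, can be written as a short sketch. The function name and the rounding convention for $\rho n$ are assumptions, since the text does not state how non-integer sizes are handled.

```python
import numpy as np


def select_top_points(scores, rho: float) -> np.ndarray:
    """Return sorted indices of the top-ranked points for selection ratio rho.

    scores: length-n importance scores w_1..w_n, one per point.
    rho: fraction of points to keep, 0 < rho <= 1.
    The subset size is round(rho * n), with a minimum of one point
    (an assumed convention).
    """
    scores = np.asarray(scores, dtype=float)
    n = scores.shape[0]
    m = max(1, int(round(rho * n)))
    # Stable sort on negated scores: highest score first, ties by index.
    order = np.argsort(-scores, kind="stable")
    return np.sort(order[:m])


if __name__ == "__main__":
    w = [0.1, 0.9, 0.3, 0.8, 0.2]
    print(select_top_points(w, rho=0.4))
```

The returned indices can then be used to slice $\mathcal{V}$, $\mathbf{X}$ and $\mathbf{p}$ consistently to form $C_{s}$.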

Decisive Patterns. Decisive patterns depend on the learning tasks, which inherently determine the class label $Y$ according to the specific scientific principles and are model-irrelevant. Specifically, each point in $C$ also comes with a binary label $z_{i}$ denoting if it is part of the decisive patterns, collected by $\mathcal{I}=\{z_{1},z_{2},\ldots,z_{n}\}$ . For instance, in the $\operatorname{Tau3Mu}$ dataset (shown later), each sample $C$ has the class label $Y$ indicating the occurrence of decay $\tau\rightarrow\mu\mu\mu$ in $C$ , and those points representing the $\mu$ ’s from this decay are labeled as $z_{i}=1$ . Note that $\mathcal{I}$ is not used during model training but serves exclusively for evaluation by testing the alignment between the results output by interpretation algorithms and $\mathcal{I}$ .

Sensitive Patterns. Sensitive patterns for point cloud data are defined as the most influential subsets of points on the model predictions. Specifically, the influence of any subset is quantified by assessing whether the model prediction changes upon the removal of the subset from the initial data, or whether it remains unchanged when only the subset is provided as input. Mathematically, these two criteria are respectively expressed as $\mathop{\arg\max}_{C_{s}}\left|f_{\theta}(C)-f_{\theta}(C\backslash C_{s})\right|$ and $\mathop{\arg\min}_{C_{s}}\left|f_{\theta}(C)-f_{\theta}(C_{s})\right|$ , where $f_{\theta}$ represents the model and $C_{s}$ is typically constrained with a budget on its size. In this study, we consider a combination of these two criteria $\mathop{\arg\max}_{C_{s}}\left|f_{\theta}(C)-f_{\theta}(C\backslash C_{s})\right|-\left|f_{\theta}(C)-f_{\theta}(C_{s})\right|$ as the definition of sensitive patterns. Note that unlike decisive patterns that are inherently determined by scientific principles and can be labeled accordingly, sensitive patterns are defined as model-specific and are the patterns captured by each model during training, which can vary across different models.

Scientific Datasets. We briefly introduce 4 GDL datasets [47] derived for real-world scientific applications employed in our experiments as below, the details of which can be found in Sec. 4.2.

• $\operatorname{ActsTrack}$ is to reconstruct the properties of charged particles using position measurements from tracking detectors. This process is essential for numerous downstream analyses in HEP, such as identifying particle types and reconstructing collision events [54, 49]. Unlike traditional track reconstruction, the task here involves predicting the presence of $\mu$ tracks from a $z\rightarrow\mu\mu$ decay, and the detector hits left by the $\mu$ ’s correspond to the decisive patterns.

• $\operatorname{Tau3Mu}$ is another HEP dataset, focusing on the detection of the rare and challenging signature of the $\tau\rightarrow\mu\mu\mu$ decay, which is highly suppressed in the Standard Model of particle physics [55, 56]. Therefore, the detection of such decays is a strong indicator of potential new physics [57, 58]. The decisive patterns in this dataset are the detector hits from $\mu$ .

• $\operatorname{SynMol}$ aims at molecular property prediction, hypothesizing that molecules containing specific functional groups could bind to a target protein. Accurate prediction of molecular properties can significantly accelerate drug design and substance discovery efforts [59]. The decisive patterns in this dataset are the atoms in two functional groups: carbonyl and unbranched alkane.

• $\operatorname{PLBind}$ is used to predict protein-ligand binding affinities based on the 3D structures of proteins and ligands. It is a crucial step because a high affinity is one of the major drug selection criteria [60, 61]. The decisive patterns here are the amino acids located in the binding site of the test protein.

Benchmarked Methods. We select three GDL backbone models for interpretation evaluation, namely EGNN [62], DGCNN [63], and PointTrans [64], as they are widely employed in scientific applications [5, 65, 66]. As depicted in Fig. 2, our benchmark includes a total of 13 interpretability methods, which represent a broad spectrum of techniques to provide an inclusive evaluation. We provide a brief introduction of them as follows according to the taxonomy in [23]. The detailed description of each method and the strategy used to adapt the graph-based methods to GDL tasks can be found in Sec. 4.1.

Post-hoc methods generate interpretation results for an already-trained model $f_{\theta}$ by assigning an importance score for each point. Among the four main categories of post-hoc methods, gradient-based and decomposition-based approaches directly utilize the properties of the already-trained model, such as gradients, while perturbation and surrogate methods build an extra learnable explainer $g_{\phi}$ to assign point importance. Specifically, gradient-based methods, including GradxInput [67], GradCAM [68] and Integrated Gradients (IG) [69], compute the gradients with respect to the inputs or the learned (intermediate) point embeddings to identify important points. Decomposition-based methods, e.g., GNNLRP [70], devise score decomposition rules to distribute the prediction scores layer by layer in a back-propagation manner to the input space for identifying points that impact prediction scores the most. Perturbation methods, including GNNExplainer [48], PGExplainer [26] and SubgraphX [21], propose different approaches to perturb inputs and train an explainer $g_{\phi}$ to select important input patterns according to the output variations of $f_{\theta}$ . Surrogate methods, such as PGM-Explainer [24], train another interpretable surrogate model (e.g., probabilistic graphical model) to locally approximate the predictions of the original model and use the trained surrogate model to understand the decision-making process of the original model. Notably, none of these post-hoc methods changes the original $f_{\theta}$ in any way.
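As a concrete instance of the gradient-based family, the sketch below computes a gradient-times-input score per point for an arbitrary scalar-valued model. It is only an illustration under stated assumptions: the gradient is approximated by central finite differences instead of automatic differentiation, the per-point score is taken as the sum over feature dimensions, and the toy linear model is hypothetical. None of these choices is claimed to match the benchmarked implementations.

```python
import numpy as np


def grad_x_input(f, X, eps: float = 1e-5) -> np.ndarray:
    """Gradient-times-input point scores via central finite differences.

    f: callable mapping an (n, d) feature array to a scalar model output.
    X: (n, d) array of point features.
    Returns a length-n vector; entry i sums (df/dX[i, j]) * X[i, j] over j
    (the aggregation over features is an assumption).
    """
    X = np.asarray(X, dtype=float)
    grad = np.zeros_like(X)
    for idx in np.ndindex(X.shape):
        step = np.zeros_like(X)
        step[idx] = eps
        grad[idx] = (f(X + step) - f(X - step)) / (2.0 * eps)
    return (grad * X).sum(axis=1)


if __name__ == "__main__":
    a = np.array([0.5, -1.0])  # weights of a toy linear model

    def toy_model(X):
        return float((X @ a).sum())

    features = np.array([[1.0, 2.0], [3.0, 0.0]])
    print(grad_x_input(toy_model, features))
```

For a linear model the score of point $i$ reduces to $\mathbf{x}_{i}\cdot\mathbf{a}$, which makes the toy case easy to check by hand.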

Self-interpretable methods, on the other hand, design interpretable modules $g_{\phi}$ and integrate such modules into existing backbone models $f_{\theta}$ . The combined models $f_{\theta}\circ g_{\phi}$ are then trained from scratch and are self-interpretable due to the integrated interpretable modules. Common self-interpretable methods fall into the following three categories. Attention-based methods, such as ASAP [71], regard attention distributions as an interpretation, where the parts of the input with higher attention weights are believed to have a greater influence on the model’s decision-making process. The other two categories of methods are based on principles such as the information bottleneck (IB) [72] and causality analysis [73]. The IB-based methods, including LRI-induced methods [47] and VGIB [74], design interpretable modules $g_{\phi}$ to restrict information flow and encourage the model $f_{\theta}\circ g_{\phi}$ to extract input patterns that carry minimal sufficient information for the task. Causality-based methods, such as CIGA [75], assume causal relationships within the data remain unchanged across different environments and aim to extract such invariant data patterns using $g_{\phi}$ .

For post-hoc methods, we begin by pre-training GDL backbone models (i.e., DGCNN, EGNN, and Point Transformer), each with 10 distinct random seeds per dataset, using the cross-entropy loss, and apply all post-hoc methods to each of the already-trained models. The same seeds are used again if the post-hoc methods include any parameters to be optimized in a data-driven way. As self-interpretable methods integrate their interpretable modules into the chosen GDL backbones, self-interpretable models are trained from scratch, each with 10 random seeds as well, using the objective functions specified by each method.

As far as we know, the only methods currently tailored specifically for GDL models are LRI-induced methods [47], i.e., LRI-Bern and LRI-Gaussian; the other methods are adapted by us to the GDL setting. Note that PGM-Explainer [24] and SubgraphX [21] are only evaluated on two of the datasets, i.e., $\operatorname{SynMol}$ and $\operatorname{ActsTrack}$ , due to their extreme inefficiency. For example, PGM-Explainer and SubgraphX require more than 20 and 72 hours, respectively, to train one seed on the $\operatorname{Tau3Mu}$ dataset using an NVIDIA RTX A6000 GPU.

Evaluation Metrics. We employ three widely-used metrics to assess how well a method can extract sensitive patterns and decisive patterns.

To quantify each method’s effectiveness in identifying decisive patterns, we utilize Interpretation ROC-AUC [26] for the three datasets $\operatorname{SynMol}$ , $\operatorname{ActsTrack}$ and $\operatorname{Tau3Mu}$ , and Precision@20 [47] for the $\operatorname{PLBind}$ dataset. Interpretation ROC-AUC is calculated by comparing the decisive patterns $\mathcal{I}$ with the importance scores $\mathcal{W}$ . Precision@20 is the fraction of the top 20 ranked points that belong to the decisive patterns. A higher Interpretation ROC-AUC or Precision@20 indicates that the detected subset of points $C_{s}$ aligns better with the decisive patterns.
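Both decisive-pattern metrics can be computed directly from the point labels $\mathcal{I}$ and the importance scores $\mathcal{W}$ of a sample. The sketch below is a plain NumPy illustration; the function names are hypothetical, and whether the benchmark computes the metrics per sample or pooled over the test set is not specified here, so the example treats a single sample.

```python
import numpy as np


def interpretation_roc_auc(z, w) -> float:
    """ROC-AUC of importance scores w against binary point labels z.

    Computed as the probability that a decisive point (z = 1) receives a
    higher score than a non-decisive point (z = 0), counting ties as 0.5.
    """
    z = np.asarray(z, dtype=bool)
    w = np.asarray(w, dtype=float)
    pos, neg = w[z], w[~z]
    if pos.size == 0 or neg.size == 0:
        raise ValueError("both classes must be present")
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())


def precision_at_k(z, w, k: int = 20) -> float:
    """Fraction of the k highest-scored points that are decisive."""
    z = np.asarray(z, dtype=bool)
    top = np.argsort(-np.asarray(w, dtype=float), kind="stable")[:k]
    return float(z[top].mean())


if __name__ == "__main__":
    labels = [1, 0, 1, 0]
    scores = [0.9, 0.8, 0.3, 0.1]
    print(interpretation_roc_auc(labels, scores))
    print(precision_at_k(labels, scores, k=2))
```

The pairwise form is quadratic in the number of points, which is acceptable for a sketch; a rank-based formulation would scale better on large point clouds.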

To evaluate the ability of each method to identify sensitive patterns, we use Fidelity AUC [23] on all four datasets. This involves calculating Fidelity+ and Fidelity- given a subset $C_{s}$ derived from the importance scores $\mathcal{W}$ provided by an interpretation method, which correspond to the two criteria used in defining sensitive patterns, respectively. Fidelity+ is calculated by taking the average value over the test dataset of the expression $\mathbbm{1}(f_{\theta}(C)=Y)-\mathbbm{1}(f_{\theta}(C\setminus C_{s})=Y)$ for each test data instance $C$ . A higher Fidelity+ indicates more sensitivity of $f_{\theta}$ to the pattern $C_{s}$ . Fidelity- is the corresponding average of $\mathbbm{1}(f_{\theta}(C)=Y)-\mathbbm{1}(f_{\theta}(C_{s})=Y)$ , with lower values reflecting more sensitivity. The overall Fidelity score is defined as the arithmetic mean of Fidelity+ and $1-$ Fidelity-. To comprehensively evaluate performance, we vary the size of $C_{s}$ and compute Fidelity AUC as the area under the curve of Fidelity (versus the size of $C_{s}$ ). A higher Fidelity AUC suggests that the interpretations are more closely aligned with the model’s sensitive patterns. For a more detailed description of the calculation of Fidelity AUC, we refer readers to Sec. 4.3.
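The Fidelity computation above can be sketched as follows. This is an illustration under assumptions, not the procedure of Sec. 4.3: the `predict` callable, the rounding of the subset size, the grid of selection ratios, and the normalization of the area by the ratio range are all choices made for the example.

```python
import numpy as np


def fidelity(predict, clouds, labels, scores, rho: float) -> float:
    """Mean of Fidelity+ and (1 - Fidelity-) at selection ratio rho.

    predict: callable mapping an (m, d) array of points to a class label.
    clouds: list of (n_i, d) arrays; labels: list of true labels Y;
    scores: list of length-n_i importance score vectors.
    """
    plus, minus = [], []
    for C, y, w in zip(clouds, labels, scores):
        n = len(C)
        m = max(1, int(round(rho * n)))  # assumed rounding convention
        keep = np.zeros(n, dtype=bool)
        keep[np.argsort(-np.asarray(w, dtype=float), kind="stable")[:m]] = True
        correct_full = float(predict(C) == y)
        plus.append(correct_full - float(predict(C[~keep]) == y))   # remove C_s
        minus.append(correct_full - float(predict(C[keep]) == y))   # keep only C_s
    return 0.5 * (np.mean(plus) + (1.0 - np.mean(minus)))


def fidelity_auc(predict, clouds, labels, scores, ratios) -> float:
    """Trapezoidal area under Fidelity versus rho, divided by the rho range."""
    ratios = np.asarray(ratios, dtype=float)
    vals = np.array([fidelity(predict, clouds, labels, scores, r) for r in ratios])
    area = np.sum(0.5 * (vals[1:] + vals[:-1]) * np.diff(ratios))
    return float(area / (ratios[-1] - ratios[0]))


if __name__ == "__main__":
    # Toy model: predicts 1 if any point has a feature above 0.5.
    def toy_predict(C):
        return int(C.size > 0 and C[:, 0].max() > 0.5)

    clouds = [np.array([[0.9], [0.1], [0.2], [0.0]]),
              np.array([[0.1], [0.2], [0.3], [0.0]])]
    labels = [1, 0]
    scores = [c[:, 0] for c in clouds]
    print(fidelity_auc(toy_predict, clouds, labels, scores, [0.25, 0.5]))
```

In the toy case, removing the top-ranked point flips the prediction only for the first sample, so Fidelity+ is 0.5, Fidelity- is 0, and the overall score is 0.75 at both ratios.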

2.2 Benchmarking Interpretability Performance

Table 1: Fidelity AUC and Interpretation ROC-AUC or Precision@20 performance of the 13 methods. Bold and underline highlight the first and second best results within each category of methods. $\textbf{Bold}^{\dagger}$ highlights the best results across all methods in terms of Interpretation ROC-AUC or Precision@20. The results are reported as mean $\pm$ std.

Method	$\operatorname{SynMol}$						$\operatorname{ActsTrack}$
	Fidelity AUC			Interpretation ROC AUC			Fidelity AUC			Interpretation ROC AUC
	EGNN	DGCNN	PointTrans	EGNN	DGCNN	PointTrans	EGNN	DGCNN	PointTrans	EGNN	DGCNN	PointTrans
GNNLRP	$\underline{76.28}\pm 1.85$	$\mathbf{61.69}\pm 9.54$	$49.93\pm 2.66$	$\underline{81.75}\pm 4.01$	$\mathbf{84.61}\pm 3.81$	$50.38\pm 1.68$	$\underline{93.41}\pm 1.96$	$\mathbf{90.39}\pm 2.01$	$50.48\pm 11.32$	$\mathbf{86.01}\pm 2.31$	$\underline{86.40}\pm 5.08$	$50.20\pm 1.93$
GradCAM	$57.62\pm 4.52$	$51.14\pm 2.27$	$66.39\pm 2.62$	$57.82\pm 4.42$	$78.89\pm 3.84$	$84.10\pm 3.66$	$\mathbf{93.55}\pm 3.07$	$80.23\pm 3.77$	$\underline{89.75}\pm 4.07$	$\underline{69.38}\pm 2.72$	$75.25\pm 3.67$	$\underline{77.32}\pm 2.83$
GradxInput	$71.11\pm 5.04$	$52.02\pm 2.38$	$68.31\pm 1.82$	$76.03\pm 4.82$	$71.39\pm 5.89$	$78.03\pm 1.52$	$79.19\pm 1.89$	$72.58\pm 3.43$	$79.80\pm 3.08$	$68.74\pm 1.84$	$65.17\pm 1.56$	$64.78\pm 1.90$
IG	$72.98\pm 7.12$	$50.66\pm 1.10$	$\underline{72.71}\pm 2.38$	$78.59\pm 7.83$	$64.31\pm 9.14$	$\underline{84.23}\pm 1.83$	$79.22\pm 1.86$	$72.49\pm 3.41$	$79.68\pm 3.17$	$68.78\pm 1.82$	$65.27\pm 1.47$	$64.80\pm 1.88$
GNNExplainer	$58.55\pm 12.91$	$50.03\pm 0.54$	$27.60\pm 2.80$	$58.94\pm 15.89$	$51.03\pm 5.58$	$26.28\pm 2.70$	$50.71\pm 16.03$	$69.10\pm 5.47$	$87.00\pm 2.74$	$51.77\pm 4.41$	$64.34\pm 4.05$	$71.38\pm 2.61$
PGExplainer	$67.62\pm 16.36$	$51.06\pm 2.02$	$66.17\pm 2.17$	$77.92\pm 22.04$	$49.56\pm 39.82$	$\mathbf{87.41}\pm 2.66$	$28.18\pm 40.3$	$\underline{89.83}\pm 5.59$	$\mathbf{95.24}\pm 2.61$	$33.54\pm 23.17$	$\mathbf{92.63}\pm 1.57$	$\mathbf{88.39}\pm 3.13$
PGM-Explainer	$60.40\pm 2.55$	$50.45\pm 0.41$	$53.83\pm 1.29$	$64.59\pm 0.87$	$51.45\pm 1.57$	$58.99\pm 0.75$	$80.37\pm 3.21$	$54.44\pm 2.24$	$56.17\pm 2.32$	$62.89\pm 1.13$	$55.06\pm 1.29$	$55.00\pm 1.02$
SubgraphX	$\mathbf{88.06}\pm 1.28$	$\underline{59.13}\pm 7.86$	$\mathbf{80.20}\pm 1.22$	$\mathbf{86.7}\pm 1.78$	$\underline{68.26}\pm 4.33$	$77.82\pm 1.00$	$92.00\pm 2.91$	$83.68\pm 3.40$	$86.57\pm 1.42$	$62.93\pm 2.86$	$60.08\pm 0.05$	$62.78\pm 0.94$
ASAP	$59.25\pm 6.70$	$65.68\pm 6.18$	$51.32\pm 4.66$	$64.20\pm 11.69$	$79.55\pm 6.82$	$57.10\pm 8.54$	$\mathbf{92.2}\pm 4.93$	$88.49\pm 4.97$	$64.63\pm 14.77$	$\underline{81.03}\pm 4.25$	$89.00\pm 2.22$	$64.69\pm 13.35$
CIGA	$55.38\pm 11.2$	$51.12\pm 3.09$	$47.29\pm 14.05$	$62.19\pm 21.23$	$61.90\pm 22.34$	$51.21\pm 21.69$	$36.22\pm 37.22$	$48.07\pm 37.99$	$36.51\pm 37.68$	$43.98\pm 21.15$	$47.31\pm 31.47$	$40.53\pm 24.11$
LRI-Bern	$\underline{80.97}\pm 3.82$	$\underline{74.74}\pm 3.76$	$\underline{70.41}\pm 1.94$	$\underline{92.04}\pm 3.00$	$\underline{94.20}\pm 4.53$	$\underline{90.46}\pm 1.21$	$87.63\pm 2.19$	$\underline{90.52}\pm 1.84$	$\mathbf{92.31}\pm 1.22$	$80.97\pm 2.07$	$\underline{90.74}\pm 1.72$	$\underline{86.84}\pm 1.85$
LRI-Gaussian	$\mathbf{82.97}\pm 3.26$	$\mathbf{77.26}\pm 2.67$	$\mathbf{75.12}\pm 1.72$	$\mathbf{97.13}^{\dagger}\pm 0.79$	$\mathbf{98.23}^{\dagger}\pm 1.00$	$\mathbf{93.06}^{\dagger}\pm 1.19$	$\underline{90.93}\pm 3.85$	$\mathbf{91.12}\pm 1.55$	$\underline{90.17}\pm 3.19$	$\mathbf{92.93}^{\dagger}\pm 1.58$	$\mathbf{94.18}^{\dagger}\pm 0.88$	$\mathbf{91.85}^{\dagger}\pm 1.15$
VGIB	$79.02\pm 3.05$	$53.13\pm 2.73$	$56.82\pm 1.93$	$88.19\pm 3.23$	$93.88\pm 6.51$	$72.03\pm 2.96$	$71.09\pm 20.48$	$78.20\pm 8.11$	$74.39\pm 18.87$	$60.27\pm 11.06$	$90.30\pm 2.12$	$72.31\pm 15.31$

Method	$\operatorname{Tau3Mu}$						$\operatorname{PLBind}$
	Fidelity AUC			Interpretation ROC AUC			Fidelity AUC			Precision@20
	EGNN	DGCNN	PointTrans	EGNN	DGCNN	PointTrans	EGNN	DGCNN	PointTrans	EGNN	DGCNN	PointTrans
GNNLRP	$\underline{61.96}\pm 1.61$	$\mathbf{63.68}\pm 1.52$	$49.13\pm 0.45$	$76.00\pm 0.82$	$\mathbf{73.15}\pm 1.9$	$50.00\pm 0.21$	$82.47\pm 6.01$	$\mathbf{75.41}\pm 7.77$	$50.04\pm 4.44$	$\mathbf{67.17}\pm 9.03$	$\mathbf{63.34}^{\dagger}\pm 5.33$	$45.41\pm 2.94$
GradCAM	$61.03\pm 2.65$	$60.66\pm 1.60$	$\underline{62.34}\pm 0.82$	$74.49\pm 3.69$	$\underline{68.48}\pm 3.58$	$\mathbf{80.72}^{\dagger}\pm 0.80$	$79.27\pm 6.56$	$\underline{73.33}\pm 5.87$	$70.03\pm 10.55$	$57.89\pm 6.57$	$\underline{60.65}\pm 5.20$	$57.48\pm 6.29$
GradxInput	$61.14\pm 1.98$	$\underline{63.28}\pm 1.30$	$61.78\pm 0.79$	$\mathbf{77.50}\pm 2.36$	$68.25\pm 0.27$	$68.69\pm 0.42$	$\underline{83.21}\pm 7.28$	$65.30\pm 9.43$	$74.97\pm 5.56$	$60.60\pm 3.67$	$55.91\pm 7.01$	$\mathbf{59.10}\pm 5.22$
IG	$61.45\pm 2.36$	$60.37\pm 0.80$	$60.66\pm 0.79$	$\mathbf{77.50}\pm 2.62$	$65.39\pm 0.26$	$68.35\pm 0.43$	$\mathbf{85.97}\pm 5.84$	$65.84\pm 13.50$	$76.90\pm 5.22$	$\underline{60.68}\pm 4.31$	$53.15\pm 6.82$	$\underline{57.58}\pm 5.36$
GNNExplainer	$61.94\pm 1.22$	$48.17\pm 2.85$	$36.45\pm 1.23$	$71.98\pm 2.38$	$52.51\pm 3.29$	$30.47\pm 0.77$	$52.4\pm 10.25$	$44.56\pm 11.09$	$\mathbf{77.96}\pm 5.57$	$42.27\pm 3.69$	$44.15\pm 3.14$	$57.57\pm 4.01$
PGExplainer	$\mathbf{62.09}\pm 1.66$	$48.92\pm 11.88$	$\mathbf{62.74}\pm 0.87$	$\underline{76.10}\pm 2.32$	$52.24\pm 19.53$	$\underline{78.60}\pm 0.55$	$59.8\pm 29.83$	$55.87\pm 28.57$	$\underline{77.10}\pm 7.49$	$56.36\pm 9.97$	$47.82\pm 10.03$	$55.93\pm 6.53$
ASAP	$56.27\pm 1.45$	$52.58\pm 3.80$	$50.10\pm 0.24$	$66.63\pm 1.67$	$69.54\pm 0.61$	$52.15\pm 3.77$	$50.14\pm 0.30$	$49.87\pm 0.22$	$50.19\pm 0.34$	$45.05\pm 0.16$	$45.10\pm 0.00$	$45.10\pm 0.00$
CIGA	$51.00\pm 17.51$	$43.32\pm 11.84$	$48.74\pm 10.2$	$54.07\pm 23.31$	$43.95\pm 19.53$	$49.72\pm 17.67$	$49.26\pm 1.91$	$49.24\pm 9.62$	$50.79\pm 1.94$	$45.55\pm 7.68$	$41.75\pm 4.40$	$48.74\pm 5.54$
LRI-Bern	$\underline{63.07}\pm 1.79$	$\mathbf{65.52}\pm 1.67$	$\underline{62.83}\pm 0.99$	$\underline{77.51}\pm 2.79$	$\underline{78.23}\pm 1.26$	$\underline{78.02}\pm 0.82$	$\underline{50.94}\pm 3.36$	$\underline{56.99}\pm 2.77$	$51.83\pm 3.93$	$\underline{72.14}\pm 4.89$	$\underline{61.75}\pm 7.72$	$\mathbf{68.86}^{\dagger}\pm 7.81$
LRI-Gaussian	$\mathbf{63.60}\pm 1.44$	$\underline{64.62}\pm 0.95$	$\mathbf{63.76}\pm 1.28$	$\mathbf{80.48}^{\dagger}\pm 0.49$	$\mathbf{81.41}^{\dagger}\pm 0.63$	$\mathbf{79.88}\pm 0.46$	$\mathbf{54.97}\pm 4.87$	$50.26\pm 4.65$	$\underline{55.40}\pm 8.55$	$\mathbf{74.40}^{\dagger}\pm 0.64$	$\mathbf{63.16}\pm 5.25$	$\underline{57.70}\pm 5.70$
VGIB	$59.86\pm 3.91$	$58.92\pm 5.86$	$51.37\pm 13.53$	$73.88\pm 5.56$	$72.70\pm 11.80$	$54.16\pm 26.22$	$47.14\pm 9.64$	$\mathbf{66.63}\pm 4.05$	$\mathbf{75.24}\pm 6.76$	$46.16\pm 6.91$	$54.66\pm 6.34$	$56.31\pm 6.00$

In this subsection, we benchmark 8 post-hoc and 5 self-interpretable methods to evaluate their effectiveness in extracting sensitive patterns and decisive patterns. It is important to note that since sensitive patterns are model-specific, comparing the extraction capabilities of sensitive patterns between post-hoc and self-interpretable methods is not appropriate (all post-hoc methods work on the same already-trained models, but self-interpretable methods train new models from scratch using their proposed objectives). Below we briefly describe the results presented in Table 1.

2.2.1 Benchmarking Post-Hoc Methods

Regarding the performance of extracting sensitive patterns of the already-trained models, SubgraphX outperforms most other post-hoc methods across backbone models and datasets. Its success is likely due to its unique, albeit computationally intensive, approach that employs Monte Carlo tree search to identify important points. However, its computational complexity limits its applicability to larger datasets like $\operatorname{Tau3Mu}$ and $\operatorname{PLBind}$ . Following closely is GNNLRP, which demonstrates strong performance on various datasets, but its performance declines when applied to Point Transformer. We speculate this drop stems from the manually crafted propagation rules needed by GNNLRP, which may conflict with the architecture of Point Transformer. PGExplainer represents a more intricate case. While it can achieve the third-best Fidelity AUC when its results are stable, i.e., with low variances, it may fail on models trained with certain random seeds, especially when applied to EGNN. Among gradient-based approaches, which yield consistent results across all settings likely because they do not involve a separate learning phase, only GradCAM stands out as competitive. On the downside, GNNExplainer and PGM-Explainer underperform in our benchmark, revealing their limitations in effectively extracting sensitive patterns in the GDL tasks considered.

Regarding the performance of extracting decisive patterns of the learning tasks, GNNLRP achieves leading Interpretation ROC-AUC scores when it is not paired with Point Transformer. Surprisingly, this time GradCAM performs rather competitively, surpassing SubgraphX in most cases, while other gradient-based methods still perform subpar. As for PGExplainer, again, it provides unstable results with high variances, even though it shows great performance in a few cases. GNNExplainer and PGM-Explainer, similarly, do not seem to work well in our experiments.

2.2.2 Benchmarking Self-Interpretable Models

Although self-interpretable methods are not designed to detect sensitive patterns for a given model, it is still interesting to see whether the models trained by self-interpretable methods are sensitive to their extracted interpretation patterns. Notably, LRI-Bern and LRI-Gaussian achieve relatively high Fidelity AUC scores. As for the remaining models, VGIB performs third best overall but suffers from high variances on some datasets, ASAP occasionally exhibits high Fidelity AUC scores but generally lags behind, and CIGA appears ill-suited when adapted to GDL, even with significant parameter tuning.

As for the extraction of decisive patterns, LRI-Bern and LRI-Gaussian consistently deliver superior performance in all settings, significantly outperforming other methods, including post-hoc ones. VGIB and ASAP follow LRI-induced methods in performance, yet ASAP demonstrates high variances across different settings.

2.2.3 Comparing Post-Hoc and Self-Interpretable Methods

To summarize, regarding the ability to identify decisive patterns, post-hoc methods, notably SubgraphX and GNNLRP, often offer decent results, but top-performing self-interpretable methods, e.g., LRI-Gaussian, significantly outperform all post-hoc methods. This suggests using the interpretations output by self-interpretable methods when one cares more about the decisive patterns of the learning tasks. Furthermore, the generally poor Interpretation ROC-AUC performance of post-hoc methods, in contrast to their relatively high Fidelity AUC, indicates that post-hoc interpretations may not align well with the decisive patterns, and we will further investigate this issue in Sec. 2.3.

Note that one cannot directly compare post-hoc and self-interpretable methods regarding their capabilities of detecting sensitive patterns, as the models to be interpreted are revised when one applies self-interpretable methods. Nonetheless, we can still see a trend that self-interpretable methods achieving better Interpretation ROC-AUC (the metric for detecting decisive patterns) typically obtain better Fidelity AUC (the metric for detecting sensitive patterns). Moreover, as the achieved Fidelity AUC scores of some self-interpretable methods are generally comparable with those yielded by post-hoc methods, the models trained based on self-interpretable methods are also sensitive to the interpretations these methods output.

2.3 Relationship of Post-Hoc Extracted Interpretations and Decisive Patterns

Besides our benchmark, in this section, we study the question (Q1): Given that the interpretations given by post-hoc methods do not align well with the decisive patterns (i.e., post-hoc methods tend to exhibit poor performance regarding Interpretation ROC-AUC despite having high Fidelity AUC), what strategies can enhance the alignment to potentially enable post-hoc methods to detect decisive patterns for the learning tasks?

2.3.1 Investigating the General Misalignment Between Sensitive Patterns and Decisive Patterns

We hypothesize that the misalignment between post-hoc interpretations and decisive patterns is essentially caused by the general misalignment between the sensitive patterns of a model and the decisive patterns of the learning tasks, as post-hoc methods are designed primarily to detect those model-specific sensitive patterns. Based on this hypothesis, we propose to check the Fidelity AUC obtained when the labeled ground-truth decisive patterns of each sample are directly supplied as the identified important points, which we term the Decisive-Induced Fidelity AUC. This is built on the assumption that the model should be sensitive to decisive patterns if the two patterns are well-aligned. We assess the Decisive-Induced Fidelity AUC across 50 models for each backbone using the $\operatorname{SynMol}$ and $\operatorname{ActsTrack}$ datasets, and we visualize the distribution of Decisive-Induced Fidelity AUC.

As shown in Figure 3, the Decisive-Induced Fidelity AUC scores are generally low across diverse datasets and backbones and are even significantly lower than the Fidelity AUC scores of the corresponding post-hoc methods (Supplementary Table 2). Note that here all the models are well trained and some of them may even have accuracy as high as 98% (Supplementary Table 1). These observations imply a fundamental misalignment between the two patterns for the GDL models. Moreover, Decisive-Induced Fidelity AUC scores also exhibit substantial standard deviations, suggesting that models trained with different random seeds have significantly varying levels of sensitivity to decisive patterns. This highlights the need to distinguish between the objectives of extracting sensitive patterns and decisive patterns in practical applications. Specifically, if the goal is to identify what patterns a model is sensitive to, post-hoc methods such as GNNLRP prove effective. However, if the goal is to derive knowledge from the data by extracting decisive patterns, one should be conservative when applying post-hoc methods since they may produce interpretations that misalign with decisive patterns.

2.3.2 The Ensemble Strategy to Improve the Alignment

Table 2: Performance of extracting decisive patterns using post-hoc methods with the ensemble strategy. Numbers in the parentheses indicate the improvement upon the Interpretation ROC-AUC reported in Table 1.

Method	SynMol			ActsTrack			Tau3Mu			PLBind
Method	EGNN	DGCNN	PointTrans	EGNN	DGCNN	PointTrans	EGNN	DGCNN	PointTrans	EGNN	DGCNN	PointTrans
GNNExplainer	$70.98(12.04)$	$45.49(-5.54)$	$39.92(13.64)$	$62.47(10.70)$	$75.73(11.39)$	$80.08(8.70)$	$77.52(5.54)$	$57.22(4.71)$	$41.64(11.17)$	$49.10(6.83)$	$46.30(2.15)$	$61.50(3.93)$
GNNLRP	$84.42(2.67)$	$87.26(2.65)$	$53.62(3.24)$	$89.97(3.96)$	$89.76(3.36)$	$53.25(3.05)$	$78.87(2.87)$	$75.99(2.84)$	$49.75(-0.25)$	$74.40(7.23)$	$62.20(-1.14)$	$50.60(5.19)$
GradCAM	$63.38(5.56)$	$82.93(4.04)$	$90.12(6.02)$	$75.05(5.67)$	$80.19(4.94)$	$86.15(8.83)$	$80.50(6.01)$	$71.95(3.47)$	$83.06(2.34)$	$62.10(4.21)$	$61.80(1.15)$	$58.70(1.22)$
GradxInput	$82.42(6.39)$	$78.41(7.02)$	$82.67(4.64)$	$69.61(0.87)$	$65.43(0.26)$	$65.08(0.30)$	$80.14(2.64)$	$68.20(-0.05)$	$69.35(0.66)$	$62.40(1.80)$	$57.90(1.99)$	$63.70(4.60)$
IG	$91.87(13.28)$	$69.78(5.47)$	$87.48(3.25)$	$69.65(0.87)$	$65.49(0.22)$	$65.11(0.31)$	$79.29(1.79)$	$65.41(0.02)$	$68.70(0.35)$	$63.70(3.02)$	$60.60(7.45)$	$59.60(2.02)$
PGExplainer	$96.20(18.28)$	$94.69(45.13)$	$90.51(3.10)$	$63.59(30.05)$	$95.01(2.38)$	$91.71(3.32)$	$78.60(2.50)$	$71.90(19.66)$	$80.01(1.41)$	$64.90(8.54)$	$62.50(14.68)$	$61.30(5.37)$
PGM-Explainer	$68.83(4.24)$	$53.31(1.86)$	$63.94(4.95)$	$70.26(7.37)$	$58.52(3.46)$	$61.58(6.58)$	-	-	-	-	-	-
SubgraphX	$92.32(5.62)$	$76.84(8.58)$	$82.38(4.56)$	$64.60(1.67)$	$60.21(0.13)$	$63.93(1.15)$	-	-	-	-	-	-

The above misalignment disqualifies post-hoc interpretations from being used directly as the decisive patterns of the learning tasks. However, an interesting question remains: if the significant variation in the sensitive patterns across models is removed, can we safely treat post-hoc interpretations as approximations of the decisive patterns?

Therefore, we propose employing an ensemble of the post-hoc interpretations for multiple already-trained models as a strategy to enhance the extraction of decisive patterns. In our experiments, we apply each post-hoc method on 10 models trained with different seeds, resulting in 10 importance scores for each point in the point cloud $C$ . Then, we utilize a weighted average aggregation to yield a final score for each point. The weight is determined by the fidelity of each post-hoc explainer, calculated as $\mathop{\max}\{0,\text{Fidelity AUC}-50\}$ and then normalized, so that explainers with a Fidelity AUC at or below 50 receive zero weight. Note that this weight relies neither on the classification label $Y$ nor on the labels of ground-truth decisive patterns $\mathcal{I}$ . The quality of this ensemble strategy is evaluated in Interpretation ROC-AUC or Precision@20, as shown in Table 2.
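The weighted aggregation described above can be sketched as follows. This is a minimal illustration rather than the released implementation: it assumes Fidelity AUC is reported on a 0-100 scale, that the weight clips at zero (i.e., $\max\{0,\text{Fidelity AUC}-50\}$), and that uniform weights are used when every explainer has zero weight. The function name `ensemble_scores` and the fallback rule are hypothetical.

```python
def ensemble_scores(per_model_scores, fidelity_aucs):
    """Fidelity-weighted average of per-point importance scores.

    per_model_scores: list of M lists, one importance score per point.
    fidelity_aucs: list of M Fidelity AUC values on a 0-100 scale.
    """
    # Assumption: explainers with Fidelity AUC <= 50 get zero weight.
    raw = [max(0.0, auc - 50.0) for auc in fidelity_aucs]
    total = sum(raw)
    if total == 0.0:
        # Fallback chosen for this sketch only: uniform weights.
        weights = [1.0 / len(raw)] * len(raw)
    else:
        weights = [r / total for r in raw]
    n_points = len(per_model_scores[0])
    return [
        sum(w * scores[j] for w, scores in zip(weights, per_model_scores))
        for j in range(n_points)
    ]


# Three models explaining the same 3-point cloud.
scores = [
    [0.9, 0.1, 0.5],
    [0.7, 0.3, 0.5],
    [0.0, 1.0, 0.5],
]
aucs = [80.0, 60.0, 45.0]  # the third explainer is ignored (weight 0)
final = ensemble_scores(scores, aucs)
print(final)
```

Only the Fidelity AUC of each explainer enters the weights, so neither $Y$ nor $\mathcal{I}$ is needed at aggregation time.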

On average, the ensemble method significantly improves the identification of decisive patterns based on post-hoc interpretations by 12.97%, 9.42%, 7.02%, and 8.43% on the $\operatorname{SynMol}$ , $\operatorname{ActsTrack}$ , $\operatorname{Tau3Mu}$ , and $\operatorname{PLBind}$ datasets, respectively. The most significant boost is observed for the $\operatorname{SynMol}$ dataset, which could potentially arise from the numerous spurious correlations (i.e., correlations between the irrelevant input environment features and the labels) within this dataset [47]. Spurious correlations are likely to be captured by the models and subsequently extracted as interpretations by post-hoc methods, even though they are essentially irrelevant to decisive patterns. The ensemble strategy helps filter out these irrelevant non-decisive patterns. We claim that ensembling the post-hoc interpretations across multiple already-trained models is necessary. To see this, we also evaluate an ensemble of multiple post-hoc interpretations generated based on multiple random seeds but for the same already-trained model (see Supplementary Table 3), which yields much worse performance than that in Table 2.

2.4 Are Sensitive Patterns of Self-Interpretable Models Aligned Well with Decisive Patterns?

Table 3: Interpretation ROC-AUC values derived from post-hoc methods applied to self-interpretable models trained via LRI-induced methods. In each setting, scores are averaged across three model backbones. An underline indicates that the score is lower than that of the corresponding ERM model. For comprehensive results with individual backbones, refer to Supplementary Table 4.

Interpretation Method	$\operatorname{SynMol}$			$\operatorname{ActsTrack}$
Interpretation Method	ERM Model	LRI-Bern Model	LRI-Gaussian Model	ERM Model	LRI-Bern Model	LRI-Gaussian Model
GNNLRP	$72.25\pm 9.50$	$\underline{71.89}\pm 9.53$	$74.08\pm 4.22$	$74.20\pm 9.32$	$75.81\pm 8.41$	$76.41\pm 9.13$
GradCAM	$73.60\pm 11.92$	$\underline{73.48}\pm 14.41$	$75.15\pm 15.62$	$73.98\pm 9.22$	$77.32\pm 8.82$	$77.50\pm 12.24$
GradxInput	$75.15\pm 12.23$	$83.07\pm 14.08$	$83.41\pm 12.66$	$66.23\pm 5.30$	$66.52\pm 5.19$	$74.48\pm 6.65$
IG	$75.71\pm 18.80$	$82.10\pm 17.05$	$86.14\pm 9.81$	$66.28\pm 5.17$	$66.61\pm 5.22$	$74.50\pm 6.58$
GNNExplainer	$45.42\pm 24.17$	$47.16\pm 0.29$	$49.84\pm 0.03$	$62.50\pm 11.07$	$63.41\pm 0.10$	$\underline{49.97}\pm 0.03$
PGExplainer	$71.63\pm 64.52$	$77.70\pm 0.96$	$\underline{42.12}\pm 0.45$	$71.52\pm 27.87$	$87.42\pm 0.18$	$85.59\pm 0.18$
Self	$-$	$92.23\pm 8.74$	$96.14\pm 2.98$	$-$	$86.18\pm 5.64$	$92.99\pm 3.61$

Given that the interpretations given by LRI-induced methods also demonstrate high Fidelity AUC (Table 1), it might indicate that the models trained by LRI-induced methods are already sensitive to the decisive patterns. In other words, the sensitive patterns and decisive patterns are potentially well aligned for these models. To verify this conjecture, we apply post-hoc methods to the models trained by LRI-induced methods and evaluate the Interpretation ROC-AUC by comparing the obtained post-hoc interpretations with the decisive patterns of these tasks. The Interpretation ROC-AUC of the model that has the same architecture but goes through standard training pipelines is used as a baseline.

As illustrated in Table 3, for most post-hoc methods and settings, the post-hoc interpretations of LRI-induced models demonstrate better alignment with the decisive patterns than the post-hoc interpretations of the same model architectures trained via standard empirical risk minimization (ERM); the underlined entries mark the few exceptions. Notably, these interpretations often show a significant improvement, with gains reaching up to 15.9%. These observations support the claim that LRI-induced models are inherently sensitive to decisive patterns.

2.5 Model Prediction Accuracy Indicates the Alignment Between the Two Patterns

Here, we explore how prediction accuracy impacts the interpretation results of the models (Q3). Specifically, we train 150 models for each dataset and backbone with various training recipes, resulting in models with a wide range of classification accuracy. Due to the large number of trained models in this study, we run the four most efficient post-hoc methods and summarize the Interpretation ROC-AUC results in Fig. 4(a). For each dataset, we divide the models into five intervals according to their classification accuracy and, within each interval, draw four box plots of the Interpretation ROC-AUCs given by the four post-hoc methods. In parallel, we also train models with the five self-interpretable methods and present the results in the same way in Fig. 4(b).

As shown in Fig. 4(a), as the model’s classification performance improves, the Interpretation ROC-AUC of post-hoc methods tends to improve roughly linearly with it. A similar increasing trend is evident in Fig. 4(b). This suggests that a model is indeed more sensitive to decisive patterns when it achieves better prediction accuracy. This makes sense because a model that captures the decisive patterns of the learning task tends to generalize better. Therefore, high model prediction accuracy can be viewed as a good indicator when one would like to detect the decisive patterns of the learning task by analyzing the sensitive patterns of the model.

3 Discussion

Main Conclusions. This work has systematically investigated two important categories of model interpretation approaches (post-hoc interpretation vs. self-interpretation) for GDL models, regarding their capabilities of detecting two types of data patterns (sensitive patterns vs. decisive patterns) that were often confused by previous model interpretation studies. Sensitive patterns are model-specific, and post-hoc methods present reasonable performance when detecting them. Our evaluation shows that, among all post-hoc methods, SubgraphX achieves the best extraction of sensitive patterns. Decisive patterns are task-specific and independent of the learning models. Self-interpretable methods can produce better and more stable interpretation results when detecting decisive patterns. Among self-interpretable methods, LRI-Gaussian often achieves the best performance.

Our investigation reveals the fundamental misalignment between post-hoc interpretations and decisive patterns, which is mainly caused by variations in the sensitive patterns of the pre-trained models. We observe that high model prediction accuracy often serves as a strong indicator of alignment between the sensitive patterns of the pre-trained models and the decisive patterns of the task. We also propose an ensemble method that combines post-hoc interpretations of multiple pre-trained models to improve the alignment between the post-hoc interpretations and decisive patterns. Furthermore, our investigation finds that models trained by some self-interpretable methods may inherently be more sensitive to decisive patterns compared to models trained in standard ERM.

Significance for ML Researchers and Domain Scientists. Our results contribute to the advancement of studying interpretability in ML methods and its applications in scientific domains. The observed misalignment between the two patterns underscores the necessity for subsequent ML researchers to clearly establish their objectives of identifying sensitive patterns or decisive patterns before applying or devising any interpretability methods. Our observations also indicate that it can be inappropriate to directly compare methods designed for different objectives. For example, it can be unfair to compare the ability of post-hoc methods to uncover decisive patterns with self-interpretable methods, suggesting the need for different evaluation frameworks for different methods.

For domain scientists, our study highlights post-hoc methods to be better-suited for validating the reliability of already-trained models, i.e., checking whether a model’s predictive behavior is mainly sensitive to patterns related to established scientific principles, while if the goal is to uncover (unknown) insights and knowledge from the data, self-interpretable methods designed to extract decisive patterns can be a better choice. Moreover, for a task where no models can achieve high prediction accuracy, trying to uncover decisive patterns using interpretability methods may be futile. Due to the relative infancy of GDL as a field, extensive analysis of GDL methods for science remains scarce, with even fewer studies on interpretable GDL for scientific purposes. Therefore, we have established a solid foundation for researchers in this promising direction and have paved the way for the development of a trustworthy GDL pipeline for scientific applications.

Limitations. Our work has several potential limitations in the context of ML for science. First, we have not considered task-specific backbone models, which are popular alternatives in scientific ML and whose interpretability may be of significant value to domain scientists. Second, in addition to the performance metrics used in this study, there exist several other metrics meaningful for evaluating the identification of the two types of patterns, such as the consistency across different backbones [31], the fairness in multilabel classification [76], and the stability under input perturbations [77]. These are beyond our focus and may further complement this study in the future. Third, due to the limited number of interpretability methods specially designed to work directly on the geometric features in GDL, our benchmark only examines the selection of a subset of points as the model interpretation and does not elaborate on how the spatial distribution of this subset of points may itself constitute a model interpretation. However, GDL models may capture more fine-grained geometric patterns because of the nature of GDL tasks. Therefore, the examination of geometric coordinates may provide more scientific insights from a different perspective [47].

4 Methods

In this section, we will provide more details on the scientific applications and the datasets, along with some implementation details of our experiments. We will also describe the various interpretation methods applied in our study.

4.1 More Details on Interpretation Methods

4.1.1 Post-Hoc Methods

Given a pre-trained model, post-hoc methods generate model interpretations without modifying the model’s learned weights. They are primarily categorized into five groups. From each category, we select one of the representative methods and extend it to GDL models.

Gradient-Based Methods [67, 68, 69, 78, 79] compute gradients with respect to the input or intermediate activations to explain trained models. GradxInput [67] calculates the element-wise product between the input features and the gradient with respect to them to measure the importance of different input patterns. GradCAM [68] extends GradxInput by using intermediate activations, multiplying the gradients of activations with the activations themselves to derive importance scores. Integrated Gradients (IG) [69] assigns importance values to each input feature by integrating the gradients with respect to the input along a path from a non-informative input to the actual input. In GDL models, gradients are computed with respect to the input point coordinates for IG and GradxInput. For GradCAM, gradients are derived from intermediate activations.
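To make the two input-level attributions concrete, the following sketch computes GradxInput and IG on a toy point-cloud model whose gradient is available in closed form. Everything here is an illustrative assumption: the model (a sigmoid over a shared linear map of the coordinates), the weights `W`, the all-zero baseline, the midpoint-rule approximation of the path integral, and the aggregation of per-coordinate attributions into one score per point. GradCAM is omitted because it requires intermediate activations.

```python
import math

W = (0.8, -0.5, 0.3)  # hypothetical shared weights of a toy model


def model(points):
    """Toy permutation-invariant model: sigmoid of summed linear responses."""
    z = sum(w * c for p in points for w, c in zip(W, p))
    return 1.0 / (1.0 + math.exp(-z))


def grad(points):
    """Analytic gradient of the toy model w.r.t. every coordinate."""
    out = model(points)
    scale = out * (1.0 - out)  # derivative of the sigmoid
    return [[scale * w for w in W] for _ in points]


def grad_x_input(points):
    """Element-wise product of the input and its gradient."""
    return [[g * c for g, c in zip(gp, p)] for gp, p in zip(grad(points), points)]


def integrated_gradients(points, baseline=None, steps=50):
    """IG along the straight path from the baseline to the input."""
    if baseline is None:
        baseline = [(0.0, 0.0, 0.0)] * len(points)  # assumed non-informative input
    avg = [[0.0, 0.0, 0.0] for _ in points]
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint rule
        interp = [
            tuple(b + alpha * (c - b) for c, b in zip(p, bp))
            for p, bp in zip(points, baseline)
        ]
        g = grad(interp)
        for i in range(len(points)):
            for d in range(3):
                avg[i][d] += g[i][d] / steps
    return [
        [(c - b) * a for c, b, a in zip(p, bp, ap)]
        for p, bp, ap in zip(points, baseline, avg)
    ]


def point_scores(attributions):
    """One score per point: sum of absolute coordinate attributions (our choice)."""
    return [sum(abs(a) for a in row) for row in attributions]


points = [(1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (2.0, 0.5, 1.0), (0.0, 0.0, 0.0)]
gxi_scores = point_scores(grad_x_input(points))
ig = integrated_gradients(points)
ig_scores = point_scores(ig)
# Completeness check: IG attributions sum to f(input) - f(baseline).
ig_total = sum(a for row in ig for a in row)
ig_target = model(points) - model([(0.0, 0.0, 0.0)] * len(points))
print(gxi_scores, ig_scores, ig_total, ig_target)
```

For a real GDL backbone, the analytic `grad` would be replaced by automatic differentiation; the attribution formulas themselves stay the same.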

Decomposition-Based Methods [80, 70, 81, 82] build score decomposition rules to distribute the prediction scores layer by layer, in a back-propagation manner, to the input space in order to identify the points that contribute the most to the prediction scores. Inspired by the layer-wise relevance propagation (LRP) algorithm [80], GNNLRP [70] studies the importance of different walks in the graph (i.e., sequences of edges) to interpret GNNs. The importance of each edge is determined by considering all graph walks that contain it. To adapt the method from GNNs to GDL, we extend the framework to evaluate the significance of points within the point cloud. This is achieved by aggregating the importance scores derived from the walks that traverse each point.

Perturbation Methods [48, 26, 21, 83, 84, 37] generate perturbation masks to select important input by optimizing the output variations of the trained models with respect to different input perturbations. GNNExplainer [48] employs soft masks, learned through mask optimization for individual input graphs, to elucidate the model’s predictions. In contrast, PGExplainer [26] develops a parameterized mask generator that produces approximated discrete masks to interpret the predictions more effectively. SubgraphX [21] delves into subgraph-level interpretation for GNNs, leveraging the Monte Carlo tree search algorithm [85] to identify the most important subgraph for a trained model with efficient node pruning. When adapting these methods to GDL, the perturbations are conducted at point-level instead of edge-level or subgraph-level.

Surrogate Methods [24, 86, 87] employ an interpretable surrogate model to locally approximate the predictions of the complex ML models. PGM-Explainer [24] builds a probabilistic graphical model to fit the local dataset and to interpret the predictions of the original GNN model. The adaptation of this method is natural due to the fundamental similarity in the structural representation of data across these domains. Specifically, point clouds in GDL are treated analogously to graphs in GNNs. Other related surrogate methods include GraphLime [86] and RelEx [87], which are not incorporated into our study due to their specific design for node classification tasks.

4.1.2 Self-Interpretable Methods

Different from post-hoc methods, self-interpretable methods may design new self-interpretable modules and integrate such modules into existing backbone models. The combined models are then trained from scratch and are interpretable. We adapt the self-interpretable methods for GNNs to GDL models.

Attention-Based Methods [71] utilize the values of attention weights to identify important input patterns. For example, ASAP [71] captures the importance of each node in a given graph, preserving the hierarchical graph structure information by iteratively performing score generation, node selection, and graph coarsening. Other typical attention-based methods include GAT [88] and GATv2 [89]. We extend these attention weights from nodes in graphs to points in point clouds.

IB-Induced Methods [74, 47] usually inject noise to restrict the flow of information and encourage the model to learn to denoise the data, preserving the most relevant information. VGIB [74] injects noise into the node representations via a learned probability for each node. Instead of perturbing representations, LRI-induced methods [47] perturb inputs by sampling stochastic noise from a learnable distribution, where the distribution can be formulated as a Bernoulli distribution to perturb the existence of input points or as a Gaussian distribution to perturb geometric features. GIB [90] is also an IB-induced method, but it is a preliminary work to VGIB and thus has not been included in our benchmark.
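The two kinds of input perturbation mentioned above can be illustrated with a short sampling sketch. It only shows the noise-injection step under fixed, hand-picked parameters; how the parameters are learned and the training objective are not shown. The function names, the per-point keep probabilities, and the isotropic per-point standard deviations are hypothetical simplifications, not the formulation of [47].

```python
import random


def bernoulli_perturb(points, keep_probs, rng):
    """Perturb point existence: point i survives with probability keep_probs[i]."""
    return [p for p, q in zip(points, keep_probs) if rng.random() < q]


def gaussian_perturb(points, sigmas, rng):
    """Perturb geometric features: add Gaussian noise with std sigmas[i] to point i."""
    return [
        tuple(c + rng.gauss(0.0, s) for c in p)
        for p, s in zip(points, sigmas)
    ]


rng = random.Random(0)
points = [(0.0, 1.0, 2.0), (3.0, 4.0, 5.0), (6.0, 7.0, 8.0)]

# Keep probability 1.0 never drops a point; 0.0 always drops it.
kept = bernoulli_perturb(points, [1.0, 0.0, 1.0], rng)

# Standard deviation 0.0 leaves the coordinates of a point untouched.
noisy = gaussian_perturb(points, [0.0, 1.0, 0.5], rng)
print(kept, noisy)
```

In this simplified picture, the learned keep probability or noise level of a point serves as its importance signal, which is why such parameters can be read off as an interpretation once training has finished.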

Causality-Based Methods [75] assume that the causal relationships within the data remain unchanged across different environments. Built upon three structural causal models, CIGA [75] aims to maximize the mutual information of graphs in the dataset with the same label and utilizes contrastive learning with supervised sampling for approximation and optimization. DIR [91] also belongs to the category of causality-based methods.

4.2 Scientific Applications and Datasets

Since one of our ultimate goals is to promote scientific discovery, we employ datasets that are derived from real-world scientific applications. These datasets not only have a class label for each sample for model training but also provide, for evaluating interpretability methods, a label for each point indicating the ground-truth decisive patterns that determine the class label according to specific scientific principles. Below we introduce the 4 datasets in more detail.

ActsTrack [49] is a dataset for particle tracking in HEP, focusing on the reconstruction of charged particles’ properties through position measurements from tracking detectors. This process is vital for identifying particle types, reconstructing collision events, suppressing background noise, and isolating rare events of interest. The output of this task serves as the fundamental input of many downstream analyses in HEP experiments [54, 49]. In the context of evaluating interpretability methods, the task is reformulated to predict the occurrence of a $z\rightarrow\mu\mu$ decay within each point cloud $C$ . Here, each sample $C$ is associated with a binary class label $Y$ indicating the presence or absence of the $z\rightarrow\mu\mu$ decay, and the points representing the $\mu$ ’s from the decay are labeled as ground-truth decisive patterns, since their existence directly indicates the occurrence of the $z\rightarrow\mu\mu$ decay.

Tau3Mu [50] is another HEP dataset aiming at evaluating algorithms designed to detect the rare and challenging signature of charged lepton flavor-violating decays, specifically the $\tau\rightarrow\mu\mu\mu$ decay, using simulated data of muon detector hits from proton-proton collisions. These decays are highly suppressed in the Standard Model of particle physics [55, 56], making their detection indicative of new physics beyond the Standard Model [57, 58]. However, the $\tau\rightarrow\mu\mu\mu$ events are predicted to occur at an extremely low rate, approximately at a branching fraction of $10^{-8}$ , rendering it impractical to collect ample real experiment data for training ML models. Consequently, research in this direction leverages carefully calibrated simulation algorithms to generate labeled data for model training. This reliance underscores the critical importance of validating the trustworthiness of trained models in the context of high-stakes LHC experiments. Within the $\operatorname{Tau3Mu}$ dataset, each point cloud sample $C$ is assigned a binary class label $Y$ , indicating whether or not there occurs a $\tau\rightarrow\mu\mu\mu$ event in $C$ . To evaluate interpretability methods, this dataset has also labeled the points in each $C$ that represent the $\mu$ ’s from the decay as ground-truth decisive patterns.

SynMol [51] centers on molecular property prediction, a critical task to accelerate the discovery and development of new materials and drugs. By learning from vast datasets of molecules and their properties in a data-driven way, ML algorithms may predict the properties of unseen molecules with high accuracy, which can surpass traditional computational methods in both speed and precision [59]. Effective ML models for this task enable scientists to efficiently screen countless compounds and identify promising candidates for in-depth analysis, significantly reducing the time and cost needed by experimental testing. Nevertheless, the opaque nature of many ML models presents challenges in understanding the underlying reasons for the predicted properties, and interpretable methods for this task extend the objective beyond merely accurate property prediction of individual molecules but also require the identification of critical data patterns (e.g., certain functional groups) that induce the predicted properties, which can further enrich our understanding and guide future discoveries. For the $\operatorname{SynMol}$ dataset, the task is to predict molecules’ properties determined by two functional groups: carbonyl and unbranched alkane [51], and atoms within these functional groups are labeled as decisive patterns.

PLBind [52] focuses on predicting protein-ligand binding affinities. This task is crucial for drug discovery and design, as understanding how well a drug (ligand) binds to its target protein can inform the efficacy of a potential therapeutic. ML models that can predict binding affinities accurately can significantly facilitate the drug development pipeline by replacing the less efficient traditional docking simulations. Interpretable ML models are invaluable for elucidating the complex mechanisms of protein-ligand interactions, specifically by identifying the critical regions of interaction, or binding sites, on the protein surface. This insight is instrumental in understanding the binding mechanism, guiding the rational design of more effective and targeted therapeutics. For the $\operatorname{PLBind}$ dataset, binary classifiers are trained to predict protein-ligand pairs with high or low binding affinities, and those amino acids near the binding site are labeled as ground-truth decisive patterns for evaluating interpretability methods.

Table 4: Definition of metrics, where $n$ is the total number of samples in a dataset, $y_{i}$ denotes the ground-truth class label of the $i^{th}$ sample $C^{(i)}$ , $\hat{y}_{i}$ denotes the predicted label yielded using the raw input $C^{(i)}$ , $\hat{y}_{i}^{\rho+}$ is the predicted label yielded using $C^{(i)}\backslash C_{s}^{(i)}$ and the size of $C_{s}^{(i)}$ is determined by $\rho$ , and $\hat{y}_{i}^{\rho-}$ is the prediction output yielded using $C_{s}^{(i)}$ given a specified $\rho$ . Additionally, $\mathbbm{1}(\cdot)$ is an indicator function that outputs $1$ if the specified condition is true, and $0$ otherwise; $\operatorname{AUC}(\cdot)$ computes the area under a given curve.

Metric Name	Definition	Measurement
Fidelity+@ $\rho$	$\frac{1}{n}\sum_{i=1}^{n}(\mathbbm{1}(\hat{y}_{i}=y_{i})-\mathbbm{1}(\hat{y}_{i}^{\rho+}=y_{i}))$	The impact of $C\backslash C_{s}$
Fidelity-@ $\rho$	$\frac{1}{n}\sum_{i=1}^{n}(\mathbbm{1}(\hat{y}_{i}=y_{i})-\mathbbm{1}(\hat{y}_{i}^{\rho-}=y_{i}))$	The impact of $C_{s}$
Fidelity+ AUC	$\operatorname{AUC}(\text{Fidelity+ Curve})$	The averaged impact of $C\backslash C_{s}$ across various sizes of $C_{s}$
Fidelity- AUC	$\operatorname{AUC}(\text{Fidelity- Curve})$	The averaged impact of $C_{s}$ across various sizes of $C_{s}$
Fidelity AUC	$(\text{Fidelity+ AUC}-\text{Fidelity- AUC}+1)/2$	The overall impact of the identified important points
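
The definitions in Table 4 can be sketched in code as follows. This is an illustration under stated assumptions, not the released evaluation code: the rounding rule for the size of $C_{s}$, the trapezoidal rule, and the normalization of each AUC by the range of $\rho$ are our choices for the sketch, and `toy_model` together with all example data is hypothetical. Scores are returned on a 0-1 scale.

```python
def split_top_rho(scores, rho):
    """Return (indices of C_s, indices of the remaining points) for ratio rho."""
    n = len(scores)
    k = min(n, max(1, int(rho * n + 0.5)))  # rounding rule is an assumption
    order = sorted(range(n), key=lambda j: scores[j], reverse=True)
    return order[:k], order[k:]


def fidelity_at_rho(model, samples, labels, all_scores, rho):
    """Fidelity+@rho and Fidelity-@rho, following Table 4."""
    plus = minus = 0
    for points, y, scores in zip(samples, labels, all_scores):
        top, rest = split_top_rho(scores, rho)
        hit_full = int(model(points) == y)
        hit_rest = int(model([points[j] for j in rest]) == y)  # C minus C_s
        hit_top = int(model([points[j] for j in top]) == y)    # C_s only
        plus += hit_full - hit_rest
        minus += hit_full - hit_top
    n = len(samples)
    return plus / n, minus / n


def curve_auc(xs, ys):
    """Trapezoidal area under a curve, normalized by the x-range."""
    area = sum(
        (xs[i + 1] - xs[i]) * (ys[i] + ys[i + 1]) / 2 for i in range(len(xs) - 1)
    )
    return area / (xs[-1] - xs[0])


def fidelity_auc(model, samples, labels, all_scores, rhos):
    """(Fidelity+ AUC - Fidelity- AUC + 1) / 2."""
    pairs = [fidelity_at_rho(model, samples, labels, all_scores, r) for r in rhos]
    plus_auc = curve_auc(rhos, [p for p, _ in pairs])
    minus_auc = curve_auc(rhos, [m for _, m in pairs])
    return (plus_auc - minus_auc + 1) / 2


def toy_model(points):
    """Hypothetical classifier: label 1 if any point feature exceeds 0.5."""
    return int(any(p > 0.5 for p in points))


samples = [[0.9, 0.1, 0.2, 0.1, 0.3], [0.2, 0.8, 0.1, 0.1, 0.4]]
labels = [1, 1]
good_scores = [[1.0, 0.1, 0.2, 0.0, 0.3], [0.1, 0.9, 0.0, 0.2, 0.3]]
bad_scores = [[0.0, 1.0, 0.9, 0.8, 0.7], [0.9, 0.0, 1.0, 0.8, 0.7]]
rhos = [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

good = fidelity_auc(toy_model, samples, labels, good_scores, rhos)
bad = fidelity_auc(toy_model, samples, labels, bad_scores, rhos)
print(good, bad)
```

An explainer that ranks the truly influential point first obtains a Fidelity AUC close to 1, whereas one that ranks it last falls well below 0.5.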

4.3 More Specifics on Experiment Settings

Evaluation Metrics. To evaluate whether the model interpretations are aligned with sensitive patterns, we first identify a subset of critical points $C_{s}$ by selecting the top-ranked points according to the importance score assigned to each point. The number of points to be selected is determined by the selection ratio $\rho$ , e.g., for $\rho=0.2$ , the top-ranked 20% critical points in $C$ will form $C_{s}$ . Then, we compute Fidelity+@ $\rho$ and Fidelity-@ $\rho$ [21, 23] to measure the impact of $C_{s}$ on the model’s predictive behavior by collecting the changes in prediction outputs when only the unimportant part $C\backslash C_{s}$ (Fidelity+) or only the important part $C_{s}$ (Fidelity-) is included as input. For a more holistic evaluation of the impact of $C_{s}$ with different sizes, we compute Fidelity+@ $\rho$ and Fidelity-@ $\rho$ at different $\rho\in\{0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9\}$ , each resulting in a curve with the x-axis being the value of $\rho$ and the y-axis being the corresponding Fidelity+@ $\rho$ or Fidelity-@ $\rho$ . Fidelity+ AUC and Fidelity- AUC are then computed as the area under each curve. Finally, Fidelity AUC is obtained by combining Fidelity+ AUC and Fidelity- AUC. The formal definitions of these metrics are summarized in Table 4. To evaluate the identification of decisive patterns, we directly compare the obtained importance score for each point (i.e., $\mathcal{W}$ ) with the labeled ground-truth decisive patterns (i.e., $\mathcal{I}$ ) and compute ROC-AUC, reported as Interpretation ROC-AUC.
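
As a minimal sketch of the Interpretation ROC-AUC, the snippet below scores a single point cloud by comparing importance scores with binary ground-truth labels through the pairwise (Mann-Whitney) form of ROC-AUC, counting ties as one half. Computing the metric per sample, as done here, is an assumption of the sketch, and the function name `interpretation_roc_auc` is hypothetical; library routines such as scikit-learn's `roc_auc_score` compute the same quantity.

```python
def interpretation_roc_auc(gt_labels, scores):
    """ROC-AUC of importance scores against binary ground-truth point labels.

    Equals the probability that a randomly chosen decisive point is scored
    higher than a randomly chosen non-decisive point (ties count as 1/2).
    """
    pos = [s for s, y in zip(scores, gt_labels) if y == 1]
    neg = [s for s, y in zip(scores, gt_labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one decisive and one non-decisive point")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


gt = [1, 1, 0, 0, 0]                 # points 0 and 1 are decisive
scores = [0.9, 0.4, 0.5, 0.2, 0.1]   # one non-decisive point outranks point 1
auc = interpretation_roc_auc(gt, scores)
print(auc)  # 5 of the 6 (decisive, non-decisive) pairs are ordered correctly
```

The quadratic pairwise loop is kept for readability; a rank-based implementation would be preferable for large point clouds.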

Data and software availability

Datasets used in this study are freely available on Zenodo at https://doi.org/10.5281/zenodo.7265547. The source code of this study is publicly available on GitHub at https://github.com/Graph-COM/xgdl.

Acknowledgements

This work is supported by the National Science Foundation (NSF) awards PHY-2117997 and IIS-2239565.

References

[1] K. T. Butler, D. W. Davies, H. Cartwright, O. Isayev, and A. Walsh, “Machine learning for molecular and materials science,” Nature, vol. 559, no. 7715, pp. 547–555, 2018.
[2] G. Carleo, I. Cirac, K. Cranmer, L. Daudet, M. Schuld, N. Tishby, L. Vogt-Maranto, and L. Zdeborová, “Machine learning and the physical sciences,” Reviews of Modern Physics, vol. 91, no. 4, p. 045002, 2019.
[3] S. Zhong, K. Zhang, M. Bagheri, J. G. Burken, A. Gu, B. Li, X. Ma, B. L. Marrone, Z. J. Ren, J. Schrier, et al., “Machine learning: new ideas and tools in environmental science and engineering,” Environmental Science & Technology, vol. 55, no. 19, pp. 12741–12754, 2021.
[4] K. J. Bergen, P. A. Johnson, M. V. de Hoop, and G. C. Beroza, “Machine learning for data-driven discovery in solid earth geoscience,” Science, vol. 363, no. 6433, p. eaau0323, 2019.
[5] H. Qu and L. Gouskos, “Jet tagging via particle clouds,” Physical Review D, 2020.
[6] X. Ju, D. Murnane, P. Calafiura, N. Choma, S. Conlon, S. Farrell, Y. Xu, M. Spiropulu, J.-R. Vlimant, A. Aurisano, et al., “Performance of a geometric deep learning pipeline for hl-lhc particle tracking,” The European Physical Journal C, vol. 81, pp. 1–14, 2021.
[7] P. Gainza, F. Sverrisson, F. Monti, E. Rodola, D. Boscaini, M. Bronstein, and B. Correia, “Deciphering interaction fingerprints from protein molecular surfaces using geometric deep learning,” Nature Methods, 2020.
[8] H. Stärk, O. Ganea, L. Pattanaik, R. Barzilay, and T. Jaakkola, “Equibind: Geometric deep learning for drug binding structure prediction,” in International conference on machine learning, pp. 20503–20521, PMLR, 2022.
[9] Y.-L. Liao, B. Wood, A. Das, and T. Smidt, “Equiformerv2: Improved equivariant transformer for scaling to higher-degree representations,” arxiv:2306.12059, 2023.
[10] G. Zhou, Z. Gao, Q. Ding, H. Zheng, H. Xu, Z. Wei, L. Zhang, and G. Ke, “Uni-mol: a universal 3d molecular representation learning framework,” 2023.
[11] K. Schütt, P.-J. Kindermans, H. E. Sauceda Felix, S. Chmiela, A. Tkatchenko, and K.-R. Müller, “Schnet: A continuous-filter convolutional neural network for modeling quantum interactions,” Advances in neural information processing systems, vol. 30, 2017.
[12] B. Jing, S. Eismann, P. Suriana, R. J. L. Townshend, and R. O. Dror, “Learning from protein structure with geometric vector perceptrons,” in 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021, 2021.
[13] A. Bogatskiy, B. Anderson, J. Offermann, M. Roussi, D. Miller, and R. Kondor, “Lorentz group equivariant neural network for particle physics,” in International Conference on Machine Learning, pp. 992–1002, PMLR, 2020.
[14] A. Sanchez-Gonzalez, J. Godwin, T. Pfaff, R. Ying, J. Leskovec, and P. Battaglia, “Learning to simulate complex physics with graph networks,” in International conference on machine learning, pp. 8459–8468, PMLR, 2020.
[15] F. Doshi-Velez and B. Kim, “Towards a rigorous science of interpretable machine learning,” arxiv:1702.08608, 2017.
[16] E. Puiutta and E. M. Veith, “Explainable reinforcement learning: A survey,” in International cross-domain conference for machine learning and knowledge extraction, pp. 77–95, Springer, 2020.
[17] A. Madsen, S. Reddy, and S. Chandar, “Post-hoc interpretability for neural nlp: A survey,” ACM Computing Surveys, vol. 55, no. 8, pp. 1–42, 2022.
[18] M. Danilevsky, K. Qian, R. Aharonov, Y. Katsis, B. Kawas, and P. Sen, “A survey of the state of explainable AI for natural language processing,” in Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, AACL/IJCNLP 2020, Suzhou, China, December 4-7, 2020 (K. Wong, K. Knight, and H. Wu, eds.), pp. 447–459, Association for Computational Linguistics, 2020.
[19] A. Jacovi and Y. Goldberg, “Towards faithfully interpretable NLP systems: How should we define and evaluate faithfulness?,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020 (D. Jurafsky, J. Chai, N. Schluter, and J. R. Tetreault, eds.), pp. 4198–4205, Association for Computational Linguistics, 2020.
[20] P. Linardatos, V. Papastefanopoulos, and S. Kotsiantis, “Explainable ai: A review of machine learning interpretability methods,” Entropy, vol. 23, no. 1, p. 18, 2020.
[21] H. Yuan, H. Yu, J. Wang, K. Li, and S. Ji, “On explainability of graph neural networks via subgraph explorations,” in International conference on machine learning, pp. 12241–12252, PMLR, 2021.
[22] A. Lucic, M. A. Ter Hoeve, G. Tolomei, M. De Rijke, and F. Silvestri, “Cf-gnnexplainer: Counterfactual explanations for graph neural networks,” in International Conference on Artificial Intelligence and Statistics, pp. 4499–4511, PMLR, 2022.
[23] H. Yuan, H. Yu, S. Gui, and S. Ji, “Explainability in graph neural networks: A taxonomic survey,” IEEE transactions on pattern analysis and machine intelligence, vol. 45, no. 5, pp. 5782–5799, 2022.
[24] M. Vu and M. T. Thai, “Pgm-explainer: Probabilistic graphical model explanations for graph neural networks,” Advances in neural information processing systems, vol. 33, pp. 12225–12235, 2020.
[25] S. Miao, M. Liu, and P. Li, “Interpretable and generalizable graph learning via stochastic attention mechanism,” International Conference on Machine Learning, 2022.
[26] D. Luo, W. Cheng, D. Xu, W. Yu, B. Zong, H. Chen, and X. Zhang, “Parameterized explainer for graph neural network,” Advances in Neural Information Processing Systems, 2020.
[27] A. B. Arrieta, N. Díaz-Rodríguez, J. Del Ser, A. Bennetot, S. Tabik, A. Barbado, S. García, S. Gil-López, D. Molina, R. Benjamins, et al., “Explainable artificial intelligence (xai): Concepts, taxonomies, opportunities and challenges toward responsible ai,” Information fusion, vol. 58, pp. 82–115, 2020.
[28] C. Rudin, “Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead,” Nature machine intelligence, vol. 1, no. 5, pp. 206–215, 2019.
[29] T. Laugel, M.-J. Lesot, C. Marsala, X. Renard, and M. Detyniecki, “The dangers of post-hoc interpretability: Unjustified counterfactual explanations,” International Joint Conference on Artificial Intelligence, 2019.
[30] K. Amara, Z. Ying, Z. Zhang, Z. Han, Y. Zhao, Y. Shan, U. Brandes, S. Schemm, and C. Zhang, “Graphframex: Towards systematic evaluation of explainability methods for graph neural networks,” in Learning on Graphs Conference, LoG 2022, 9-12 December 2022, Virtual Event (B. Rieck and R. Pascanu, eds.), vol. 198 of Proceedings of Machine Learning Research, p. 44, PMLR, 2022.
[31] B. Sanchez-Lengeling, J. Wei, B. Lee, E. Reif, P. Wang, W. Qian, K. McCloskey, L. Colwell, and A. Wiltschko, “Evaluating attribution for graph neural networks,” Advances in neural information processing systems, vol. 33, pp. 5898–5910, 2020.
[32] A. Longa, S. Azzolin, G. Santin, G. Cencetti, P. Liò, B. Lepri, and A. Passerini, “Explaining the explainers in graph neural networks: a comparative study,” arxiv:2210.15304, 2022.
[33] J. Chen, K. Amara, J. Yu, and R. Ying, “Generative explanation for graph neural network: Methods and evaluation,” IEEE Data Eng. Bull., vol. 46, no. 2, pp. 64–79, 2023.
[34] J. Chen, S. Wu, A. Gupta, and R. Ying, “D4explainer: In-distribution explanations of graph neural network via discrete denoising diffusion,” Advances in Neural Information Processing Systems, vol. 36, 2024.
[35] J. Adebayo, M. Muelly, H. Abelson, and B. Kim, “Post hoc explanations may be ineffective for detecting unknown spurious correlation,” in International conference on learning representations, 2021.
[36] D. Slack, A. Hilgard, S. Singh, and H. Lakkaraju, “Reliable post hoc explanations: Modeling uncertainty in explainability,” Advances in neural information processing systems, vol. 34, pp. 9391–9404, 2021.
[37] N. Bui, H. T. Nguyen, V. A. Nguyen, and R. Ying, “Explaining graph neural networks via structure-aware interaction index,” International Conference on Machine Learning, 2024.
[38] S. Serrano and N. A. Smith, “Is attention interpretable?,” in Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers (A. Korhonen, D. R. Traum, and L. Màrquez, eds.), pp. 2931–2951, Association for Computational Linguistics, 2019.
[39] S. Wiegreffe and Y. Pinter, “Attention is not not explanation,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019 (K. Inui, J. Jiang, V. Ng, and X. Wan, eds.), pp. 11–20, Association for Computational Linguistics, 2019.
[40] A. K. Mohankumar, P. Nema, S. Narasimhan, M. M. Khapra, B. V. Srinivasan, and B. Ravindran, “Towards transparent and explainable attention models,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020 (D. Jurafsky, J. Chai, N. Schluter, and J. R. Tetreault, eds.), pp. 4206–4216, Association for Computational Linguistics, 2020.
[41] B. Bai, J. Liang, G. Zhang, H. Li, K. Bai, and F. Wang, “Why attentions may not be interpretable?,” in Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pp. 25–34, 2021.
[42] C. Wang, B. Han, B. Patel, and C. Rudin, “In pursuit of interpretable, fair and accurate machine learning for criminal recidivism prediction,” Journal of Quantitative Criminology, vol. 39, no. 2, pp. 519–581, 2023.
[43] Y. Li, J. Zhou, S. Verma, and F. Chen, “A survey of explainable graph neural networks: Taxonomy and evaluation metrics,” arxiv:2207.12599, 2022.
[44] B. Wu, J. Li, J. Yu, Y. Bian, H. Zhang, C. Chen, C. Hou, G. Fu, L. Chen, T. Xu, Y. Rong, X. Zheng, J. Huang, R. He, B. Wu, G. Sun, P. Cui, Z. Zheng, Z. Liu, and P. Zhao, “A survey of trustworthy graph learning: Reliability, explainability, and privacy protection,” arxiv:2205.10014, 2022.
[45] J. Kakkad, J. Jannu, K. Sharma, C. Aggarwal, and S. Medya, “A survey on explainability of graph neural networks,” IEEE Data Eng. Bull., vol. 46, no. 2, pp. 35–63, 2023.
[46] H. Zhang, B. Wu, X. Yuan, S. Pan, H. Tong, and J. Pei, “Trustworthy graph neural networks: Aspects, methods, and trends,” Proceedings of the IEEE, vol. 112, pp. 97–139, February 2024.
[47] S. Miao, Y. Luo, M. Liu, and P. Li, “Interpretable geometric deep learning via learnable randomness injection,” in The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023, 2023.
[48] Z. Ying, D. Bourgeois, J. You, M. Zitnik, and J. Leskovec, “Gnnexplainer: Generating explanations for graph neural networks,” Advances in Neural Information Processing Systems, 2019.
[49] X. Ai, C. Allaire, N. Calace, A. Czirkos, M. Elsing, I. Ene, R. Farkas, L.-G. Gagnon, R. Garg, P. Gessinger, et al., “A common tracking software project,” Computing and Software for Big Science, 2022.
[50] K. De Leo et al., “Search for the lepton flavor violating $\tau\to 3\mu$ decay in proton-proton collisions at $\sqrt{s}=13$ TeV,” Physics Letters B, vol. 853, pp. 1–28, 2024.
[51] K. McCloskey, A. Taly, F. Monti, M. P. Brenner, and L. J. Colwell, “Using attribution to decode binding mechanism in neural network models for chemistry,” Proceedings of the National Academy of Sciences, 2019.
[52] R. Wang, X. Fang, Y. Lu, C.-Y. Yang, and S. Wang, “The pdbbind database: methodologies and updates,” Journal of medicinal chemistry, vol. 48, no. 12, pp. 4111–4119, 2005.
[53] J. Chen and R. Ying, “Tempme: Towards the explainability of temporal graph neural networks via motif discovery,” Advances in Neural Information Processing Systems, vol. 36, 2024.
[54] M. Thomson, Modern particle physics. Cambridge University Press, 2013.
[55] R. Oerter, The theory of almost everything: The standard model, the unsung triumph of modern physics. Penguin, 2006.
[56] P. Blackstone, M. Fael, and E. Passemar, “$\tau\rightarrow\mu\mu\mu$ at a rate of one out of $10^{14}$ tau decays?,” The European Physical Journal C, 2020.
[57] L. Calibbi and G. Signorelli, “Charged lepton flavour violation: an experimental and theoretical introduction,” La Rivista del Nuovo Cimento, 2018.
[58] ATLAS Collaboration, “Search for charged-lepton-flavour violation in Z-boson decays with the ATLAS detector,” Nature Physics, 2021.
[59] F. A. Faber, L. Hutchison, B. Huang, J. Gilmer, S. S. Schoenholz, G. E. Dahl, O. Vinyals, S. Kearnes, P. F. Riley, and O. A. Von Lilienfeld, “Prediction errors of molecular machine learning models lower than hybrid dft error,” Journal of chemical theory and computation, vol. 13, no. 11, pp. 5255–5264, 2017.
[60] C. Wang and Y. Zhang, “Improving scoring-docking-screening powers of protein–ligand scoring functions using random forest,” Journal of Computational Chemistry, 2017.
[61] M. Karimi, D. Wu, Z. Wang, and Y. Shen, “Deepaffinity: interpretable deep learning of compound–protein affinity through unified recurrent and convolutional neural networks,” Bioinformatics, vol. 35, no. 18, pp. 3329–3338, 2019.
[62] V. G. Satorras, E. Hoogeboom, and M. Welling, “E (n) equivariant graph neural networks,” International conference on machine learning, 2021.
[63] Y. Wang, Y. Sun, Z. Liu, S. E. Sarma, M. M. Bronstein, and J. M. Solomon, “Dynamic graph cnn for learning on point clouds,” ACM Transactions on Graphics, 2019.
[64] H. Zhao, L. Jiang, J. Jia, P. H. Torr, and V. Koltun, “Point transformer,” IEEE International Conference on Computer Vision, 2021.
[65] K. Atz, F. Grisoni, and G. Schneider, “Geometric deep learning on molecular representations,” Nature Machine Intelligence, 2021.
[66] L. Gagliardi, A. Raffo, U. Fugacci, S. Biasotti, W. Rocchia, H. Huang, B. B. Amor, Y. Fang, Y. Zhang, X. Wang, et al., “Shrec 2022: Protein–ligand binding site recognition,” Computers & Graphics, 2022.
[67] A. Shrikumar, P. Greenside, and A. Kundaje, “Learning important features through propagating activation differences,” International Conference on Machine Learning, 2017.
[68] A. Chattopadhay, A. Sarkar, P. Howlader, and V. N. Balasubramanian, “Grad-cam++: Generalized gradient-based visual explanations for deep convolutional networks,” IEEE Winter Conference on Applications of Computer Vision, 2018.
[69] M. Sundararajan, A. Taly, and Q. Yan, “Axiomatic attribution for deep networks,” in International conference on machine learning, pp. 3319–3328, PMLR, 2017.
[70] T. Schnake, O. Eberle, J. Lederer, S. Nakajima, K. T. Schütt, K.-R. Müller, and G. Montavon, “Higher-order explanations of graph neural networks via relevant walks,” IEEE transactions on pattern analysis and machine intelligence, vol. 44, no. 11, pp. 7581–7596, 2021.
[71] E. Ranjan, S. Sanyal, and P. Talukdar, “Asap: Adaptive structure aware pooling for learning hierarchical graph representations,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 5470–5477, 2020.
[72] N. Tishby, F. C. Pereira, and W. Bialek, “The information bottleneck method,” arXiv preprint physics/0004057, 2000.
[73] J. Pearl, Causality. Cambridge university press, 2009.
[74] J. Yu, J. Cao, and R. He, “Improving subgraph recognition with variational graph information bottleneck,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19396–19405, 2022.
[75] Y. Chen, Y. Zhang, Y. Bian, H. Yang, M. Kaili, B. Xie, T. Liu, B. Han, and J. Cheng, “Learning causally invariant representations for out-of-distribution generalization on graphs,” Advances in Neural Information Processing Systems, vol. 35, pp. 22131–22148, 2022.
[76] J. Dai, S. Upadhyay, U. Aivodji, S. H. Bach, and H. Lakkaraju, “Fairness via explanation quality: Evaluating disparities in the quality of post hoc explanations,” in Proceedings of the 2022 AAAI/ACM Conference on AI, Ethics, and Society, pp. 203–214, 2022.
[77] C. Agarwal, O. Queen, H. Lakkaraju, and M. Zitnik, “Evaluating explainability for graph neural networks,” Scientific Data, vol. 10, no. 1, p. 144, 2023.
[78] B. Knyazev, G. W. Taylor, and M. Amer, “Understanding attention and generalization in graph neural networks,” Advances in neural information processing systems, vol. 32, 2019.
[79] F. Baldassarre and H. Azizpour, “Explainability techniques for graph convolutional networks,” in International Conference on Machine Learning (ICML) Workshops, 2019 Workshop on Learning and Reasoning with Graph-Structured Representations, 2019.
[80] G. Montavon, A. Binder, S. Lapuschkin, W. Samek, and K.-R. Müller, “Layer-wise relevance propagation: an overview,” Explainable AI: interpreting, explaining and visualizing deep learning, pp. 193–209, 2019.
[81] P. Xiong, T. Schnake, G. Montavon, K.-R. Müller, and S. Nakajima, “Efficient computation of higher-order subgraph attribution via message passing,” in International Conference on Machine Learning, pp. 24478–24495, PMLR, 2022.
[82] P. E. Pope, S. Kolouri, M. Rostami, C. E. Martin, and H. Hoffmann, “Explainability methods for graph convolutional neural networks,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 10772–10781, 2019.
[83] M. S. Schlichtkrull, N. D. Cao, and I. Titov, “Interpreting graph neural networks for NLP with differentiable edge masking,” in 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021, 2021.
[84] T. Funke, M. Khosla, and A. Anand, “Hard masking for explaining graph neural networks,” 2021.
[85] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, et al., “Mastering the game of go without human knowledge,” Nature, vol. 550, no. 7676, pp. 354–359, 2017.
[86] Q. Huang, M. Yamada, Y. Tian, D. Singh, and Y. Chang, “Graphlime: Local interpretable model explanations for graph neural networks,” IEEE Trans. Knowl. Data Eng., vol. 35, no. 7, pp. 6968–6972, 2023.
[87] Y. Zhang, D. Defazio, and A. Ramesh, “Relex: A model-agnostic relational model explainer,” in Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society, pp. 1042–1049, 2021.
[88] P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio, “Graph attention networks,” in 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, 2018.
[89] L. Ma, R. Rabbany, and A. Romero-Soriano, “Graph attention networks with positional embeddings,” in Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 514–527, Springer, 2021.
[90] J. Yu, T. Xu, Y. Rong, Y. Bian, J. Huang, and R. He, “Graph information bottleneck for subgraph recognition,” in 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021, 2021.
[91] Y. Wu, X. Wang, A. Zhang, X. He, and T. Chua, “Discovering invariant rationales for graph neural networks,” in The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022, 2022.

1 Classification Performance of the Trained Models

Most experiments in the main manuscript use 10 models trained with different seeds via empirical risk minimization (ERM) or objectives proposed by self-interpretable methods. As suggested by our findings, prediction performance may influence interpretation performance. Therefore, below we report these models’ prediction performance.

$\operatorname{SynMol}$	Classification Accuracy			Classification ROC-AUC
$\operatorname{SynMol}$	EGNN	DGCNN	PointTrans	EGNN	DGCNN	PointTrans
ERM	$99.45\pm 0.16$	$99.40\pm 0.25$	$94.63\pm 0.54$	$99.87\pm 0.08$	$99.99\pm 0.01$	$97.96\pm 0.40$
ASAP	$95.16\pm 8.04$	$98.66\pm 0.44$	$87.45\pm 5.32$	$93.07\pm 14.26$	$99.77\pm 0.09$	$80.67\pm 20.86$
CIGA	$89.27\pm 3.87$	$84.41\pm 2.02$	$83.49\pm 1.97$	$92.85\pm 3.93$	$84.55\pm 3.55$	$83.44\pm 1.62$
LRI-Bern	$98.72\pm 0.47$	$98.91\pm 0.30$	$93.29\pm 0.81$	$99.76\pm 0.07$	$99.87\pm 0.04$	$96.91\pm 0.82$
LRI-Gaussian	$99.08\pm 0.23$	$99.19\pm 0.16$	$93.68\pm 0.61$	$99.87\pm 0.03$	$99.93\pm 0.03$	$97.36\pm 0.52$
VGIB	$95.78\pm 5.06$	$91.11\pm 8.65$	$91.65\pm 1.22$	$99.75\pm 0.08$	$99.55\pm 0.19$	$96.10\pm 0.86$

$\operatorname{ActsTrack}$	Classification Accuracy			Classification ROC-AUC
$\operatorname{ActsTrack}$	EGNN	DGCNN	PointTrans	EGNN	DGCNN	PointTrans
ERM	$94.85\pm 0.71$	$94.93\pm 1.25$	$92.22\pm 1.30$	$98.74\pm 0.37$	$99.30\pm 0.16$	$98.06\pm 0.41$
ASAP	$93.14\pm 1.70$	$93.39\pm 2.10$	$92.79\pm 1.62$	$98.07\pm 0.99$	$98.90\pm 0.47$	$97.78\pm 0.57$
CIGA	$93.45\pm 2.19$	$84.70\pm 6.32$	$90.35\pm 8.13$	$98.32\pm 0.66$	$95.70\pm 2.37$	$97.79\pm 0.80$
LRI-Bern	$95.09\pm 0.76$	$94.46\pm 1.20$	$93.18\pm 1.32$	$98.72\pm 0.46$	$98.97\pm 0.26$	$98.31\pm 0.28$
LRI-Gaussian	$95.67\pm 0.84$	$94.70\pm 0.88$	$94.19\pm 2.04$	$99.39\pm 0.15$	$99.07\pm 0.27$	$98.88\pm 0.49$
VGIB	$49.14\pm 19.26$	$74.17\pm 14.35$	$90.66\pm 4.05$	$96.70\pm 1.01$	$98.65\pm 0.31$	$97.63\pm 0.92$

$\operatorname{Tau3Mu}$	Classification Accuracy			Classification ROC-AUC
$\operatorname{Tau3Mu}$	EGNN	DGCNN	PointTrans	EGNN	DGCNN	PointTrans
ERM	$82.52\pm 0.67$	$83.63\pm 0.18$	$82.54\pm 0.21$	$86.18\pm 1.04$	$87.45\pm 0.12$	$86.10\pm 0.20$
ASAP	$81.45\pm 1.03$	$79.15\pm 2.76$	$76.72\pm 0.83$	$84.27\pm 1.46$	$82.76\pm 2.79$	$78.27\pm 1.02$
CIGA	$80.20\pm 0.99$	$81.77\pm 1.63$	$81.74\pm 0.53$	$85.21\pm 0.30$	$85.50\pm 0.47$	$85.49\pm 0.35$
LRI-Bern	$82.77\pm 0.13$	$83.47\pm 0.28$	$82.72\pm 0.25$	$86.41\pm 0.11$	$87.30\pm 0.10$	$86.49\pm 0.10$
LRI-Gaussian	$83.07\pm 0.16$	$83.88\pm 0.20$	$82.60\pm 0.29$	$86.72\pm 0.07$	$87.64\pm 0.17$	$85.88\pm 0.34$
VGIB	$82.12\pm 0.54$	$82.94\pm 0.57$	$81.52\pm 0.59$	$86.11\pm 0.17$	$87.04\pm 0.14$	$84.59\pm 0.61$

$\operatorname{PLBind}$	Classification Accuracy			Classification ROC-AUC
$\operatorname{PLBind}$	EGNN	DGCNN	PointTrans	EGNN	DGCNN	PointTrans
ERM	$85.59\pm 1.66$	$82.65\pm 2.87$	$81.76\pm 3.78$	$88.05\pm 1.44$	$82.65\pm 5.46$	$81.57\pm 2.47$
ASAP	$84.78\pm 1.14$	$84.16\pm 2.00$	$82.98\pm 2.67$	$84.19\pm 1.75$	$82.23\pm 1.89$	$78.89\pm 8.07$
CIGA	$83.71\pm 2.39$	$82.33\pm 2.36$	$84.37\pm 1.35$	$82.60\pm 1.48$	$82.51\pm 2.12$	$82.37\pm 2.10$
LRI-Bern	$85.10\pm 1.85$	$84.65\pm 2.23$	$82.57\pm 2.04$	$85.49\pm 3.77$	$84.70\pm 2.50$	$85.22\pm 1.99$
LRI-Gaussian	$85.80\pm 1.33$	$83.35\pm 3.32$	$84.29\pm 1.77$	$89.17\pm 1.40$	$84.43\pm 3.93$	$86.26\pm 1.81$
VGIB	$82.29\pm 2.14$	$77.51\pm 8.65$	$78.24\pm 5.39$	$78.44\pm 4.24$	$78.72\pm 2.56$	$79.18\pm 3.26$

Table 1: Prediction performance of trained models.

2 Misalignment of the Two Patterns

In the main manuscript, we discussed that the relatively low and highly variable Fidelity AUC may indicate a misalignment between the two patterns. Moreover, the Decisive-Induced Fidelity AUC is significantly lower than the highest Fidelity AUC achieved by post-hoc methods, thereby reinforcing our claim.

Dataset	Best Fidelity AUC			Decisive-Induced Fidelity AUC
Dataset	EGNN	DGCNN	PointTrans	EGNN	DGCNN	PointTrans
$\operatorname{SynMol}$	$77.77$	$24.27$	$63.94$	$61.91\pm 10.95$	$1.75\pm 3.38$	$17.62\pm 3.79$
$\operatorname{ActsTrack}$	$90.12$	$84.76$	$93.69$	$85.46\pm 2.48$	$58.48\pm 6.05$	$70.97\pm 2.81$
$\operatorname{Tau3Mu}$	$24.71$	$27.43$	$25.94$	$17.04\pm 2.89$	$21.81\pm 1.93$	$20.07\pm 1.18$
$\operatorname{PLBind}$	$72.86$	$53.66$	$57.29$	$20.71\pm 6.68$	$20.31\pm 4.66$	$8.46\pm 5.71$

Table 2: The degree of alignment between the two patterns measured by Decisive-Induced Fidelity AUC. The best Fidelity AUC is from the best-performing post-hoc methods benchmarked in the main manuscript.

3 Ensemble Strategy Improves the Alignment

In the main manuscript, we demonstrated that our ensemble strategy, which combines multiple pre-trained models, effectively enhances the alignment between post-hoc interpretations and the decisive patterns. To differentiate our approach from the naive ensemble methods commonly used in machine learning, we conducted a comparative study with a naive ensemble over multiple explainers obtained with different seeds for a single trained model (instead of different trained models). The comparison supports our claim that the success of our ensemble strategy stems from the overlap of sensitive patterns across different already-trained models, which yields interpretations more aligned with decisive patterns.

SynMol	Ensemble Model			Ensemble Explainer
SynMol	EGNN	DGCNN	PointTrans	EGNN	DGCNN	PointTrans
GNNExplainer	$70.98$	$45.49$	$39.92$	$58.77\pm 16.94$	$51.14\pm 6.37$	$26.86\pm 2.93$
PGExplainer	$96.20$	$94.69$	$90.51$	$81.36\pm 22.98$	$42.49\pm 39.38$	$89.14\pm 1.11$
PGM-Explainer	$68.83$	$53.31$	$63.94$	$67.84\pm 1.53$	$51.55\pm 2.33$	$62.06\pm 1.42$
SubgraphX	$92.32$	$76.84$	$82.38$	$90.29\pm 1.30$	$71.12\pm 8.35$	$79.11\pm 1.26$

ActsTrack	Ensemble Model			Ensemble Explainer
ActsTrack	EGNN	DGCNN	PointTrans	EGNN	DGCNN	PointTrans
GNNExplainer	$62.47$	$75.73$	$80.08$	$52.03\pm 4.59$	$64.43\pm 4.07$	$71.57\pm 2.66$
PGExplainer	$63.59$	$95.01$	$91.71$	$34.95\pm 26.61$	$92.91\pm 1.65$	$89.19\pm 2.17$
PGM-Explainer	$70.26$	$58.52$	$61.58$	$68.28\pm 1.25$	$56.84\pm 2.14$	$59.13\pm 2.13$
SubgraphX	$64.60$	$60.21$	$63.93$	$62.64\pm 2.28$	$59.82\pm 1.20$	$63.02\pm 1.06$

Tau3Mu	Ensemble Model			Ensemble Explainer
Tau3Mu	EGNN	DGCNN	PointTrans	EGNN	DGCNN	PointTrans
GNNExplainer	$77.52$	$57.22$	$41.64$	$72.17\pm 2.44$	$52.36\pm 3.33$	$31.14\pm 0.80$
PGExplainer	$78.60$	$71.90$	$80.01$	$76.67\pm 1.57$	$51.23\pm 20.35$	$79.28\pm 0.29$

Table 3: Interpretation ROC-AUC performance of different ensemble schemes for post-hoc methods. Ensemble Model refers to the setting reported in our main manuscript, where the explainers are trained with the same seed as the models. Ensemble Explainer refers to a setting where the already-trained model is fixed (trained with seed 0), and we ensemble multiple explainers trained with different seeds. The results are reported as mean $\pm$ standard deviation.
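To make the difference between the two settings more concrete, the sketch below averages per-point importance scores from several sources and evaluates the result with a pairwise ROC-AUC against ground-truth labels. It is only an illustration under stated assumptions: the aggregation rule (min-max normalization followed by a plain average) is assumed purely for this example and is not necessarily the rule used in the experiments, and the scores are made-up toy values, not results from this study. Passing scores obtained from different trained models corresponds to Ensemble Model, while passing scores from explainers run with different seeds on one fixed model corresponds to Ensemble Explainer.

```python
from typing import List, Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Pairwise (Mann-Whitney) ROC-AUC; ties between a positive and a negative count as 0.5."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative point")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def min_max(scores: Sequence[float]) -> List[float]:
    """Rescale scores to [0, 1] so that sources on different scales are comparable."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.5] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def ensemble_scores(score_lists: Sequence[Sequence[float]]) -> List[float]:
    """Average normalized per-point scores across sources (assumed aggregation rule)."""
    normed = [min_max(s) for s in score_lists]
    return [sum(col) / len(col) for col in zip(*normed)]


# Toy usage: points 0 and 1 are the labeled decisive pattern.
labels = [1, 1, 0, 0]
source_a = [9, 2, 8, 1]  # hypothetical scores from one model/explainer
source_b = [2, 9, 1, 8]  # hypothetical scores from another
print(roc_auc(source_a, labels),
      roc_auc(source_b, labels),
      roc_auc(ensemble_scores([source_a, source_b]), labels))
```

In this toy case each source alone ranks one decisive point below an irrelevant one, and averaging lets the signal shared by both sources dominate.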

4 Post-Hoc Interpretations on Self-Interpretable Models

In the main manuscript, we demonstrated that post-hoc methods can produce interpretations that align better with decisive patterns when applied to self-interpretable models rather than vanilla models trained with ERM, indicating that sensitive patterns of self-interpretable models may align well with decisive patterns. For simplicity, we only showed the average results across three backbones, and below are the complete results.

Model & Explainer		$\operatorname{SynMol}$			$\operatorname{ActsTrack}$
Model & Explainer		EGNN	DGCNN	Point Transformer	EGNN	DGCNN	Point Transformer
Pre-trained Models	GNNLRP	$81.75\pm 4.01$	$84.61\pm 3.81$	$50.38\pm 1.68$	$86.01\pm 2.31$	$86.40\pm 5.08$	$50.2\pm 1.93$
	GradCAM	$57.82\pm 4.42$	$78.89\pm 3.84$	$84.1\pm 3.66$	$69.38\pm 2.72$	$75.25\pm 3.67$	$77.32\pm 2.83$
	GradxInput	$76.03\pm 4.82$	$71.39\pm 5.89$	$78.03\pm 1.52$	$68.74\pm 1.84$	$65.17\pm 1.56$	$64.78\pm 1.9$
	IG	$78.59\pm 7.83$	$64.31\pm 9.14$	$84.23\pm 1.83$	$68.78\pm 1.82$	$65.27\pm 1.47$	$64.8\pm 1.88$
	GNNExplainer	$58.94\pm 15.89$	$51.03\pm 5.58$	$26.28\pm 2.70$	$51.77\pm 4.41$	$64.34\pm 4.05$	$71.38\pm 2.61$
	PGExplainer	$77.92\pm 22.04$	$49.56\pm 39.82$	$87.41\pm 2.66$	$33.54\pm 23.17$	$92.63\pm 1.57$	$88.39\pm 3.13$
LRI-Bern Induced Models	GNNLRP	$80.96\pm 3.88$	$84.33\pm 4.28$	$50.39\pm 1.37$	$88.06\pm 4.24$	$89.16\pm 3.04$	$50.21\pm 1.13$
	GradCAM	$57.03\pm 8.16$	$78.08\pm 4.73$	$85.32\pm 1.52$	$68.94\pm 3.63$	$80.7\pm 2.32$	$82.33\pm 2.87$
	GradxInput	$87.85\pm 6.86$	$82.49\pm 6.16$	$78.87\pm 1.06$	$70.56\pm 2.15$	$63.71\pm 1.38$	$65.3\pm 1.66$
	IG	$75.27\pm 9.96$	$86.83\pm 4.64$	$84.21\pm 2.45$	$70.68\pm 2.15$	$63.82\pm 1.41$	$65.32\pm 1.66$
	GNNExplainer	$68.92\pm 0.15$	$46.66\pm 0.11$	$25.90\pm 0.03$	$32.56\pm 0.04$	$77.55\pm 0.04$	$80.13\pm 0.02$
	PGExplainer	$78.89\pm 0.38$	$74.85\pm 0.33$	$79.35\pm 0.25$	$78.60\pm 0.12$	$\underline{93.51}\pm 0.01$	$\underline{90.16}\pm 0.01$
	Self	$92.04\pm 3.0$	$94.2\pm 4.53$	$90.46\pm 1.21$	$80.97\pm 2.07$	$90.74\pm 1.72$	$86.84\pm 1.85$
LRI-Gaussian Induced Models	GNNLRP	$85.37\pm 1.16$	$86.41\pm 1.37$	$50.47\pm 1.69$	$90.48\pm 3.88$	$88.66\pm 4.33$	$50.08\pm 0.92$
	GradCAM	$69.01\pm 7.36$	$78.27\pm 3.47$	$78.18\pm 4.79$	$82.3\pm 5.35$	$66.98\pm 4.46$	$83.2\pm 2.43$
	GradxInput	$83.98\pm 7.0$	$85.72\pm 3.2$	$80.52\pm 2.46$	$73.41\pm 1.87$	$74.52\pm 1.15$	$75.52\pm 3.63$
	IG	$88.73\pm 2.1$	$87.21\pm 5.86$	$82.48\pm 1.85$	$73.42\pm 1.77$	$74.55\pm 1.17$	$75.54\pm 3.64$
	GNNExplainer	$49.79\pm 0.01$	$49.88\pm 0.01$	$49.84\pm 0.01$	$49.82\pm 0.01$	$50.05\pm 0.01$	$50.05\pm 0.01$
	PGExplainer	$36.53\pm 0.17$	$51.87\pm 0.12$	$37.97\pm 0.16$	$83.81\pm 0.07$	$87.47\pm 0.04$	$85.50\pm 0.07$
	Self	$97.13\pm 0.79$	$98.23\pm 1.0$	$93.06\pm 1.19$	$92.93\pm 1.58$	$94.18\pm 0.88$	$91.85\pm 1.15$

Table 4: Interpretation ROC-AUC computed using the interpretations given by post-hoc methods for self-interpretable models trained by LRI-induced methods.