PE: A Poincare Explanation Method for Fast Text Hierarchy Generation

Qian Chen1, Dongyang Li1, Xiaofeng He1,2 , Hongzhao Li3, Hongyu Yi3
1 School of Computer Science and Technology, East China Normal University, Shanghai, China
2 NPPA Key Laboratory of Publishing Integration Development, ECNUP, Shanghai, China
3 Sichuan Caizi Software Information Network Co., Ltd
{qianchen901005,fromdongyang}@gmail.com, [email protected], [email protected],[email protected]
Abstract

The black-box nature of deep learning models in NLP hinders their widespread application. Research attention has shifted to Hierarchical Attribution (HA) for its ability to model feature interactions. Recent works model non-contiguous combinations with a time-consuming greedy search in Euclidean spaces, neglecting the linguistic information underlying feature representations. In this work, we introduce a novel method, Poincare Explanation (PE), which models feature interactions in hyperbolic spaces in a time-efficient manner. Specifically, we treat building text hierarchies as finding spanning trees in hyperbolic spaces. First, we project the embeddings into hyperbolic spaces to elicit their inherent semantic and syntactic hierarchical structures. Then, we propose a simple yet effective strategy to calculate Shapley scores. Finally, we build the hierarchy, prove that the construction process in the projected space can be viewed as building a minimum spanning tree, and introduce a time-efficient building algorithm. Experimental results demonstrate the effectiveness of our approach.

1 Introduction

Deep learning models have become ubiquitous in Natural Language Processing (NLP), accompanied by an explosion in parameter counts that leads to increased opaqueness. Consequently, a series of interpretability studies have emerged Abnar and Zuidema (2020); Geva et al. (2021); He et al. (2022). Among them, feature attribution methods stand out owing to their fidelity and loyalty axioms and their straightforward applicability Guidotti et al. (2018).

Previous feature-based works are limited to single words or phrases Miglani et al. (2020). However, Mardaoui and Garreau (2021) point out that the performance of LIME Ribeiro et al. (2016) on simple models is not plausible (a figure illustration is provided in Appendix E). To model feature interactions, Hierarchical Attribution (HA) Chen et al. (2020); Ju et al. (2023) has been introduced, with an attribution-then-cluster stage that constructs the feature interaction process by distributing text group scores at different levels (an HA example is provided in Appendix D). From the bottom up, HA groups all words into clusters, ending with a tree structure.

Figure 1: Pearson correlation $\rho$ results from ** et al. (2020) with BERT and LSTM on the SST-2 and Yelp datasets. A higher correlation coefficient indicates a stronger ability of the method to identify important words.

However, building feature hierarchies is not trivial. Existing methods have the following three problems. P-1: Detecting contiguous text spans as a substitute for all possible interactions Singh et al. (2019); Chen et al. (2020). Using only spans might lose long-range dependencies in text Vaswani et al. (2017). For example, in the positive example "Even in moments of sorrow, certain memories can evoke happiness", the pair (Even, sorrow) is vital and non-adjacent. P-2: Current algorithms for estimating the importance of feature combinations involve lengthy optimization processes Ju et al. (2023); Chen et al. (2020). For example, HE Ju et al. (2023) estimates the importance of words using the LIME algorithm and then enumerates word combinations to construct the hierarchy, with cubic time complexity (for convenience of comparison, we ignore the time taken by linear regression in the LIME algorithm; a detailed discussion is in Section 6). ASIV Lu et al. (2023) uses a directional Shapley value to model the direction of feature interactions, but estimating the Shapley value requires exponential time. P-3: Previous methods cannot model linguistic information, including syntactic and semantic information, both of which can help to construct a hierarchical tree. For syntax, ** et al. (2020) build hierarchies directly on Dependency Parsing Trees (DPT) and compute the Pearson correlation (i.e., $\rho$). The results in Figure 1 demonstrate that syntax can contribute to building explainable hierarchies by reaching a higher correlation. For semantics, we take Figure 2 as an example: the hierarchy in hyperbolic space already achieves preliminary interpretability, with proximity corresponding to polarity.

Figure 2: Left: The projection illustration for the positive example “It was an interesting but somewhat draggy movie.” The center represents the prototype for the positive label. Right: A negative example “It was a draggy but somewhat interesting movie.” The center point stands for the negative label.

As the input text length continues to increase, efficiently modeling the interaction of non-contiguous features has become a key challenge in promoting HA. Building a hierarchical attribution tree based on the input text is essentially a hierarchical clustering problem, defined as follows: given words and their pairwise similarities, the goal is to construct a hierarchy over clusters (word groups). PE approaches this problem in three steps. First, to model linguistic hierarchical information, we project word embeddings into hyperbolic spaces to uncover hidden semantic and syntactic structures. Next, inspired by cooperative game theory Owen (2013), we regard words as players and clusters as coalitions, and introduce a simple yet effective strategy to estimate the Shapley score contribution. Finally, we calculate pairwise similarities and propose an algorithm that conceptualizes the bottom-up clustering process as generating a minimum spanning tree.

Our contributions are summarized as follows:

  • We propose a method, PE, using hyperbolic geometry for generating hierarchical explanations, revealing the feature interaction process.

  • PE introduces a fast algorithm for generating hierarchical attribution trees that model non-contiguous feature interactions.

  • We evaluate the proposed method on three datasets with BERT Devlin et al. (2019), and the results demonstrate its effectiveness.

2 Related Work

Feature importance explanation methods mainly assign attribution scores to features Qiang et al. (2022); Ferrando et al. (2022); Modarressi et al. (2023). Methods can be classified into two categories: single-feature explanation type and multi-feature explanation type.

2.1 Single-Feature Explanation

Earlier research focuses on single-feature attribution Ribeiro et al. (2016); Sundararajan et al. (2017); Kokalj et al. (2021). For example, LIME Ribeiro et al. (2016) fits the local area of the model by linear regression on sampled data points and takes the resulting linear weights as attribution scores. Gradient&Input (Grad$\times$Inp) Shrikumar et al. (2017b) combines the gradient norm with Shapley value Shapley et al. (1953). DeepLIFT Shrikumar et al. (2017a) depends on activation differences to calculate attribution scores. IG Sundararajan et al. (2017); Sanyal and Ren (2021); Enguehard (2023) uses a path integral to compute the contribution of a single feature to the output. Notably, IG is the unique path method that satisfies the completeness and symmetry-preserving axioms. There exist several variants of IG. DIG Sanyal and Ren (2021) regards similar words as interpolation points to estimate the integrated gradients value. SIG Enguehard (2023) computes the importance of each word in a sentence while keeping all other words fixed. However, scoring individual features cannot capture interactions between features.

2.2 Multi-Feature Explanation

Multi-feature explanation methods aim to model feature interactions in deep learning architectures. For example, Dhamdhere et al. (2020) propose a variant of the Shapley value to measure interactions. Zhang et al. (2021a) define the multivariate Shapley value to analyze interactions between two sets of players. Enouen and Liu (2022) propose a sparse interaction additive network to select feature groups. Tsang et al. (2020) propose the Archipelago framework to measure feature attribution and interaction through ArchAttribute and ArchDetect. Lu et al. (2023) propose ASIV to model asymmetric higher-order feature interactions. To illustrate the feature interplay process completely, the explanation of feature interactions can be articulated within a hierarchical framework. HEDGE Chen et al. (2020) is a top-down, model-agnostic hierarchical explanation method that neglects non-contiguous interactions. Ju et al. (2023) address the connecting-rule limitation in HEDGE and propose a greedy algorithm, HE, for generating hierarchical explanations, which is time-consuming. Moreover, all of these methods neglect linguistic information, including syntax and semantics.

3 Background

We first give a review of hyperbolic geometry.

Poincare ball A common representation model in hyperbolic space is the Poincare ball, denoted as $(\mathcal{B}_c^m, g_{\boldsymbol{x}}^{\mathcal{B}})$, where $c$ is a constant greater than $0$. $\mathcal{B}_c^m = \{\boldsymbol{x} \in \mathbb{R}^m \mid c\|\boldsymbol{x}\|^2 < 1\}$ is a Riemannian manifold, $g_{\boldsymbol{x}}^{\mathcal{B}} = (\lambda_{\boldsymbol{x}}^c)^2 \boldsymbol{I}_m$ is its metric tensor, $\lambda_{\boldsymbol{x}}^c = 2/(1 - c\|\boldsymbol{x}\|^2)$ is the conformal factor, and $c$ is the negative curvature of the hyperbolic space. PE uses the standard Poincare ball with $c=1$.
The distance for $\boldsymbol{x}, \boldsymbol{y} \in \mathcal{B}_c^m$ is:

$$d_{\mathcal{B}}(\boldsymbol{x},\boldsymbol{y}) = 2\tanh^{-1}\left\|-\boldsymbol{x}\oplus_c\boldsymbol{y}\right\|, \tag{1}$$

where $\oplus_c$ denotes the Möbius addition. We use $\otimes_c$ to denote the Möbius matrix multiplication. The Möbius addition for $\boldsymbol{x}, \boldsymbol{y} \in \mathbb{R}^m$ is defined as Demirel (2013):

$$\boldsymbol{x}\oplus_c\boldsymbol{y} = \frac{(1 + 2\langle\boldsymbol{x},\boldsymbol{y}\rangle + \|\boldsymbol{y}\|^2)\boldsymbol{x} + (1 - \|\boldsymbol{x}\|^2)\boldsymbol{y}}{1 + 2\langle\boldsymbol{x},\boldsymbol{y}\rangle + \|\boldsymbol{x}\|^2\|\boldsymbol{y}\|^2}. \tag{2}$$
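As an illustration of Eqs. (1) and (2), the following minimal sketch evaluates the Möbius addition and the Poincare distance for $c=1$ in plain Python. The helper names (`dot`, `mobius_add`, `poincare_distance`), the toy points, and the clamping constant are assumptions made for this example; they are not taken from PE's implementation.

```python
import math


def dot(x, y):
    """Euclidean inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def mobius_add(x, y):
    """Möbius addition on the unit Poincare ball (c = 1), following Eq. (2)."""
    xy = dot(x, y)
    x2 = dot(x, x)
    y2 = dot(y, y)
    denom = 1.0 + 2.0 * xy + x2 * y2
    coef_x = 1.0 + 2.0 * xy + y2
    coef_y = 1.0 - x2
    return [(coef_x * a + coef_y * b) / denom for a, b in zip(x, y)]


def poincare_distance(x, y):
    """Eq. (1): d(x, y) = 2 * artanh(||(-x) (+) y||)."""
    neg_x = [-a for a in x]
    diff = mobius_add(neg_x, y)
    norm = math.sqrt(dot(diff, diff))
    # Assumption: clamp so that rounding cannot push the norm to 1 (artanh diverges there).
    norm = min(norm, 1.0 - 1e-15)
    return 2.0 * math.atanh(norm)


# Toy points inside the unit ball (illustrative values only).
origin = [0.0, 0.0]
p = [0.5, 0.0]
q = [0.0, 0.5]
d_origin = poincare_distance(origin, p)
d_pq = poincare_distance(p, q)
```

For a point at Euclidean norm 0.5, the distance to the origin is $2\tanh^{-1}(0.5) = \ln 3$, which gives a quick sanity check on the sketch.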

Given a linear projection $\boldsymbol{A}: \mathbb{R}^m \rightarrow \mathbb{R}^p$ and $\boldsymbol{x} \in \mathcal{B}_c^m$, the Möbius matrix multiplication is defined as Demirel (2013):

$$\boldsymbol{A}\otimes_c\boldsymbol{x} = \tanh\left(\frac{\|\boldsymbol{A}\boldsymbol{x}\|}{\|\boldsymbol{x}\|}\tanh^{-1}(\|\boldsymbol{x}\|)\right)\frac{\boldsymbol{A}\boldsymbol{x}}{\|\boldsymbol{A}\boldsymbol{x}\|}. \tag{3}$$
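Eq. (3) can be sketched in a few lines of plain Python for $c=1$. The code below is only an illustration: the function names, the row-major storage of $\boldsymbol{A}$ as a $p \times m$ nested list, and the convention of returning the zero vector when $\boldsymbol{x}$ or $\boldsymbol{A}\boldsymbol{x}$ is zero are assumptions of this example rather than details stated above.

```python
import math


def norm(v):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(a * a for a in v))


def matvec(A, x):
    """Ordinary matrix-vector product; A is a list of p rows of length m."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def mobius_matvec(A, x):
    """Möbius matrix multiplication of Eq. (3) on the unit ball (c = 1)."""
    Ax = matvec(A, x)
    x_norm = norm(x)
    Ax_norm = norm(Ax)
    # Assumption: map degenerate inputs to the origin to avoid division by zero.
    if x_norm == 0.0 or Ax_norm == 0.0:
        return [0.0] * len(Ax)
    scale = math.tanh(Ax_norm / x_norm * math.atanh(x_norm)) / Ax_norm
    return [scale * a for a in Ax]


# Toy projection from R^3 to R^2 (illustrative values only).
A = [[2.0, 0.0, 0.0],
     [0.0, 3.0, 0.0]]
x = [0.3, 0.4, 0.0]
y = mobius_matvec(A, x)
```

The result keeps the direction of $\boldsymbol{A}\boldsymbol{x}$, while the $\tanh$ rescaling keeps its norm below 1, so the projected point stays inside the ball even when $\|\boldsymbol{A}\boldsymbol{x}\| > 1$.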

Cooperative Game Theory We use $N$ to denote a set of players (i.e., the token set). A game is a pair $\Gamma = (N, v)$, where $v: 2^N \rightarrow \mathbb{R}$ is the characteristic function. A coalition is any subset of $N$. In a cooperative game, players can form coalitions, and each coalition $S \subseteq N$ has a value $v(S)$.

4 Methodology

This section provides a detailed introduction to the three parts of PE. First, we need to score each feature; then, based on these scores, we construct a hierarchy. In Section 4.1, we consider semantic and syntactic factors. In Section 4.2, we simplify the calculation of each feature's Shapley contribution. In Section 4.3, we combine these factors to score each feature and propose a fast algorithm for constructing the hierarchy.

4.1 Poincare Projection

In this paper, we use probing Hewitt and Manning (2019) to recover information from embeddings. Namely, we train two matrices to project the Euclidean embeddings into hyperbolic spaces. For a classification task, we are given a sequence $X_i = \{x_j\}_{1 \leq j \leq n}$ and a trained model $f$, where $n$ is the sequence length. $\hat{y}$ represents the predicted label, and $f(\cdot)$ represents the model's output probability for the predicted label.

4.1.1 Label Aware Semantic Probing

In this subsection, we extract the semantics from the embeddings through probing. We project the embeddings into a hyperbolic space using a transformation matrix. In this space, the distribution of examples with different semantics will change according to their semantic variations. First, we feed the sequence $X_i$ into a pre-trained language model to obtain the contextualized representations $\boldsymbol{E}_i \in \mathbb{R}^{n \times d_{in}}$, where $d_{in}$ denotes the output dimension. Next, the sentence embedding $\boldsymbol{s}_i \in \mathbb{R}^{d_{in}}$ is obtained from the hidden representation of the special tag (e.g., [CLS]), which is the first token of the sequence and is used for classification tasks.
We use two probing matrices, $\boldsymbol{A}_{se}, \boldsymbol{A}_{sy} \in \mathbb{R}^{d_{in} \times d_{out}}$ ($d_{out}$ denotes the projection dimension), for probing label-aware semantic information and syntax information, respectively. For semantics, we can obtain the projected representation:

$$\boldsymbol{s}_i^{se} = \boldsymbol{A}_{se}\otimes_c\boldsymbol{s}_i. \tag{4}$$

We can also obtain the token representation:

$$\boldsymbol{e}_j^{se} = \boldsymbol{A}_{se}\otimes_c\boldsymbol{e}_j. \tag{5}$$

To train the probing matrices, we draw inspiration from prototype networks Snell et al. (2017), assuming that there exist $k$ centroids representing labels in the hyperbolic space. The closer a point is to a centroid, the higher the probability that it belongs to that category. Specifically, instead of using mean pooling to calculate the prototypes, we directly initialize the prototype embeddings in hyperbolic space, denoted as $\boldsymbol{\omega} = \{\boldsymbol{c}_k\}$ ($\boldsymbol{c}_k$ is the $k$-th label centroid). Given a distance $d_{\mathcal{B}}$, the prototypes produce a distribution over classes for a point (here, the projected sentence embedding $\boldsymbol{s}_i^{se}$) based on a softmax over distances to prototypes in the embedding space:

$$\mathcal{P}(y = k \mid \boldsymbol{\omega}) = \frac{\exp(-d_{\mathcal{B}}(\boldsymbol{s}_i^{se}, \boldsymbol{c}_k))}{\sum_{k'}\exp(-d_{\mathcal{B}}(\boldsymbol{s}_i^{se}, \boldsymbol{c}_{k'}))}. \tag{6}$$

We minimize the negative log-probability $J(\boldsymbol{\omega}) = -\log\mathcal{P}(y = k \mid \boldsymbol{\omega})$ of the true class $k$ via RiemannianAdam Kochurov et al. (2017).
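The prototype distribution of Eq. (6) and the loss $J(\boldsymbol{\omega})$ can be sketched as follows for $c=1$. This is a forward-pass illustration only: the function names, the two toy centroids, and the toy sentence point are assumptions, and the Riemannian optimization step is not shown.

```python
import math


def poincare_distance(x, y):
    """Eq. (1) with c = 1; the Möbius addition (-x) (+) y of Eq. (2) is inlined."""
    xy = -sum(a * b for a, b in zip(x, y))  # <-x, y>
    x2 = sum(a * a for a in x)
    y2 = sum(b * b for b in y)
    denom = 1.0 + 2.0 * xy + x2 * y2
    diff = [((1.0 + 2.0 * xy + y2) * (-a) + (1.0 - x2) * b) / denom
            for a, b in zip(x, y)]
    n = min(math.sqrt(sum(d * d for d in diff)), 1.0 - 1e-15)
    return 2.0 * math.atanh(n)


def prototype_probs(s, prototypes):
    """Softmax over negative hyperbolic distances to each centroid (Eq. 6)."""
    logits = [-poincare_distance(s, c) for c in prototypes]
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]


def nll(s, prototypes, true_label):
    """Negative log-probability of the true class, J(omega)."""
    return -math.log(prototype_probs(s, prototypes)[true_label])


# Hypothetical centroids for two labels and one projected sentence embedding.
prototypes = [[0.6, 0.0], [-0.6, 0.0]]
s = [0.4, 0.1]
probs = prototype_probs(s, prototypes)
loss = nll(s, prototypes, 0)
```

Because the toy point lies closer to the first centroid, the first class receives the larger probability and the loss for label 0 falls below $\ln 2$, the value obtained for a point equidistant from both centroids.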

4.1.2 Syntax Probing

Similarly, in this subsection, we obtain syntax through probing. The difference is that for syntax, we focus on tokens. In the projected hyperbolic space, the distance of the token embeddings from the origin and the distance between tokens correspond to the depth of the tokens and their distance in the DPT respectively. We project word embeddings first:

$$\boldsymbol{e}_j^{sy} = \boldsymbol{A}_{sy}\otimes_c\boldsymbol{e}_j, \tag{7}$$

where $\boldsymbol{e}_j = \boldsymbol{E}_{j,:}$. Parameterizing a dependency tree from dense embeddings is non-trivial. Following Hewitt and Manning (2019), we define two metrics to measure the deviation from the standard: the distance between two words in the embedding space represents the distance between the word nodes in the dependency tree, and the distance of a word from the origin represents the depth of the word node. We use the following two loss functions:

$$\mathcal{L}_{\text{dis}} = \frac{1}{n^2}\sum_{j,j'\in[n]}\left|d_{DPT}(x_j, x_{j'}) - d_{\mathcal{B}}(\boldsymbol{e}_j^{sy}, \boldsymbol{e}_{j'}^{sy})^2\right|, \tag{8}$$
$$\mathcal{L}_{\text{dep}} = \frac{1}{n}\sum_{j\in[n]}\left|d_{DPT}(x_j) - d_{\mathcal{B}}(\boldsymbol{e}_j^{sy}, \boldsymbol{0})^2\right|, \tag{9}$$

where $d_{DPT}(x_j, x_{j'})$ and $d_{DPT}(x_j)$ represent the distance between words and the depth of a word in the DPT, respectively, and $d_{\mathcal{B}}(\boldsymbol{e}_j^{sy}, \boldsymbol{0})$ denotes the distance between $\boldsymbol{e}_j^{sy}$ and the origin in the projected hyperbolic space.
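To make the two probing losses concrete, the sketch below evaluates Eqs. (8) and (9) on a hand-built three-word example with $c=1$. The tree distances, depths, and embedding coordinates are toy values chosen for illustration, and the function names are assumptions of this example.

```python
import math


def poincare_distance(x, y):
    """Eq. (1) with c = 1; the Möbius addition (-x) (+) y of Eq. (2) is inlined."""
    xy = -sum(a * b for a, b in zip(x, y))  # <-x, y>
    x2 = sum(a * a for a in x)
    y2 = sum(b * b for b in y)
    denom = 1.0 + 2.0 * xy + x2 * y2
    diff = [((1.0 + 2.0 * xy + y2) * (-a) + (1.0 - x2) * b) / denom
            for a, b in zip(x, y)]
    n = min(math.sqrt(sum(d * d for d in diff)), 1.0 - 1e-15)
    return 2.0 * math.atanh(n)


def distance_loss(tree_dist, emb):
    """Eq. (8): mean absolute gap between tree distance and squared hyperbolic distance."""
    n = len(emb)
    total = 0.0
    for j in range(n):
        for k in range(n):
            total += abs(tree_dist[j][k] - poincare_distance(emb[j], emb[k]) ** 2)
    return total / (n * n)


def depth_loss(tree_depth, emb):
    """Eq. (9): mean absolute gap between tree depth and squared distance to the origin."""
    origin = [0.0] * len(emb[0])
    n = len(emb)
    return sum(abs(tree_depth[j] - poincare_distance(emb[j], origin) ** 2)
               for j in range(n)) / n


# Toy tree: word 0 is the root, words 1 and 2 are its children.
tree_dist = [[0, 1, 1],
             [1, 0, 2],
             [1, 2, 0]]
tree_depth = [0, 1, 1]
t = math.tanh(0.5)  # a point at Euclidean norm tanh(0.5) has hyperbolic distance 1 to the origin
emb = [[0.0, 0.0], [t, 0.0], [-t, 0.0]]
l_dis = distance_loss(tree_dist, emb)
l_dep = depth_loss(tree_depth, emb)
```

In this configuration the depths are matched exactly, so the depth loss vanishes, whereas the two children sit at hyperbolic distance 2 (squared distance 4) against a tree distance of 2, which leaves a residual distance loss of $4/9$.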

4.2 Shapley Contribution Estimation

According to cooperative game theory, we regard the input as a set of players $N$, where each element of the set corresponds to a word, and the process of hierarchical clustering is viewed as a game, with clusters containing more than two words considered a coalition. Following Zhang et al. (2021b), we define the characteristic function as $v = f$. Given a game $\Gamma = (N, v)$, a fair payment scheme rewards each player according to its contribution. The Shapley value removes the dependence on ordering by taking the average over all possible orderings for fairness. The Shapley value of player $j$ in a game is as follows:

$$\phi_j = \frac{1}{|N|!}\sum_{\pi\in\Pi(N)}\left[v(Q_j^{\pi}\cup\{j\}) - v(Q_j^{\pi})\right], \tag{10}$$

where $\Pi(N)$ is the set of all permutations of the players, $Q_j^{\pi}$ is the set of players preceding player $j$ (i.e., token $j$) in permutation $\pi$, and $v(S)$ is the value that the coalition of players $S \subseteq N$ can achieve together. In practice, Monte Carlo sampling is used:

$$\hat{\phi}_j = \frac{1}{R}\sum_{r=1}^{R}\left[v(Q_j^{\pi_r}\cup\{j\}) - v(Q_j^{\pi_r})\right], \tag{11}$$

where $\pi_r$ denotes the $r$-th sampled permutation from $\Pi(N)$ and $R$ is the number of sampled permutations. Unfortunately, Monte Carlo sampling methods can exhibit slow convergence Mitchell et al. (2022).
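The difference between the exact Shapley value of Eq. (10) and its Monte Carlo estimate in Eq. (11) can be seen on a small toy game. In the sketch below, `toy_value` is a hypothetical characteristic function standing in for the model output $f$, and the word weights, the interaction term, and the sample count are illustrative assumptions.

```python
import itertools
import random


def shapley_exact(players, v):
    """Eq. (10): average marginal contribution over all |N|! orderings."""
    phi = {j: 0.0 for j in players}
    perms = list(itertools.permutations(players))
    for perm in perms:
        preceding = set()
        for j in perm:
            phi[j] += v(preceding | {j}) - v(preceding)
            preceding.add(j)
    return {j: total / len(perms) for j, total in phi.items()}


def shapley_monte_carlo(players, v, num_samples, seed=0):
    """Eq. (11): average marginal contribution over R randomly sampled orderings."""
    rng = random.Random(seed)
    phi = {j: 0.0 for j in players}
    order = list(players)
    for _ in range(num_samples):
        rng.shuffle(order)
        preceding = set()
        for j in order:
            phi[j] += v(preceding | {j}) - v(preceding)
            preceding.add(j)
    return {j: total / num_samples for j, total in phi.items()}


def toy_value(coalition):
    """Hypothetical characteristic function: additive word weights plus a negation interaction."""
    weights = {"not": -0.1, "very": 0.05, "good": 0.5}
    score = sum(weights[w] for w in coalition)
    if "not" in coalition and "good" in coalition:
        score -= 0.8  # "not" flips the effect of "good"
    return score


players = ["not", "very", "good"]
exact = shapley_exact(players, toy_value)
approx = shapley_monte_carlo(players, toy_value, num_samples=200)
```

The exact computation enumerates $|N|!$ orderings, which is only feasible for very short inputs, while the sampled version trades that cost for estimation noise. In both cases the per-player values sum to $v(N) - v(\emptyset)$, because the marginal contributions telescope within each ordering.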

Notably, the attention mechanism of the Transformer is permutation invariant Vaswani et al. (2017); Xilong et al. (2023), and the sinusoidal position embedding is only related to the specific position, not to the word. Moreover, after being trained with a language modeling task, the model has the ability to fill in the blanks based on context. Therefore, we assume that it is unnecessary to enumerate exponentially many combinations of words, and that the contribution of a preceding permutation set (e.g., $\pi(<r)$) is included in larger subsequent permutation sets (e.g., $\pi(r)$). Hence, we directly calculate the contribution as follows:

$$\begin{aligned}\tilde{\phi}_j &= v(N) - v(N\setminus\{j\}) \\ &= f(X) - f(X\setminus\{x_j\}),\end{aligned} \tag{12}$$

where $N\setminus\{j\}$ denotes the player set excluding player $j$ and $X\setminus\{x_j\}$ denotes the input excluding token $x_j$.
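Eq. (12) amounts to a leave-one-out comparison that needs only $n+1$ model evaluations. The sketch below illustrates it with `toy_model`, a hypothetical stand-in for the classifier probability $f$. Implementing "excluding token $x_j$" by deleting the token from the list is an assumption of this example; the text above does not specify whether the token is deleted or masked.

```python
import math


def leave_one_out_contributions(tokens, f):
    """Eq. (12): contribution of token j is f(X) - f(X without x_j)."""
    full = f(tokens)
    scores = []
    for j in range(len(tokens)):
        reduced = tokens[:j] + tokens[j + 1:]  # assumption: exclusion by deletion
        scores.append(full - f(reduced))
    return scores


def toy_model(tokens):
    """Hypothetical positive-class probability: a logistic function of summed word weights."""
    weights = {"interesting": 2.0, "draggy": -1.5}
    z = sum(weights.get(t, 0.0) for t in tokens)
    return 1.0 / (1.0 + math.exp(-z))


tokens = ["an", "interesting", "but", "draggy", "movie"]
contributions = leave_one_out_contributions(tokens, toy_model)
```

With these toy weights, removing "interesting" lowers the positive probability, so it receives a positive score, removing "draggy" raises it, so it receives a negative score, and the neutral words receive a score of zero.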

4.3 Minimum Spanning Tree

Our goal is to identify a hierarchy tree $T$ that aligns with semantic similarities, syntax similarities, and the contributions of individual elements. Building upon Dasgupta (2016), we use the following cost:

$$C_D(T;e) = \sum_{j,j'\in[n]} e_{jj'}\left|\mathrm{leaves}(T[j\vee j'])\right|, \tag{13}$$

where $e_{jj'}$ denotes the pairwise similarities, $\mathrm{leaves}(T[j\vee j'])$ is the set of leaves of $T[j\vee j']$, the subtree rooted at $j\vee j'$, and $j\vee j'$ is the parent node (lowest common ancestor) of $j$ and $j'$, as shown in Figure 3. Because $\mathrm{leaves}(T[j\vee j'])$ is difficult to expand directly, we adopt the following expansion by Wang and Wang (2018):

$$C_{D}(T;e)=\sum_{jj'u\in[n]}\left[e_{jj'}+e_{ju}+e_{j'u}-e_{jj'u}(T)\right]+2\sum_{jj'}e_{jj'}, \tag{14}$$

where

$$e_{jj'u}(T)=e_{jj'}\mathbbm{1}[\{j,j'\mid u\}]+e_{ju}\mathbbm{1}[\{j,u\mid j'\}]+e_{j'u}\mathbbm{1}[\{j',u\mid j\}], \tag{15}$$

where $\{j,j'\mid u\}$ means that $j\vee j'$ is a descendant of $j\vee j'\vee u$, as illustrated in Figure 3. The same holds for $\{j,u\mid j'\}$ and $\{j',u\mid j\}$.

Figure 3: Three different binary tree types rooted at $j\vee j'\vee u$.
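
As an illustrative check of Equations 13 to 15, the minimal Python sketch below evaluates the cost of a toy tree twice, once directly from the subtree sizes and once through the triplet expansion, and the two values agree. Several things here are assumptions made for illustration and are not taken from the paper: the nested-tuple tree encoding, the helper names, the toy similarity matrix, reading $j\vee j'$ as the lowest common ancestor of the two leaves, and summing over unordered pairs and triplets.

```python
from itertools import combinations

def leaves(tree):
    """Leaf indices under a tree encoded as nested 2-tuples of ints (assumed encoding)."""
    if isinstance(tree, int):
        return {tree}
    left, right = tree
    return leaves(left) | leaves(right)

def lca_size(tree, a, b):
    """Leaf count of the smallest subtree that contains both leaves a and b."""
    for child in tree:
        if not isinstance(child, int) and {a, b} <= leaves(child):
            return lca_size(child, a, b)
    return len(leaves(tree))

def cost_direct(tree, e):
    """Equation 13, summed over unordered leaf pairs."""
    n = len(leaves(tree))
    return sum(e[a][b] * lca_size(tree, a, b) for a, b in combinations(range(n), 2))

def cost_triplet(tree, e):
    """Equation 14, summed over unordered triplets and pairs."""
    n = len(leaves(tree))
    total = 2 * sum(e[a][b] for a, b in combinations(range(n), 2))
    for trip in combinations(range(n), 3):
        pairs = list(combinations(trip, 2))
        # The pair that merges first has the smallest common subtree (Equation 15).
        first = min(pairs, key=lambda p: lca_size(tree, *p))
        total += sum(e[a][b] for a, b in pairs) - e[first[0]][first[1]]
    return total

# Hypothetical toy input: four leaves and a symmetric similarity matrix.
toy_tree = ((0, 1), (2, 3))
toy_e = [[0, 3, 1, 1],
         [3, 0, 1, 2],
         [1, 1, 0, 4],
         [1, 2, 4, 0]]
print(cost_direct(toy_tree, toy_e), cost_triplet(toy_tree, toy_e))  # prints "34 34"
```

The triplet form avoids the subtree-size term: each triplet only needs to know which of its three pairs merges first.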

We aim to find the binary tree $T^{*}$:

$$T^{*}=\mathop{\text{argmin}}_{\text{all binary trees }T}C_{D}(T;e). \tag{16}$$

Directly optimizing this cost presents a combinatorial optimization problem. We introduce the following decomposition:

$$e_{jj'}=-\tilde{\phi}(j\vee j')+\alpha_{1}d_{\mathcal{B}}(\boldsymbol{e}_{j}^{se},\boldsymbol{e}_{j'}^{se})+\frac{1}{2}\alpha_{2}\left(d_{\mathcal{B}}(\boldsymbol{e}_{j}^{sy},\boldsymbol{0})+d_{\mathcal{B}}(\boldsymbol{e}_{j'}^{sy},\boldsymbol{0})\right), \tag{17}$$

where $\alpha_{1},\alpha_{2}\in[0,1]$.
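
To make the decomposition in Equation 17 concrete, the following sketch computes one edge weight from toy embeddings. It assumes the standard Poincaré ball distance with curvature $-1$, which may differ from the exact definition of $d_{\mathcal{B}}$ used earlier in the paper. The function names, default weights, and input values are hypothetical.

```python
import math

def squared_norm(x):
    return sum(t * t for t in x)

def poincare_distance(u, v):
    """Distance in the unit Poincare ball (curvature -1 is an assumption)."""
    diff = squared_norm([a - b for a, b in zip(u, v)])
    denom = (1.0 - squared_norm(u)) * (1.0 - squared_norm(v))
    return math.acosh(1.0 + 2.0 * diff / denom)

def edge_weight(phi_pair, se_j, se_k, sy_j, sy_k, alpha1=0.5, alpha2=0.5):
    """Equation 17: contribution term, semantic distance, and syntactic depth."""
    origin = [0.0] * len(sy_j)
    semantic = poincare_distance(se_j, se_k)
    syntactic = poincare_distance(sy_j, origin) + poincare_distance(sy_k, origin)
    return -phi_pair + alpha1 * semantic + 0.5 * alpha2 * syntactic

# Hypothetical 2-d embeddings; phi_pair stands in for the estimated contribution.
w = edge_weight(0.2, [0.5, 0.0], [0.5, 0.0], [0.5, 0.0], [0.0, 0.0])
print(round(w, 4))
```

In this reading, a pair receives a small weight (and is merged early) when its joint contribution is large, its semantic embeddings are close, and both words lie near the origin of the syntactic space.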

Under this decomposition, we prove that the optimal tree $T^{*}$ is a minimum-spanning-tree-like solution of Equation 14; the difference from the original minimum spanning tree is described in the last paragraph of Appendix A. The proof can be found in Appendix A. Finally, we introduce the following decoding algorithm:

Algorithm 1 Building Algorithm
Input: Label hyperbolic embeddings $\boldsymbol{E}^{se}=\{\boldsymbol{E}_{1}^{se},\cdots,\boldsymbol{E}_{n}^{se}\}$, syntax hyperbolic embeddings $\boldsymbol{E}^{sy}=\{\boldsymbol{E}_{1}^{sy},\cdots,\boldsymbol{E}_{n}^{sy}\}$
Output: Binary tree $T$ with $n$ leaves
  $T\leftarrow(\{x_{j}\}:x_{j}\in X)$
  Initialize a PriorityQueue $\Upsilon$
  $\Upsilon\leftarrow\{(x_{j},x_{j'}):\text{pairs sorted by }e_{jj'}\}$
  while $\Upsilon\neq\varnothing$ do
     $x_{j},x_{j'}\leftarrow\Upsilon$.front, $\Upsilon$.pop
     if $x_{j}$ and $x_{j'}$ not in $T$ then
        $T\leftarrow T\cup\{x_{j}\vee x_{j'}\}$
        $\Upsilon$.push($x_{j}\vee x_{j'}$)
     end if
  end while
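
One possible reading of Algorithm 1 is sketched below in Python. It is not the authors' implementation: instead of pushing merged nodes back into the queue, it swaps in a Kruskal-style pass with a union-find structure, which pops pairs in increasing order of $e_{jj'}$ and merges the two clusters whenever the popped leaves are not yet in the same subtree. The function name, the nested-tuple output, and the toy weight matrix are assumptions for illustration.

```python
import heapq

def build_hierarchy(weights):
    """Greedy tree building from a symmetric matrix of edge weights e_jj'.

    Returns a binary tree as nested 2-tuples of leaf indices.
    """
    n = len(weights)
    parent = list(range(n))            # union-find forest over the leaves
    subtree = {j: j for j in range(n)} # cluster root -> tree built so far

    def find(j):
        while parent[j] != j:
            parent[j] = parent[parent[j]]  # path halving
            j = parent[j]
        return j

    # Priority queue of all leaf pairs, smallest weight first.
    heap = [(weights[j][k], j, k) for j in range(n) for k in range(j + 1, n)]
    heapq.heapify(heap)

    while heap:
        _, j, k = heapq.heappop(heap)
        root_j, root_k = find(j), find(k)
        if root_j != root_k:           # the two leaves are not merged yet
            parent[root_k] = root_j
            subtree[root_j] = (subtree[root_j], subtree.pop(root_k))
    return subtree[find(0)]

# Hypothetical weights for four tokens.
toy_weights = [[0, 1, 5, 6],
               [1, 0, 4, 7],
               [5, 4, 0, 2],
               [6, 7, 2, 0]]
print(build_hierarchy(toy_weights))  # prints ((0, 1), (2, 3))
```

Ordering the $n(n-1)/2$ pairs dominates the running time in this sketch, which is consistent with the $O(n^{2}\log n)$ bound the paper reports for building the hierarchy.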

5 Experiments

5.1 Experimental Setups

Datasets To evaluate the effectiveness of PE, we perform comprehensive experiments on three representative text classification datasets: “Rotten Tomatoes” Pang and Lee (2005), “TREC” Li and Roth (2002), “Yelp” Zhang et al. (2015). Detailed statistics are in Table 1.

| Datasets | Train/Dev/Test | C | L |
|---|---|---|---|
| Rotten Tomatoes | 10K/2K/2K | 2 | 64 |
| TREC | 5000/452/500 | 6 | 64 |
| Yelp | 10K/2K/1K | 2 | 256 |

Table 1: Statistics of the three datasets. C: number of classes; L: average text length.

Metrics Following prior literature DeYoung et al. (2020), we use the AOPC metric, which is the average change in the predicted class probability before and after removing the top $K$ words.

$$\text{AOPC}=\frac{1}{n}\sum_{K}\left(f(x_{i})-f(\tilde{x}_{i}^{K})\right) \tag{18}$$

Higher is better. We evaluate two different strategies: $del$ and $pad$. Concretely, we assign values to words through the following formula:

$$\text{score}_{j}=\tilde{\phi}(j)-\beta_{1}d_{\mathcal{B}}(\boldsymbol{e}_{j}^{se},\boldsymbol{c}_{k})-\beta_{2}d_{\mathcal{B}}(\boldsymbol{e}_{j}^{sy},\boldsymbol{0}), \tag{19}$$

where $\boldsymbol{c}_{k}$ is the prototype of the predicted label $k$ in the semantic hyperbolic space, $\boldsymbol{0}$ is the origin of the syntactic hyperbolic space, and $\beta_{1},\beta_{2}\in[0,1]$.
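
The evaluation protocol can be sketched as follows, assuming that $del$ deletes the top-scored words and $pad$ replaces them with a padding token, and that the probability drop is averaged over examples at a fixed removal ratio. These readings, the function names, the `[PAD]` token, and the toy classifier are assumptions for illustration; the word scores would come from a formula such as Equation 19.

```python
def remove_top_k(tokens, scores, ratio, strategy, pad_token="[PAD]"):
    """Remove the highest-scored fraction of words by deletion or padding."""
    k = max(1, int(len(tokens) * ratio))
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    top = set(ranked[:k])
    if strategy == "del":
        return [t for i, t in enumerate(tokens) if i not in top]
    return [pad_token if i in top else t for i, t in enumerate(tokens)]

def aopc(examples, predict_proba, ratio, strategy):
    """Average drop in the predicted-class probability after removal."""
    drops = []
    for tokens, scores, label in examples:
        before = predict_proba(tokens)[label]
        after = predict_proba(remove_top_k(tokens, scores, ratio, strategy))[label]
        drops.append(before - after)
    return sum(drops) / len(drops)

def toy_predict_proba(tokens):
    """Hypothetical stand-in classifier: P(positive) grows with the count of 'good'."""
    positive = min(1.0, 0.5 + 0.25 * tokens.count("good"))
    return [1.0 - positive, positive]

# One toy example: tokens, per-word scores, and the predicted label index.
toy_examples = [(["a", "good", "good", "film"], [0.0, 0.9, 0.8, 0.1], 1)]
print(aopc(toy_examples, toy_predict_proba, 0.5, "del"))  # prints 0.5
print(aopc(toy_examples, toy_predict_proba, 0.5, "pad"))  # prints 0.5
```

A larger drop means that the words ranked highest by the attribution method were indeed the ones the classifier relied on.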

Infrastructure All experiments are run on one 15-core 2.6 GHz CPU (Intel(R) Xeon(R) Platinum 8358P) and one RTX 3090 GPU.

Baselines We compare PE with three hierarchical attribution methods: HEDGE Chen et al. (2020), $\text{HE}_{LIME}$, and $\text{HE}_{LOO}$ Ju et al. (2023), and three feature interaction methods: SOC Jin et al. (2020), Bivariate Shapley (BS) Masoomi et al. (2022), and ASIV Lu et al. (2023).

| Methods | Rotten Tomatoes $\text{AOPC}_{del}$ 10% | Rotten Tomatoes $\text{AOPC}_{del}$ 20% | Rotten Tomatoes $\text{AOPC}_{del}$ Avg | Rotten Tomatoes $\text{AOPC}_{pad}$ 10% | Rotten Tomatoes $\text{AOPC}_{pad}$ 20% | Rotten Tomatoes $\text{AOPC}_{pad}$ Avg | TREC $\text{AOPC}_{del}$ 10% | TREC $\text{AOPC}_{del}$ 20% | TREC $\text{AOPC}_{del}$ Avg | TREC $\text{AOPC}_{pad}$ 10% | TREC $\text{AOPC}_{pad}$ 20% | TREC $\text{AOPC}_{pad}$ Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SOC | 0.102 | 0.117 | $0.110_{\pm 0.003}$ | 0.149 | 0.153 | $0.151_{\pm 0.002}$ | 0.074 | 0.087 | $0.081_{\pm 0.001}$ | 0.097 | 0.099 | $0.098_{\pm 0.001}$ |
| HEDGE | 0.087 | 0.134 | $0.111_{\pm 0.011}$ | 0.084 | 0.194 | $0.139_{\pm 0.009}$ | 0.068 | 0.079 | $0.074_{\pm 0.004}$ | 0.095 | 0.101 | $0.098_{\pm 0.008}$ |
| $\text{HE}_{LIME}$ | 0.075 | 0.195 | $0.135_{\pm 0.005}$ | 0.076 | 0.193 | $0.135_{\pm 0.009}$ | 0.063 | 0.072 | $0.068_{\pm 0.003}$ | 0.059 | 0.066 | $0.063_{\pm 0.007}$ |
| $\text{HE}_{LOO}$ | 0.062 | 0.117 | $0.090_{\pm 0.004}$ | 0.061 | 0.119 | $0.090_{\pm 0.004}$ | 0.081 | 0.092 | $0.087_{\pm 0.001}$ | 0.075 | 0.086 | $0.081_{\pm 0.005}$ |
| BS | 0.109 | 0.121 | $0.116_{\pm 0.013}$ | 0.103 | 0.185 | $0.144_{\pm 0.009}$ | 0.099 | 0.104 | $0.102_{\pm 0.003}$ | 0.097 | 0.105 | $0.101_{\pm 0.005}$ |
| ASIV | 0.101 | 0.113 | $0.107_{\pm 0.005}$ | 0.098 | 0.181 | $0.140_{\pm 0.008}$ | 0.093 | 0.106 | $0.199_{\pm 0.006}$ | 0.092 | 0.113 | $0.103_{\pm 0.003}$ |
| PE | 0.304 | 0.352 | $\textbf{0.328}_{\pm 0.011}$ | 0.364 | 0.313 | $\textbf{0.339}_{\pm 0.003}$ | 0.214 | 0.220 | $\textbf{0.217}_{\pm 0.007}$ | 0.183 | 0.174 | $\textbf{0.179}_{\pm 0.004}$ |

Table 2: AOPC comparison results of PE with baselines on the Rotten Tomatoes and TREC datasets.

| Methods | Yelp $\text{AOPC}_{del}$ 10% | Yelp $\text{AOPC}_{del}$ 20% | Yelp $\text{AOPC}_{pad}$ 10% | Yelp $\text{AOPC}_{pad}$ 20% | $\overline{\boldsymbol{t}}$ |
|---|---|---|---|---|---|
| HEDGE | 0.077 | 0.084 | 0.074 | 0.089 | $70.312_{\pm 0.074}$ |
| $\text{HE}_{LIME}$ | 0.056 | 0.075 | 0.065 | 0.076 | $20.383_{\pm 0.054}$ |
| $\text{HE}_{LOO}$ | 0.040 | 0.071 | 0.059 | 0.064 | $16.201_{\pm 0.079}$ |
| PE | 0.110 | 0.138 | 0.112 | 0.143 | $\textbf{2.230}_{\pm 0.042}$ |

Table 3: AOPC and time efficiency comparison results of PE and baselines on the Yelp dataset. $\overline{\boldsymbol{t}}$ denotes the average time of building an HA tree per input, in seconds.

5.2 General Experimental Results

We first evaluate our method using the AOPC metric across three datasets, as shown in Tables 2 and 3. First, our method, PE, consistently surpasses the baselines in binary and multiclass tasks for both short and long texts. For instance, in the $\text{AOPC}_{del}$, $20\%$ setting, PE outperforms $\text{HE}_{LOO}$ by 0.235 on Rotten Tomatoes (Table 2) and by 0.067 on Yelp (Table 3). Second, in comparison to recent works such as SOC and $\text{HE}_{LOO}$, our method's primary advantage lies in its computational efficiency. We compare the average time that the approaches take to construct HA trees. The results in Table 3 indicate that PE is substantially faster than its counterparts: it needs 2.230 seconds per input on average, roughly 7 times less than $\text{HE}_{LOO}$ (16.201 seconds), 9 times less than $\text{HE}_{LIME}$ (20.383 seconds), and more than 30 times less than HEDGE (70.312 seconds).

5.3 Ablation Study

We conduct ablation experiments with three modified baselines derived from PE: PE w/o prob, corresponding to $\tilde{\phi}(i)=0$; PE w/o semantic, corresponding to $\beta_{1}=0$; and PE w/o syntax, corresponding to $\beta_{2}=0$.

As shown in Figure 4, both PE and the other variants outperform the w/o prob baseline, demonstrating the effectiveness of directly calculating contributions with Equation 12. Moreover, in both the $del$ and $pad$ settings, estimating the contribution is more useful than the other two components in Equation 19. The reason may be that context has a greater impact on the output than the semantics or syntax of a single word. Notably, syntax slightly outperforms semantics. We hypothesize that this is related to the nature of the tasks in the TREC dataset, as the labels tend to be associated with syntactic structures Li and Roth (2002).

Figure 4: Evaluation results of the ablation study.

5.4 Case Study

For qualitative analysis, we present two typical examples from the Rotten Tomatoes dataset to illustrate the role of PE in modeling the interaction of discontinuous features; more examples are shown in Appendix B. We compare the results of PE and $\text{HE}_{LOO}$ in interpreting a BERT model. Figure 5 provides hierarchical explanations for a positive and a negative review, each generated by PE and $\text{HE}_{LOO}$. In Figure 5(a), PE accurately captures the combination of words with positive sentiment polarity, “delightful”, “out”, and “humor”, and captures the key combination of “out” and “humor” at step 1. This example also includes a word with negative polarity, “stereotypes”; $\text{HE}_{LOO}$ captures its combination with “in” and “delightful” but misses the combination with “out” and “humor”. In Figure 5(b), PE captures the combination of “slightest” and “wit” at step 1 and complements it with “lacking” at step 2. $\text{HE}_{LOO}$ captures the combination of “combination” and “animation” at step 1 and adds “lacking” at step 2. We can infer that PE captures the feature combinations more related to the label at a shallow level, which demonstrates the effectiveness of our method.

Additionally, to demonstrate more vividly the role of semantics and syntax in building hierarchical explanations, we illustrate with two examples from the TREC dataset. As shown in Figure 6(a), when $\alpha_{2}=0.5$, at level $L_{3}$, PE combines “center”, “temperature”, “the”, and “earth”. However, when $\alpha_{2}=0$, PE combines “the”, “temperature”, “the”, and “earth”. In the dependency parse tree of the sentence “What is the temperature at the center of the earth?”, the word “the” is farther from the root than “center”. This indicates that incorporating syntactic information is meaningful for constructing convincing hierarchical explanations.

(a) A positive example: “My big fat greek wedding uses stereotypes in a delightful blend of sweet romance and lovingly dished out humor.”
(b) A negative example: “Just another combination of bad animation and mindless violence lacking the slightest bit of wit or charm.”
Figure 5: PE and $\text{HE}_{LOO}$ for BERT on two examples from the Rotten Tomatoes dataset. The subtree in the upper right corner is generated by PE and the lower one is produced by $\text{HE}_{LOO}$.
(a) An example, “What is the temperature at the center of the earth?”, for which the predicted label is numeric value.
(b) A dependency parse tree generated by spaCy Honnibal and Montani (2017).
Figure 6: PE for BERT on the example from the TREC dataset. The cluster on the left side of the third level $L_{3}$ is the result for $\alpha_{2}=0.5$, and the right side is the result for $\alpha_{2}=0$.

6 Analysis of Time Complexity

In this section, we delve into the time complexity of HA methods, which can be divided into two parts: the complexity of generating attribution scores, denoted as $\textbf{O}_{attr}$, and the complexity of generating the hierarchy from the scores, denoted as $\textbf{O}_{hierarchy}$. Table 4 lists the time complexity of the various methods. For score computation, HEDGE utilizes the Monte Carlo sampling algorithm, with the number of samples denoted by $M_{1}$, leading to a time complexity of $O(nM_{1})$. $\text{HE}_{LOO}$ uses the LOO algorithm Lipton (2018), with a time complexity of $O(n^{2}M_{2})$, where $M_{2}$ is the maximum number of iterations of the LOO algorithm. $\text{HE}_{LIME}$ employs the LIME algorithm, whose ridge regression has a solving complexity of $O(n^{3}M_{3})$, where $M_{3}$ is the number of sampled instances. The time complexity of PE for computing scores is $O(n^{2})$.

| Methods | $\textbf{O}_{attr}$ | $\textbf{O}_{hierarchy}$ |
|---|---|---|
| HEDGE (2020) | $O(nM_{1})$ | $O(n^{3})$ |
| $\text{HE}_{LOO}$ (2023) | $O(n^{2}M_{2})$ | $O(n^{3})$ |
| $\text{HE}_{LIME}$ (2023) | $O(n^{3}M_{3})$ | $O(n^{3})$ |
| PE (ours) | $O(n^{2})$ | $O(n^{2}\log n)$ |

Table 4: Comparison results of HA methods about capturing non-contiguous interactions and their time complexity. The relationship between the numbers of samples in the table and the value of $n$ is $M_{1}\gg M_{2}>M_{3}\gg n$.
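
For a rough sense of scale, the short sketch below compares the hierarchy-building terms $n^{3}$ and $n^{2}\log n$ from Table 4 at the average text lengths listed in Table 1. It ignores constant factors and assumes a base-2 logarithm, so the ratios are only indicative and are not measured speedups. The function name is hypothetical.

```python
import math

def hierarchy_cost_ratio(n):
    """Ratio of n^3 to n^2 * log2(n); constants are ignored (an assumption)."""
    return n ** 3 / (n ** 2 * math.log2(n))

for n in (64, 256):
    print(n, round(hierarchy_cost_ratio(n), 2))
# prints:
# 64 10.67
# 256 32.0
```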

7 Conclusion

In this paper, we introduce PE, a computationally efficient method employing hyperbolic geometry for modeling feature interactions. More concretely, we use two hyperbolic projection matrices to embed the semantic and syntax information and devise a simple strategy to estimate the contributions of feature groups. Finally, we design an algorithm that decodes the hierarchical tree in $O(n^{2}\log n)$ time. Experimental results on three typical text classification datasets demonstrate the effectiveness of our method.

8 Limitations

The limitations of our work include: 1) Although our method boasts low time complexity, the use of the probing method to train additional model parameters, including two Poincare projection matrices, somewhat limits the generalizability of our approach. 2) In our experiments, we decompose the weights of the edges of the HA tree according to Equation 17. Whether there exists an optimal decomposition formula remains for future investigation.

References

  • Abnar and Zuidema (2020) Samira Abnar and Willem Zuidema. 2020. Quantifying attention flow in transformers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.
  • Chen et al. (2020) Hanjie Chen, Guangtao Zheng, and Yangfeng Ji. 2020. Generating hierarchical explanations on text classification via feature interaction detection. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.
  • Dasgupta (2016) Sanjoy Dasgupta. 2016. A cost function for similarity-based hierarchical clustering. In Proceedings of the Forty-Eighth Annual ACM Symposium on Theory of Computing.
  • Demirel (2013) Oğuzhan Demirel. 2013. A characterization of möbius transformations by use of hyperbolic triangles. Journal of Mathematical Analysis and Applications.
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL2018.
  • DeYoung et al. (2020) Jay DeYoung, Sarthak Jain, Nazneen Fatema Rajani, Eric Lehman, Caiming Xiong, Richard Socher, and Byron C. Wallace. 2020. ERASER: A benchmark to evaluate rationalized NLP models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.
  • Dhamdhere et al. (2020) Kedar Dhamdhere, Ashish Agarwal, and Mukund Sundararajan. 2020. The shapley taylor interaction index. In Proceedings of the 37th International Conference on Machine Learning.
  • Enguehard (2023) Joseph Enguehard. 2023. Sequential integrated gradients: a simple but effective method for explaining language models. In Findings of the Association for Computational Linguistics: ACL 2023.
  • Enouen and Liu (2022) James Enouen and Yan Liu. 2022. Sparse interaction additive networks via feature interaction detection and sparse selection. In Advances in Neural Information Processing Systems.
  • Ferrando et al. (2022) Javier Ferrando, Gerard I. Gállego, and Marta R. Costa-jussà. 2022. Measuring the mixing of contextual information in the transformer. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing.
  • Geva et al. (2021) Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. 2021. Transformer feed-forward layers are key-value memories. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing.
  • Graham and Hell (1985) Ronald L Graham and Pavol Hell. 1985. On the history of the minimum spanning tree problem. Annals of the History of Computing.
  • Guidotti et al. (2018) Riccardo Guidotti, Anna Monreale, Salvatore Ruggieri, Franco Turini, Fosca Giannotti, and Dino Pedreschi. 2018. A survey of methods for explaining black box models.
  • He et al. (2022) Junxian He, Chunting Zhou, Xuezhe Ma, Taylor Berg-Kirkpatrick, and Graham Neubig. 2022. Towards a unified view of parameter-efficient transfer learning. In International Conference on Learning Representations.
  • Hewitt and Manning (2019) John Hewitt and Christopher D. Manning. 2019. A structural probe for finding syntax in word representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers).
  • Honnibal and Montani (2017) Matthew Honnibal and Ines Montani. 2017. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear.
  • Jin et al. (2020) Xisen Jin, Zhongyu Wei, Junyi Du, Xiangyang Xue, and Xiang Ren. 2020. Towards hierarchical importance attribution: Explaining compositional semantics for neural sequence models. In International Conference on Learning Representations.
  • Ju et al. (2023) Yiming Ju, Yuanzhe Zhang, Kang Liu, and Jun Zhao. 2023. A hierarchical explanation generation method based on feature interaction detection. In Findings of the Association for Computational Linguistics: ACL 2023.
  • Kochurov et al. (2017) Max Kochurov, Rasul Karimov, and Serge Kozlukov. 2017. Geoopt: Riemannian optimization in pytorch. In International Conference on Machine Learning, GRLB Workshop.
  • Kokalj et al. (2021) Enja Kokalj, Blaž Škrlj, Nada Lavrač, Senja Pollak, and Marko Robnik-Šikonja. 2021. BERT meets shapley: Extending SHAP explanations to transformer-based classifiers. In Proceedings of the EACL Hackashop on News Media Content Analysis and Automated Report Generation.
  • Li and Roth (2002) Xin Li and Dan Roth. 2002. Learning question classifiers. In COLING 2002: The 19th International Conference on Computational Linguistics.
  • Lipton (2018) Zachary C. Lipton. 2018. The mythos of model interpretability: In machine learning, the concept of interpretability is both important and slippery. Queue.
  • Lu et al. (2023) Xiaolei Lu, Jianghong Ma, and Haode Zhang. 2023. Asymmetric feature interaction for interpreting model predictions. In Findings of the Association for Computational Linguistics: ACL 2023.
  • Mardaoui and Garreau (2021) Dina Mardaoui and Damien Garreau. 2021. An analysis of lime for text data. In Proceedings of The 24th International Conference on Artificial Intelligence and Statistics.
  • Masoomi et al. (2022) Aria Masoomi, Davin Hill, Zhonghui Xu, Craig P Hersh, Edwin K. Silverman, Peter J. Castaldi, Stratis Ioannidis, and Jennifer Dy. 2022. Explanations of black-box models based on directional feature interactions. In International Conference on Learning Representations.
  • Miglani et al. (2020) Vivek Miglani, Narine Kokhlikyan, Bilal Alsallakh, Miguel Martin, and Orion Reblitz-Richardson. 2020. Investigating saturation effects in integrated gradients. Human Interpretability Workshop at ICML.
  • Mitchell et al. (2022) Rory Mitchell, Joshua Cooper, Eibe Frank, and Geoffrey Holmes. 2022. Sampling permutations for shapley value estimation. J. Mach. Learn. Res.
  • Modarressi et al. (2023) Ali Modarressi, Mohsen Fayyaz, Ehsan Aghazadeh, Yadollah Yaghoobzadeh, and Mohammad Taher Pilehvar. 2023. DecompX: Explaining transformers decisions by propagating token decomposition. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
  • Owen (2013) Guillermo Owen. 2013. Game theory. Emerald Group Publishing.
  • Pang and Lee (2005) Bo Pang and Lillian Lee. 2005. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05).
  • Qiang et al. (2022) Yao Qiang, Deng Pan, Chengyin Li, Xin Li, Rhongho Jang, and Dongxiao Zhu. 2022. Attcat: Explaining transformers via attentive class activation tokens. In Advances in Neural Information Processing Systems.
  • Ribeiro et al. (2016) Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "why should i trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
  • Sanyal and Ren (2021) Soumya Sanyal and Xiang Ren. 2021. Discretized integrated gradients for explaining language models. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing.
  • Shapley et al. (1953) Lloyd S Shapley et al. 1953. A value for n-person games.
  • Shrikumar et al. (2017a) Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. 2017a. Learning important features through propagating activation differences. In Proceedings of the 34th International Conference on Machine Learning - Volume 70.
  • Shrikumar et al. (2017b) Avanti Shrikumar, Peyton Greenside, Anna Shcherbina, and Anshul Kundaje. 2017b. Not just a black box: Learning important features through propagating activation differences.
  • Singh et al. (2019) Chandan Singh, W. James Murdoch, and Bin Yu. 2019. Hierarchical interpretations for neural network predictions. In International Conference on Learning Representations.
  • Snell et al. (2017) Jake Snell, Kevin Swersky, and Richard S. Zemel. 2017. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems.
  • Sundararajan et al. (2017) Mukund Sundararajan, Ankur Taly, and Qiqi Yan. 2017. Axiomatic attribution for deep networks. In International conference on machine learning.
  • Tsang et al. (2020) Michael Tsang, Sirisha Rambhatla, and Yan Liu. 2020. How does this interaction affect me? interpretable attribution for feature interactions. In Advances in Neural Information Processing Systems.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems.
  • Wang and Wang (2018) Dingkang Wang and Yusu Wang. 2018. An improved cost function for hierarchical cluster trees.
  • Xilong et al. (2023) Zhang Xilong, Liu Ruochen, Liu Jin, and Liang Xuefeng. 2023. Interpreting positional information in perspective of word order. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
  • Zhang et al. (2021a) Die Zhang, Hao Zhang, Huilin Zhou, Xiaoyi Bao, Da Huo, Ruizhao Chen, Xu Cheng, Mengyue Wu, and Quanshi Zhang. 2021a. Building interpretable interaction trees for deep nlp models.
  • Zhang et al. (2021b) Die Zhang, Hao Zhang, Huilin Zhou, Xiaoyi Bao, Da Huo, Ruizhao Chen, Xu Cheng, Mengyue Wu, and Quanshi Zhang. 2021b. Building interpretable interaction trees for deep nlp models. Proceedings of the AAAI Conference on Artificial Intelligence.
  • Zhang et al. (2015) Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems.

Appendix A Proof

First, we prove that the conclusion holds for $n=3$, and then we generalize to the case of $n>3$ by induction.

Notation. Because of the specific form of the binary tree we are solving for, each node permutation $\pi$ corresponds to a unique candidate tree. For a tree with $n$ leaves, we define $\pi_n$ as the corresponding permutation.

We denote the permutation constructed in Algorithm 1 as $\pi_n^*$ and its prefix permutation as $\pi_m^*$.

Base Case. We start the discussion from the left case in Figure 7. The cost can be expanded into:

\begin{equation}
\begin{split}
C_D(\pi_3^*; e) &= \sum_{ijk} (e_{ik} + e_{jk}) + 2\sum_{ij} e_{ij} \\
&= \sum_{ijk} 2e_{ij} + e_{ik} + e_{jk}
\end{split}
\tag{20}
\end{equation}

Notice that $e_{ij}$ is the smallest among $e_{ij}$, $e_{ik}$, $e_{jk}$, and that among $\{i,j \mid k\}$, $\{i,k \mid j\}$, $\{j,k \mid i\}$ only one will hold true. We can conclude that $\pi_3^*$ is the solution that minimizes the cost.
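The base case can be checked numerically. The sketch below is illustrative only and assumes that, for a single triple, the final form of Eq. (20) applies directly: the cost is twice the weight of the pair merged first plus the weights of the other two pairs. The function names and the toy weights are hypothetical. Under this assumption every candidate costs $e_{ij}+e_{ik}+e_{jk}$ plus the weight of its merged pair, so the cheapest tree merges the pair with the smallest weight.

```python
from itertools import combinations


def triple_cost(e, merged, leaves):
    """Cost of one triple in the form of Eq. (20): sum of all pair weights
    plus one extra copy of the weight of the pair that is merged first."""
    total = sum(e[frozenset(p)] for p in combinations(leaves, 2))
    return total + e[frozenset(merged)]


def best_first_merge(e, leaves=("i", "j", "k")):
    """Enumerate the three candidate trees {a,b | c} and return the cheapest."""
    costs = {pair: triple_cost(e, pair, leaves)
             for pair in combinations(leaves, 2)}
    return min(costs, key=costs.get), costs


# Toy weights (hypothetical): e_ij is the smallest of the three.
e = {frozenset("ij"): 0.2, frozenset("ik"): 0.9, frozenset("jk"): 0.7}
pair, costs = best_first_merge(e)
print(pair, costs)
```

With these toy weights the selected pair is the one with the smallest weight, which matches the argument above.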

Induction Step. We assume that the tree corresponding to the permutation $\pi_m$ has the smallest cost. To prove that the tree corresponding to $\pi_{m+1}$ also has the smallest cost, we use a proof by contradiction. We define the tree's levels as $L_1, \cdots, L_{n-1}$ in Figure 7. First, we introduce the following lemma:

Lemma. We denote the $\gamma$-th step permutation produced in Algorithm 1 as $\pi_\gamma^*$, and its corresponding tree cost as $C(\pi_\gamma^*)$. If we swap the nodes at levels $L_s$ and $L_t$, $s<t$, and denote the resulting sequence as ${\pi_\gamma^*}'$, then $C({\pi_\gamma^*}') > C(\pi_\gamma^*)$.

Proof.

We split the cost after the swap into three parts: the triples that include neither $s$ nor $t$, the triples that include $s$, and the triples that include $t$, denoted as $C_1$, $C_2$ and $C_3$. For ease of proof, we denote the sequence to the left of $s$ as $A=\pi^*_{\gamma,1:s-1}$, and the sequence between $s$ and $t$ as $B=\pi^*_{\gamma,s+1:t-1}$. Clearly $C_1$ remains unchanged. For $C_2$, before and after the swap we have:

\begin{equation}
C_2 = \sum_{i,j\in A,\, s} e(\cdot) + \sum_{i\in A,\, s,\, j\in B} e(\cdot) + \sum_{s,\, i,j\in B} e(\cdot),
\tag{21}
\end{equation}
\begin{equation}
C_2' = \sum_{i,j\in A,\, s} e(\cdot) + \sum_{i\in A,\, j\in B,\, s} e(\cdot) + \sum_{i,j\in B,\, s} e(\cdot)
\tag{22}
\end{equation}

By subtracting, we obtain:

\begin{equation}
\begin{split}
C_2' - C_2 &= \Big(\sum_{i\in A,\, j\in B,\, s} e(\cdot) - \sum_{i\in A,\, s,\, j\in B} e(\cdot)\Big) + \\
&\Big(\sum_{i,j\in B,\, s} e(\cdot) - \sum_{s,\, i,j\in B} e(\cdot)\Big) \geq 0.
\end{split}
\tag{23}
\end{equation}

Similarly we obtain:

\begin{equation}
\begin{split}
C_3' - C_3 &= \Big(\sum_{i\in A,\, t,\, j\in B} e(\cdot) - \sum_{i\in A,\, j\in B,\, t} e(\cdot)\Big) + \\
&\Big(\sum_{t,\, i,j\in B} e(\cdot) - \sum_{i,j\in B,\, t} e(\cdot)\Big) \geq 0.
\end{split}
\tag{24}
\end{equation}

Now we prove that $\pi_{m+1}$ has the smallest cost. If it does not, then the cost could be reduced by swapping the node at the last level with a previous node. There are two cases. When the swapped node is from the first level (e.g., $j$), the difference in cost before and after the swap becomes:

\begin{equation}
\begin{split}
\Delta C &= \Big(\sum_{i\in C,\, m+1,\, j\in D} e(\cdot) - \sum_{i\in C,\, j\in D,\, m+1} e(\cdot)\Big) + \\
&\Big(\sum_{t,\, i,j\in D} e(\cdot) - \sum_{i,j\in D,\, t} e(\cdot)\Big) \geq 0,
\end{split}
\tag{25}
\end{equation}

where $C=\pi_{m+1,1}^*$ and $D=\pi_{m+1,3:m}^*$. Similarly, when the swapped node is located at any other level, the cost after the swap will not decrease. This means that $C(\pi_{m+1})$ cannot be made smaller by swapping leaves from different levels, thus $\pi_{m+1}$ has the smallest cost.

Figure 7: Examples for $\pi_3$, $\pi_4$ and $\pi_n$.

The primary difference is that the edge weights in our graph Graham and Hell (1985) are not all known in advance but are dynamically generated.
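To illustrate what a spanning-tree construction with weights generated on demand can look like, the following is a minimal sketch and not the paper's Algorithm 1. It runs a textbook Prim's algorithm in which each edge weight is computed only when it is needed, by calling a weight function. The weight function here is the standard Poincaré-ball distance $d(u,v)=\operatorname{arcosh}\big(1+\frac{2\lVert u-v\rVert^2}{(1-\lVert u\rVert^2)(1-\lVert v\rVert^2)}\big)$; the function names and the toy points are assumptions made for the example.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit ball."""
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    sq_u = sum(a * a for a in u)
    sq_v = sum(b * b for b in v)
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v)))


def lazy_prim(points, weight=poincare_distance):
    """Prim's algorithm; edge weights are computed on demand via `weight`
    instead of being read from a precomputed matrix."""
    n = len(points)
    if n == 0:
        return []
    in_tree = [False] * n
    best = [math.inf] * n      # cheapest known connection to the tree
    parent = [-1] * n
    in_tree[0] = True
    for v in range(1, n):
        best[v] = weight(points[0], points[v])
        parent[v] = 0
    edges = []
    for _ in range(n - 1):
        # Pick the outside node with the cheapest connection to the tree.
        v = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[v] = True
        edges.append((parent[v], v, best[v]))
        # Generate weights from the newly added node to the remaining nodes.
        for w in range(n):
            if not in_tree[w]:
                d = weight(points[v], points[w])
                if d < best[w]:
                    best[w] = d
                    parent[w] = v
    return edges


# Toy points on one diameter of the disk (hypothetical data).
points = [(0.0, 0.0), (0.1, 0.0), (0.5, 0.0), (0.6, 0.0)]
edges = lazy_prim(points)
print(edges)
```

Because the toy points lie on a single geodesic, the resulting tree is the chain that links each point to its neighbor.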

Appendix B Visualization

(a) A negative example: The redeeming feature of Chan’s films has always been the action, but the stunts in the tuxedo seem tired and what’s worse, routine.
(b) A positive example: The production values are of the highest and the performances attractive without being memorable.
Figure 8: PE for BERT on two examples from the Rotten Tomatoes dataset.
(a) A negative example: Service here sucks \n I love the food still \n\n but the service is so bad.
(b) A positive example: Flavors are great but every time I come this location it is disgusting machines are dirty.
Figure 9: PE for BERT on two examples from the Yelp dataset.

Appendix C Implementation Details

In this work, all language models are implemented with Transformers. All our experiments are performed on a single A800 GPU. The results are reported over 5 random seeds.

For fine-tuning the projection matrix $\boldsymbol{P}_c$, we train for 5 epochs using the RiemannianAdam optimizer with an initial learning rate of 1e-3 and a batch size of 32. For fine-tuning the projection matrix $\boldsymbol{P}_s$, we use the Penn Treebank dataset and train for 40 epochs using the Adam optimizer with an initial learning rate of 1e-3. We set $d_{out}$ to 64. We use grid search to select $\alpha_1, \alpha_2, \beta_1, \beta_2 \in \{0, 0.1, 0.2, 0.3, 0.4, 0.5\}$.
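The grid over the four coefficients has $6^4 = 1296$ combinations. A minimal sketch of such a search is shown below. It assumes that a higher score is better, and `toy_score` is a hypothetical stand-in for whatever validation metric is actually used, which this appendix does not specify.

```python
from itertools import product

GRID = (0.0, 0.1, 0.2, 0.3, 0.4, 0.5)


def grid_search(score_fn, grid=GRID):
    """Exhaustively evaluate every (alpha1, alpha2, beta1, beta2) in grid^4
    and return the configuration with the highest score."""
    best_cfg, best_score = None, float("-inf")
    for a1, a2, b1, b2 in product(grid, repeat=4):
        score = score_fn(a1, a2, b1, b2)
        if score > best_score:
            best_cfg, best_score = (a1, a2, b1, b2), score
    return best_cfg, best_score


def toy_score(a1, a2, b1, b2):
    """Hypothetical objective with a known optimum, used only for the demo."""
    return -((a1 - 0.3) ** 2 + (a2 - 0.1) ** 2 + (b1 - 0.5) ** 2 + b2 ** 2)


best_cfg, best_score = grid_search(toy_score)
print(best_cfg)
```

Exhaustive search is affordable here because the grid is small; a larger grid would call for random or staged search instead.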

Appendix D HA Example

Figure 10: A hierarchy example from HEDGE Chen et al. (2020). The background color of the words and phrases represents emotional polarity, with cool colors indicating positive and warm colors indicating negative.

Appendix E LIME Explanation

Figure 11: A LIME explanation example from a random forest classifier. It can be observed that two stop words (i.e., “is” and “were”) are identified as having positive and negative emotional polarity, respectively.