License: CC BY 4.0
arXiv:2309.15312v3 [cs.LG] 20 Dec 2023

MAPTree: Beating “Optimal” Decision Trees with Bayesian Decision Trees

Colin Sullivan*¹, Mo Tiwari*¹, Sebastian Thrun¹ (*equal contribution)
Abstract

Decision trees remain one of the most popular machine learning models today, largely due to their out-of-the-box performance and interpretability. In this work, we present a Bayesian approach to decision tree induction via maximum a posteriori inference of a posterior distribution over trees. We first demonstrate a connection between maximum a posteriori inference of decision trees and AND/OR search. Using this connection, we propose an AND/OR search algorithm, dubbed MAPTree, which is able to recover the maximum a posteriori tree. Lastly, we demonstrate the empirical performance of the maximum a posteriori tree both on synthetic data and in real-world settings. On 16 real-world datasets, MAPTree either outperforms baselines or demonstrates comparable performance but with much smaller trees. On a synthetic dataset, MAPTree also demonstrates greater robustness to noise and better generalization than existing approaches. Finally, MAPTree recovers the maximum a posteriori tree faster than existing sampling approaches and, in contrast with those algorithms, is able to provide a certificate of optimality. The code for our experiments is available at https://github.com/ThrunGroup/maptree.

1 Introduction

Decision trees are amongst the most widely used machine learning models today due to their empirical performance, generality, and interpretability. A decision tree is a binary tree in which each internal node corresponds to an if/then/else comparison on a feature value; a label for a datapoint is produced by determining the corresponding leaf node into which it falls. The predicted label is usually the majority vote (respectively, mean) of the label of training datapoints at the leaf node in classification (respectively, regression).
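As a minimal sketch of the prediction procedure just described, the following Python snippet routes a binary feature vector from the root to a leaf and returns the leaf's majority label. The `Node` class, `majority_label`, and `predict` are hypothetical names introduced here for illustration; they are not taken from the authors' implementation.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Node:
    """A binary decision-tree node (hypothetical structure, for illustration only)."""
    feature: Optional[int] = None   # feature tested at an internal node; None at a leaf
    left: Optional["Node"] = None   # child followed when x[feature] == 0
    right: Optional["Node"] = None  # child followed when x[feature] == 1
    label: Optional[int] = None     # prediction stored at a leaf


def majority_label(labels: Sequence[int]) -> int:
    """Majority vote over the training labels at a leaf (ties go to the smaller label)."""
    counts = Counter(labels)
    return min(counts, key=lambda k: (-counts[k], k))


def predict(node: Node, x: Sequence[int]) -> int:
    """Follow the if/then/else comparisons from the root down to a leaf."""
    while node.feature is not None:
        node = node.right if x[node.feature] == 1 else node.left
    return node.label


# A depth-1 tree (a "stump") that splits on feature 0; the leaf labels come
# from made-up training labels that fell into each leaf.
stump = Node(
    feature=0,
    left=Node(label=majority_label([0, 0, 1])),
    right=Node(label=majority_label([1, 1, 0])),
)
```

For regression, the same traversal would apply with the leaf storing a mean instead of a majority vote.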

Despite recent advances in neural networks, decision trees remain a popular choice amongst machine learning practitioners. Decision trees form the backbone of more complex ensemble models such as Random Forest (Breiman 2001) and XGBoost (Chen and Guestrin 2016), which have been the leading models in many machine learning competitions and often outperform neural networks on tabular data (Grinsztajn, Oyallon, and Varoquaux 2022). Decision trees naturally work with complex data where the features can be of mixed data types, e.g., binary, categorical, or continuous. Furthermore, decision trees are highly interpretable and the prediction-generating process can be inspected, which can be a necessity in domains such as law and healthcare. In addition, inference in decision trees is highly efficient as it relies only on feature value comparisons. Given decision trees’ popularity, an improvement upon existing decision tree approaches would have widespread impact.

Contributions: In this work, we:

  • Formalize a connection between maximum a posteriori inference of Bayesian Classification and Regression Trees (BCART) and AND/OR search problems,

  • Propose an algorithm, dubbed MAPTree, for search on AND/OR graphs that recovers the maximum a posteriori tree of the BCART posterior over decision trees,

  • Demonstrate that MAPTree is significantly faster than previous sampling-based approaches,

  • Demonstrate that the tree recovered by MAPTree either a) outperforms current state-of-the-art algorithms in performance, or b) demonstrates comparable performance but with smaller trees, and

  • Provide a heavily optimized C++ implementation that is also callable from Python for practitioners.

2 Related Work

In this work, we focus on the construction of individual decision trees. We compare our proposed algorithm with four main classes of prior algorithms: greedy algorithms, “Optimal” Decision Trees (ODTs), “Optimal” Sparse Decision Trees (OSDTs), and sampling-based approaches.

The most popular method for constructing decision trees is a greedy approach that recursively splits nodes based on a heuristic such as Gini impurity or entropy (in classification) or mean-squared error (in regression) (Quinlan 1986). However, individual decision trees constructed in this manner often overfit the training data; ensemble methods such as Random Forest and XGBoost attempt to ameliorate overfitting but are significantly more complex than a single decision tree (Breiman 2001; Chen and Guestrin 2016).

So-called “optimal” decision trees reformulate the problem of decision tree induction as a global optimization problem, i.e., to find the tree of a given maximum depth that maximizes a global objective function, such as training accuracy (Bertsimas and Dunn 2017; Verwer and Zhang 2019; Verhaeghe et al. 2020; Aglin, Nijssen, and Schaus 2020). Though this problem is NP-Hard in general (Hyafil and Rivest 1976), existing approaches can find the global optimum of shallow trees (depth $\leq 5$) on medium-sized datasets with thousands of datapoints and tens of features. The original ODT approaches were based on mixed integer programming or binary linear program formulations (Verhaeghe et al. 2020; Nijssen and Fromont 2007; Bertsimas and Dunn 2017; Verwer and Zhang 2019). Other work attempts to improve upon these methods using caching branch-and-bound search (Aglin, Nijssen, and Schaus 2020), constraint programming with AND/OR search (Verhaeghe et al. 2020), or dynamic programming with bounds (van der Linden, de Weerdt, and Demirović 2022). ODTs have been shown to outperform their greedily constructed counterparts with smaller trees (Verhaeghe et al. 2020; Verwer and Zhang 2019) but still suffer from several drawbacks. First, choosing the maximum depth hyperparameter is nontrivial, even with cross-validation, and the maximum depth cannot be set too large as the runtime of these algorithms scales exponentially with depth. Furthermore, ODTs often suffer from overfitting, especially when the maximum depth is set too large. Amongst ODT approaches, Verhaeghe et al. (2020) formulate the search for an optimal decision tree in terms of an AND/OR graph; their approach is the most similar to ours, but still suffers from the aforementioned drawbacks. Additionally, many ODT algorithms exhibit poor anytime behavior (Kiossou et al. 2023).

Optimal sparse decision trees attempt to adapt ODT approaches to train smaller and sparser trees by incorporating a sparsity penalty in their objectives. As a result, OSDTs are smaller and less prone to overfitting than ODTs (Hu, Rudin, and Seltzer 2019; Lin et al. 2020). These approaches, however, often underfit the data (Hu, Rudin, and Seltzer 2019; Lin et al. 2020).

Another class of approaches, called Bayesian Classification and Regression Trees (BCART), introduces a posterior over tree structures given the data and samples trees from this posterior. Initially, BCART methods were observed to generate better trees than greedy methods (Denison, Mallick, and Smith 1998). Many variations of the BCART methodology were developed using sampling methods based on Markov-Chain Monte Carlo (MCMC), such as Metropolis-Hastings (Pratola 2016) and others (Geels, Pratola, and Herbei 2022; Lakshminarayanan, Roy, and Teh 2013). These methods, however, often suffer from exponentially long mixing times in practice and become stuck in local minima (Kim and Rockova 2023). In one study, the posterior over trees was represented as a lattice over itemsets (Nijssen 2008). This approach discovered the maximum a posteriori tree within the hypothesis space of decision trees. However, this approach required enumerating and storing the entire space of decision trees and therefore placed stringent constraints on the search space of possible trees, based on leaf node support and maximum depth. Our method utilizes the same posterior over tree structures introduced by BCART. In contrast with prior work, however, we are able to recover the provably maximum a posteriori tree from this posterior in the unconstrained setting.

3 Preliminaries and Notation

In this paper, we focus on the binary classification task, though our techniques extend to multi-class classification and regression. We also focus on binary datasets, as is common in the decision tree literature (Verhaeghe et al. 2020; Nijssen 2008; Nijssen and Fromont 2007) since many datasets can be binarized via bucketing, one-hot encoding, and other techniques.

General notation: We assume we are given a binary dataset $\mathcal{X}\in\{0,1\}^{N\times F}$ with $N$ samples, $F$ features, and associated binary labels $\mathcal{Y}\in\{0,1\}^{N}$. We let $[u]\coloneqq\{1,\ldots,u\}$, let $\mathcal{I}\subseteq[N]$ denote the indices of a subsample of the dataset, and let $(x_i,y_i)$ denote the $i$th sample and its label. We define $\mathcal{X}|_{\mathcal{I}}\coloneqq\{x_i:i\in\mathcal{I}\}\subset\mathcal{X}$, $\mathcal{Y}|_{\mathcal{I}}\coloneqq\{y_i:i\in\mathcal{I}\}\subset\mathcal{Y}$, and $\mathcal{I}|_{f=k}\coloneqq\{i:i\in\mathcal{I}\text{ and }(x_i)_f=k\}$, for $k\in\{0,1\}$. Finally, we let $c^k(\mathcal{I})$ be the count of points in $\mathcal{I}$ with label $k\in\{0,1\}$, i.e., $c^k(\mathcal{I})=|\{i:i\in\mathcal{I}\text{ and }y_i=k\}|$, and $\mathcal{V}(\mathcal{I})$ be the set of nontrivial feature splits of the samples in $\mathcal{I}$, i.e., the set of features $f$ such that neither $\mathcal{I}|_{f=0}$ nor $\mathcal{I}|_{f=1}$ is empty.
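The general notation can be made concrete with a small sketch. The toy dataset and the function names `restrict`, `count`, and `valid_splits` below are assumptions introduced for illustration. Indices are 0-based here, whereas the text uses $[N]=\{1,\ldots,N\}$.

```python
from typing import List, Set

# Toy binary dataset (made up for illustration): N = 4 samples, F = 3 features.
X = [
    [0, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 0],
]
Y = [0, 1, 1, 0]


def restrict(I: Set[int], f: int, k: int) -> Set[int]:
    """I|_{f=k}: the indices in I whose feature f equals k."""
    return {i for i in I if X[i][f] == k}


def count(I: Set[int], k: int) -> int:
    """c^k(I): the number of indices in I whose label equals k."""
    return sum(1 for i in I if Y[i] == k)


def valid_splits(I: Set[int]) -> List[int]:
    """V(I): features for which both I|_{f=0} and I|_{f=1} are nonempty."""
    num_features = len(X[0])
    return [f for f in range(num_features)
            if restrict(I, f, 0) and restrict(I, f, 1)]
```

On the full index set every feature is a nontrivial split, while on a subsample in which a feature is constant that feature drops out of $\mathcal{V}(\mathcal{I})$.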

Tree notation: We let $T=\{n_1,n_2,\dots,n_{M+L}\}$ be a binary classification tree represented as a collection of its nodes and use $n$ to refer to a node in $T$, $m$ to refer to one of the $M$ internal nodes in $T$, and $l$ to refer to one of the $L$ leaf nodes in $T$. Furthermore, we use $\mathcal{I}(n)$ to denote the indices of the samples in $\mathcal{X}$ that reach node $n$ in $T$, namely $\{i:x_i\in\text{space}(n)\}$, where $\text{space}(n)$ is the subset of feature space that reaches node $n$ in $T$. We also use $c^k_l$ to denote the count of points assigned to leaf $l$ with label $k\in\{0,1\}$ (i.e., $c^k_l=c^k(\mathcal{I}(l))$), $T_{\text{internal}}=\{m_1,m_2,\dots,m_M\}\subset T$ to denote the set of internal nodes in tree $T$, and $T_{\text{leaves}}=\{l_1,l_2,\dots,l_L\}\subset T$ to denote the set of all leaf nodes in tree $T$. Finally, we use $d(n)$ to denote the depth of node $n$ in $T$.
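To illustrate the tree notation, the sketch below routes a toy dataset through a small tree and reports, for each leaf $l$, the depth $d(l)$, the index set $\mathcal{I}(l)$, and the counts $c^1_l$ and $c^0_l$. The nested-tuple tree encoding and the `leaves` generator are hypothetical choices made for brevity, and the root is taken to have depth 0.

```python
# Toy binary data (made up for illustration): 4 samples, 2 features.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
Y = [0, 1, 1, 1]

# A tree is either None (a leaf) or a tuple
# (feature, subtree for feature value 0, subtree for feature value 1).


def leaves(tree, I, depth=0):
    """Yield (d(l), I(l), c^1_l, c^0_l) for every leaf l, left to right."""
    if tree is None:
        c1 = sum(Y[i] for i in I)
        yield depth, I, c1, len(I) - c1
        return
    f, zero_branch, one_branch = tree
    yield from leaves(zero_branch, {i for i in I if X[i][f] == 0}, depth + 1)
    yield from leaves(one_branch, {i for i in I if X[i][f] == 1}, depth + 1)


# Split on feature 0 at the root, then on feature 1 in the branch where feature 0 is 1.
# This tree has M = 2 internal nodes and L = 3 leaves.
tree = (0, None, (1, None, None))
result = list(leaves(tree, set(range(len(X)))))
```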

3.1 AND/OR Graph Search

We briefly recapitulate the concept of AND/OR graphs and a search algorithm for AND/OR graphs, AO*. AND/OR graph search can be viewed as a generalization of the shortest path problem that allows nodes consisting of independent subproblems to be decomposed and solved separately. Thus, a solution of an AND/OR graph is not a path but rather a subgraph $\mathcal{S}$ with cost, denoted $\text{cost}(\mathcal{S})$, equal to the sum of the costs of its edges. AND/OR graphs contain two types of nodes: terminal nodes and nonterminal nodes. Nonterminal nodes can be further subdivided into AND nodes and OR nodes, with a special OR node designated as the root or start node $r$. For a given AND/OR graph $\mathcal{G}$, a solution graph $\mathcal{S}$ is a connected subset of nodes of $\mathcal{G}$ in which:

  1. $r\in\mathcal{S}$,

  2. for every AND node $a\in\mathcal{S}$, all the immediate children of $a$ are also in $\mathcal{S}$, and

  3. for every non-terminal OR node $o\in\mathcal{S}$, exactly one of $o$’s children is also in $\mathcal{S}$.

Intuitively, the children of an AND node $a$ represent subtasks that must all be solved for $a$ to be satisfied (e.g., simultaneous prerequisites), and the children of an OR node $o$ represent mutually exclusive satisfying choices.
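The three conditions defining a solution graph can be checked mechanically. The sketch below does so on a small hypothetical AND/OR graph. The `GRAPH` dictionary, the node names, and `is_solution_graph` are illustrative assumptions and do not correspond to the graphs in the paper's figures.

```python
from typing import Dict, List, Set, Tuple

# node -> (kind, children); a made-up toy graph for illustration.
GRAPH: Dict[str, Tuple[str, List[str]]] = {
    "r":  ("OR",  ["a1", "t0"]),
    "a1": ("AND", ["o1", "o2"]),
    "o1": ("OR",  ["t1"]),
    "o2": ("OR",  ["t2", "t3"]),
    "t0": ("TERMINAL", []),
    "t1": ("TERMINAL", []),
    "t2": ("TERMINAL", []),
    "t3": ("TERMINAL", []),
}


def is_solution_graph(S: Set[str], root: str = "r") -> bool:
    """Check the three solution-graph conditions for a candidate node set S."""
    if root not in S:                       # condition 1: the root is included
        return False
    for n in S:
        kind, children = GRAPH[n]
        chosen = [c for c in children if c in S]
        if kind == "AND" and len(chosen) != len(children):
            return False                    # condition 2: all children of an AND node
        if kind == "OR" and len(chosen) != 1:
            return False                    # condition 3: exactly one child of an OR node
    # Connectivity: every node in S must be reachable from the root through S.
    reached, stack = {root}, [root]
    while stack:
        for c in GRAPH[stack.pop()][1]:
            if c in S and c not in reached:
                reached.add(c)
                stack.append(c)
    return reached == S
```

In this toy graph, `{"r", "t0"}` is a valid solution, as is the set that selects `a1` together with one terminal under each of `o1` and `o2`.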

Figure 1: An example (general) AND/OR graph, with AND nodes drawn as squares, OR nodes drawn as solid circles, and terminal nodes drawn as dashed circles. The minimal cost solution is highlighted in red and has cost $0+0+3+4+1+2=10$. This diagram demonstrates an AND/OR graph where the root node $r$ is an AND node; in MAPTree, the root node is an OR node.

One of the most popular AND/OR graph search algorithms is AO* (Mahanti and Bagchi 1985, 1983). The AO* algorithm explores potential paths in an AND/OR graph in a best-first fashion, guided by a heuristic. When a new node is explored, its children are revealed and the cost for that node and all of its ancestors is updated; the search then continues. This process is repeated until the root node is marked as solved, indicating that no immediately accessible nodes could lead to an increase in heuristic value. The AO* algorithm is guaranteed to find the minimal cost solution if the heuristic is admissible, i.e., the heuristic estimate of cost is always less than or equal to the actual cost of a node. For more details on the AO* algorithm, we refer the reader to Mahanti and Bagchi (1985). An example AND/OR graph is given in Figure 1 with its minimal cost solution shown in red.
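To make the notion of a minimal cost solution concrete, the sketch below evaluates a small hypothetical AND/OR graph by exhaustive memoized recursion. This is not AO*: it visits every node, whereas AO* uses the heuristic to avoid expanding unpromising nodes. It also assumes a tree-shaped graph, so that no edge is shared between subproblems and summing child costs equals summing edge costs of the solution subgraph. The graph, its edge costs, and `min_cost` are illustrative assumptions.

```python
from functools import lru_cache
from typing import Dict, List, Tuple

# node -> (kind, [(child, edge_cost), ...]); toy values chosen for illustration.
GRAPH: Dict[str, Tuple[str, List[Tuple[str, float]]]] = {
    "r":  ("OR",  [("a1", 1.0), ("t0", 6.0)]),
    "a1": ("AND", [("o1", 0.0), ("o2", 0.0)]),
    "o1": ("OR",  [("t1", 2.0)]),
    "o2": ("OR",  [("t2", 4.0), ("t3", 1.5)]),
    "t0": ("TERMINAL", []),
    "t1": ("TERMINAL", []),
    "t2": ("TERMINAL", []),
    "t3": ("TERMINAL", []),
}


@lru_cache(maxsize=None)
def min_cost(node: str) -> float:
    """Cost of the cheapest solution subgraph rooted at `node`."""
    kind, edges = GRAPH[node]
    if kind == "TERMINAL":
        return 0.0
    totals = [cost + min_cost(child) for child, cost in edges]
    # AND node: every child must be solved, so costs add up.
    # OR node: exactly one child is chosen, so take the cheapest.
    return sum(totals) if kind == "AND" else min(totals)
```

Here the cheapest solution goes through `a1` (cost 1.0 + 2.0 + 1.5 = 4.5) rather than directly to `t0` (cost 6.0).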

Additional AND/OR graph notation: In addition to the notation defined above, we use $t$ to refer to a terminal node. When searching over an AND/OR graph, we use $\mathcal{G}$ to refer to the implicit (entire) AND/OR graph and $\mathcal{G}'\subset\mathcal{G}$ to refer to the explicit (explored) AND/OR graph, as in prior work.

3.2 Bayesian Classification and Regression Trees (BCART)

Bayesian Decision Trees are a family of statistical models of decision trees introduced in Chipman, George, and McCulloch (1998) and Denison, Mallick, and Smith (1998). A Bayesian Decision Tree (BDT) is a pair $(T,\Theta)$ where $T$ is a tree and $\Theta=(\theta_{l_1},\theta_{l_2},\dots,\theta_{l_L})$ parameterizes the independent probability distributions over labels in the leaf nodes of tree $T$. We are interested in the binary classification setting, where each $\theta_l$ parameterizes a Bernoulli distribution $\text{Ber}(\theta_l)$ with $\theta_l\in[0,1]$. We denote by $\text{Beta}(\rho^1,\rho^0)$ the Beta distribution with parameters $\rho^1,\rho^0\in\mathbb{R}^{+}$ and by $B(c^1,c^0)$ the Beta function.

We note that a BDT’s tree $T$ partitions the data such that the sample subsets $\mathcal{I}(l_1),\mathcal{I}(l_2),\dots,\mathcal{I}(l_L)$ fall into leaves $l_1,l_2,\dots,l_L$. Furthermore, a BDT defines a probability distribution over the respective labels occurring in its leaves: each label in leaf $l$ is sampled from $\text{Ber}(\theta_l)$. Every BDT therefore induces a likelihood function, given in Theorem 1.

Theorem 1.

The likelihood of a BDT $(T,\Theta)$ generating labels $\mathcal{Y}$ given features $\mathcal{X}$ is

$$
\begin{align}
P(\mathcal{Y}|\mathcal{X},T,\Theta) &= \prod_{l\in T_{\text{leaves}}}\prod_{i\in\mathcal{I}(l)}\theta_l^{y_i}\left(1-\theta_l\right)^{1-y_i} \tag{1}\\
&= \prod_{l\in T_{\text{leaves}}}\theta_l^{c^1_l}\left(1-\theta_l\right)^{c^0_l} \tag{2}
\end{align}
$$
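As a quick numerical sanity check of Theorem 1, the sketch below evaluates the per-sample product (Equation 1) and the count-based product (Equation 2) on made-up leaf parameters and labels and confirms they agree. The function names and the toy values are assumptions for illustration.

```python
import math
from typing import List, Tuple

# Each leaf is (theta_l, labels of the training points that fall into the leaf).
Leaf = Tuple[float, List[int]]


def likelihood_per_sample(leaves: List[Leaf]) -> float:
    """Equation 1: product over leaves and over the labels in each leaf."""
    p = 1.0
    for theta, labels in leaves:
        for y in labels:
            p *= theta ** y * (1.0 - theta) ** (1 - y)
    return p


def likelihood_from_counts(leaves: List[Leaf]) -> float:
    """Equation 2: the same product grouped by the counts c^1_l and c^0_l."""
    p = 1.0
    for theta, labels in leaves:
        c1 = sum(labels)
        c0 = len(labels) - c1
        p *= theta ** c1 * (1.0 - theta) ** c0
    return p


# Two leaves with made-up parameters and labels.
toy_leaves = [(0.8, [1, 1, 0, 1]), (0.25, [0, 0, 1])]
```

For these values both expressions equal $0.8^3\cdot 0.2\cdot 0.25\cdot 0.75^2 = 0.0144$.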

The specific formulation of BCART also assumes a prior distribution over $\Theta$, i.e., that $\theta\sim\text{Beta}(\rho^1,\rho^0)$ for each $\theta\in\Theta$. With this assumption, we can derive the likelihood function $P(\mathcal{Y}|\mathcal{X},T)$; see Theorem 2.

Theorem 2.

Assume that $\theta\sim\text{Beta}(\rho^1,\rho^0)$ for each $\theta\in\Theta$. Then the likelihood of a tree $T$ generating labels $\mathcal{Y}$ given features $\mathcal{X}$ is

$$
P(\mathcal{Y}|\mathcal{X},T)=\prod_{l\in T_{\text{leaves}}}\frac{B(c^1_l+\rho^1,c^0_l+\rho^0)}{B(\rho^1,\rho^0)} \tag{3}
$$

Theorems 1 and 2 are proven in the appendices; we note they have been observed in different forms in prior work (Chipman, George, and McCulloch 1998).
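As a brief sketch of why Theorem 2 follows from Theorem 1 (this is not the appendix proof), one can integrate the leaf parameters out of Equation 2. The sketch assumes the standard density of $\text{Beta}(\rho^1,\rho^0)$, namely $\theta^{\rho^1-1}(1-\theta)^{\rho^0-1}/B(\rho^1,\rho^0)$, and that the $\theta_l$ are independent across leaves:

```latex
\begin{align*}
P(\mathcal{Y}|\mathcal{X},T)
&= \int P(\mathcal{Y}|\mathcal{X},T,\Theta)\,P(\Theta)\,d\Theta \\
&= \prod_{l\in T_{\text{leaves}}}\int_0^1 \theta_l^{c^1_l}(1-\theta_l)^{c^0_l}\,
   \frac{\theta_l^{\rho^1-1}(1-\theta_l)^{\rho^0-1}}{B(\rho^1,\rho^0)}\,d\theta_l \\
&= \prod_{l\in T_{\text{leaves}}}\frac{1}{B(\rho^1,\rho^0)}
   \int_0^1 \theta_l^{c^1_l+\rho^1-1}(1-\theta_l)^{c^0_l+\rho^0-1}\,d\theta_l \\
&= \prod_{l\in T_{\text{leaves}}}\frac{B(c^1_l+\rho^1,\,c^0_l+\rho^0)}{B(\rho^1,\rho^0)}
\end{align*}
```

The last step uses the integral definition of the Beta function, $B(a,b)=\int_0^1\theta^{a-1}(1-\theta)^{b-1}\,d\theta$.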

For notational convenience, we define a leaf count likelihood function $\ell_{\text{leaf}}(c^1,c^0)$ for integers $c^1$ and $c^0$:

$$
\ell_{\text{leaf}}(c^1,c^0)\coloneqq\frac{B(c^1+\rho^1,c^0+\rho^0)}{B(\rho^1,\rho^0)} \tag{4}
$$

and we can rewrite Equation 3 as

$$
P(\mathcal{Y}|\mathcal{X},T)=\prod_{l\in T_{\text{leaves}}}\ell_{\text{leaf}}(c^1_l,c^0_l) \tag{5}
$$
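In practice, Equations 4 and 5 are most conveniently evaluated in log space. The sketch below does so with the standard-library `math.lgamma`, using the identity $B(a,b)=\Gamma(a)\Gamma(b)/\Gamma(a+b)$. The function names and the default $\rho^1=\rho^0=1$ (a uniform prior on each $\theta_l$) are assumptions made for illustration, not values prescribed by the text.

```python
import math
from typing import Iterable, Tuple


def log_beta(a: float, b: float) -> float:
    """log B(a, b), computed through log-gamma for numerical stability."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def log_leaf_likelihood(c1: int, c0: int, rho1: float = 1.0, rho0: float = 1.0) -> float:
    """log of Equation 4 for a leaf holding c1 points with label 1 and c0 with label 0."""
    return log_beta(c1 + rho1, c0 + rho0) - log_beta(rho1, rho0)


def log_tree_likelihood(leaf_counts: Iterable[Tuple[int, int]],
                        rho1: float = 1.0, rho0: float = 1.0) -> float:
    """log of Equation 5: the sum of per-leaf log-likelihoods over (c1, c0) pairs."""
    return sum(log_leaf_likelihood(c1, c0, rho1, rho0) for c1, c0 in leaf_counts)
```

With the uniform prior, a leaf with counts $(c^1,c^0)=(2,1)$ has $\ell_{\text{leaf}}=B(3,2)/B(1,1)=1/12$, and an empty leaf has $\ell_{\text{leaf}}=1$.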

In this work, we utilize the original prior over trees of Chipman, George, and McCulloch (1998), given in Definition 3.

Definition 3.

The original BCART prior distribution over trees is

$$
P(T|\mathcal{X})=\left(\prod_{l\in T_{\text{leaves}}}p_{\text{leaf}}(d(l),\mathcal{I}(l))\right)\times\left(\prod_{m\in T_{\text{internal}}}p_{\text{inner}}(d(m),\mathcal{I}(m))\right)
$$

where

$$
\begin{align}
p_{\text{leaf}}(d,\mathcal{I}) &= \begin{cases}1, & \mathcal{V}(\mathcal{I})=\emptyset\\ 1-p_{\text{split}}(d), & \mathcal{V}(\mathcal{I})\neq\emptyset\end{cases} \tag{6}\\
p_{\text{inner}}(d,\mathcal{I}) &= \begin{cases}0, & \mathcal{V}(\mathcal{I})=\emptyset\\ p_{\text{split}}(d)/|\mathcal{V}(\mathcal{I})|, & \mathcal{V}(\mathcal{I})\neq\emptyset\end{cases} \tag{7}
\end{align}
$$

and

$$
p_{\text{split}}(d)=\alpha(1+d)^{-\beta} \tag{8}
$$
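Definition 3 translates directly into code. The following sketch implements Equations 6 to 8 and the log of the prior, taking $|\mathcal{V}(\mathcal{I})|$ as an integer argument. The hyperparameter values `ALPHA = 0.95` and `BETA = 0.5` are illustrative assumptions, as are all function names, and the root is taken to have depth 0.

```python
import math
from typing import Iterable, Tuple

ALPHA, BETA = 0.95, 0.5  # illustrative hyperparameters (an assumption of this sketch)


def p_split(d: int) -> float:
    """Equation 8: prior probability that a node at depth d splits."""
    return ALPHA * (1 + d) ** (-BETA)


def p_leaf(d: int, num_valid_splits: int) -> float:
    """Equation 6, with |V(I)| passed in as num_valid_splits."""
    return 1.0 if num_valid_splits == 0 else 1.0 - p_split(d)


def p_inner(d: int, num_valid_splits: int) -> float:
    """Equation 7: the split probability shared equally among the valid splits."""
    return 0.0 if num_valid_splits == 0 else p_split(d) / num_valid_splits


def _safe_log(p: float) -> float:
    """log p, returning -inf for zero-probability events instead of raising."""
    return math.log(p) if p > 0.0 else -math.inf


def log_prior(leaves: Iterable[Tuple[int, int]],
              internals: Iterable[Tuple[int, int]]) -> float:
    """log P(T | X), given (depth, |V(I(n))|) pairs for leaf and internal nodes."""
    total = sum(_safe_log(p_leaf(d, v)) for d, v in leaves)
    total += sum(_safe_log(p_inner(d, v)) for d, v in internals)
    return total
```

At any node with at least one valid split, the leaf probability and the probabilities of the $|\mathcal{V}(\mathcal{I})|$ possible splits sum to one, which the sketch preserves.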

Intuitively, $p_{\text{split}}(d)$ is the prior probability that a node at depth $d$ splits, and this probability is allocated equally amongst the valid splits. This choice of prior, $P(T|\mathcal{X})$, combined with the likelihood function in Equation 5 induces the posterior distribution over trees $P(T|\mathcal{Y},\mathcal{X})$:

$$
P(T|\mathcal{Y},\mathcal{X})\propto P(\mathcal{Y}|\mathcal{X},T)\,P(T|\mathcal{X}) \tag{9}
$$

Throughout our analysis, we treat the dataset $(\mathcal{X},\mathcal{Y})$ as fixed.
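As a small worked example of Equation 9, the sketch below compares the unnormalized log-posterior of two candidate trees on a made-up dataset with a single binary feature that predicts the label perfectly (4 samples per feature value). The hyperparameters $\alpha=0.95$, $\beta=0.5$, $\rho^1=\rho^0=1$ and all function names are assumptions chosen for illustration.

```python
import math

ALPHA, BETA, RHO1, RHO0 = 0.95, 0.5, 1.0, 1.0  # illustrative hyperparameters (assumed)


def log_beta(a: float, b: float) -> float:
    """log B(a, b) via log-gamma."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def log_leaf_lik(c1: int, c0: int) -> float:
    """log of Equation 4."""
    return log_beta(c1 + RHO1, c0 + RHO0) - log_beta(RHO1, RHO0)


def p_split(d: int) -> float:
    """Equation 8."""
    return ALPHA * (1 + d) ** (-BETA)


# Root: 4 points with label 1 and 4 with label 0, and one valid split.
# Children after the split: pure leaves with no valid splits remaining.

# Tree A: the root is a leaf. Prior factor: p_leaf = 1 - p_split(0).
log_post_leaf = math.log(1.0 - p_split(0)) + log_leaf_lik(4, 4)

# Tree B: a stump. Prior factors: p_inner = p_split(0) / 1 at the root
# and p_leaf = 1 at each child (log 1 = 0, so these terms vanish).
log_post_stump = math.log(p_split(0) / 1) + log_leaf_lik(4, 0) + log_leaf_lik(0, 4)
```

Under these assumptions the stump has unnormalized posterior $0.95\cdot\tfrac{1}{5}\cdot\tfrac{1}{5}=0.038$, far larger than the single leaf's $0.05\cdot\tfrac{1}{630}$, so the stump is the maximum a posteriori tree among the two.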

Figure 2: Example of the defined BCART AND/OR graph $\mathcal{G}_{\mathcal{X},\mathcal{Y}}$. OR nodes are represented as circles with solid borders, terminal nodes as circles with dashed borders, and AND nodes as squares. In this dataset, two feature splits are possible at the root node ($f_0$ and $f_1$) and no further splits are possible at deeper nodes. The best solution on this AND/OR graph is highlighted in red and corresponds to a stump which splits the root node, corresponding to the entire dataset, on feature $f_1$.

4 Connecting BCART with AND/OR Graphs

Given a dataset $(\mathcal{X},\mathcal{Y})$, we will now construct a special AND/OR graph $\mathcal{G}_{\mathcal{X},\mathcal{Y}}$. We will then show that a minimal cost solution graph on $\mathcal{G}_{\mathcal{X},\mathcal{Y}}$ corresponds directly with the maximum a posteriori tree given our choice of prior distributions $P(T|\mathcal{X})$ and $P(\Theta)$. Using this construction, the problem of finding the maximum a posteriori tree of our posterior is reduced to that of finding the minimum cost solution graph on $\mathcal{G}_{\mathcal{X},\mathcal{Y}}$.

Definition 4 (BCART AND/OR graph $\mathcal{G}_{\mathcal{X},\mathcal{Y}}$).

Given a dataset $(\mathcal{X},\mathcal{Y})$, construct the AND/OR graph $\mathcal{G}_{\mathcal{X},\mathcal{Y}}$ as follows:

  1. For every possible subset $\mathcal{I} \subset [N]$ and depth $d \in \{0, \dots, F\}$, create an OR node $o_{\mathcal{I},d}$.

  2. For every OR node $o_{\mathcal{I},d}$ created in Step 1, create a terminal node $t_{\mathcal{I},d}$ and draw an edge from $o_{\mathcal{I},d}$ to $t_{\mathcal{I},d}$ with cost $\texttt{cost}(o_{\mathcal{I},d}, t_{\mathcal{I},d}) = -\log p_{\text{leaf}}(d, \mathcal{I}) - \log \ell_{\text{leaf}}(c^1(\mathcal{I}), c^0(\mathcal{I}))$.

  3. For every OR node $o_{\mathcal{I},d}$ created in Step 1, create $F$ AND nodes $a_{\mathcal{I},d,1}, \ldots, a_{\mathcal{I},d,F}$ and draw an edge from $o_{\mathcal{I},d}$ to each $a_{\mathcal{I},d,f}$ with cost $\texttt{cost}(o_{\mathcal{I},d}, a_{\mathcal{I},d,f}) = -\log p_{\text{inner}}(d)$.

  4. For every pair $a_{\mathcal{I},d,f}$ and $o_{\mathcal{I}',d+1}$ where $\mathcal{I}|_{f=k} = \mathcal{I}'$ for some $f \in [F]$ and $k \in \{0, 1\}$, draw an edge from $a_{\mathcal{I},d,f}$ to $o_{\mathcal{I}',d+1}$ with cost $\texttt{cost}(a_{\mathcal{I},d,f}, o_{\mathcal{I}',d+1}) = 0$.

  5. Let $o_{[n],0}$, the OR node representing all sample indices, be the unique root node $r$ of $\mathcal{G}_{\mathcal{X},\mathcal{Y}}$.

  6. Remove all OR nodes representing empty subsets and their neighbors.

  7. Remove all nodes not connected to the root node $r$.

We note that $\mathcal{G}_{\mathcal{X},\mathcal{Y}}$ contains $F \times 2^n$ OR nodes, $F \times 2^n$ terminal nodes (one for each OR node), and $F^2 \times 2^n$ AND nodes ($F$ for each OR node) and so is finite.
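The construction can be sketched in a few lines for binary features. The following is a minimal illustration, not the authors' implementation: the function name `build_reachable_graph` and the tuple encodings of nodes are hypothetical, terminal nodes are left implicit (one per OR node), and only nodes that survive Steps 6 and 7 are enumerated, so the counts are far below the worst-case bounds stated above.

```python
def build_reachable_graph(X):
    """Enumerate OR and AND nodes reachable from the root for binary features.

    X: list of equal-length tuples of 0/1 feature values.
    OR nodes are (index_set, depth); AND nodes are (index_set, depth, feature).
    Each OR node also has one implicit terminal child, which is not stored.
    """
    n, F = len(X), len(X[0])
    root = (frozenset(range(n)), 0)
    or_nodes, and_nodes, edges = {root}, set(), []
    stack = [root]
    while stack:
        I, d = stack.pop()
        for f in range(F):
            parts = [frozenset(i for i in I if X[i][f] == k) for k in (0, 1)]
            if not all(parts):
                continue  # Step 6: a split with an empty side is dropped
            a = (I, d, f)
            and_nodes.add(a)
            edges.append(((I, d), a))
            for part in parts:
                child = (part, d + 1)
                edges.append((a, child))
                if child not in or_nodes:  # shared subproblems are stored once
                    or_nodes.add(child)
                    stack.append(child)
    return root, or_nodes, and_nodes, edges


X = [(0, 0), (0, 1), (1, 0), (1, 1)]
root, or_nodes, and_nodes, edges = build_reachable_graph(X)
print(len(or_nodes), len(and_nodes), len(edges))  # 9 6 18
```

In this toy dataset each singleton subset at depth 2 is reached through two different depth-1 nodes but is stored only once, which is what makes the structure a graph rather than a tree.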

Intuitively, each OR node $o_{\mathcal{I},d}$ in $\mathcal{G}_{\mathcal{X},\mathcal{Y}}$ corresponds with the subproblem of discovering a maximum a posteriori subtree starting from depth $d$ and over the subset of samples $\mathcal{I}$ from dataset $\mathcal{X}, \mathcal{Y}$. Each AND node $a_{\mathcal{I},d,f}$ then represents the same subproblem but given that a decision was already made to split on feature $f$ at the root node of this subtree. A valid solution graph on $\mathcal{G}_{\mathcal{X},\mathcal{Y}}$ corresponds with a binary classification tree $T$ on the dataset $(\mathcal{X}, \mathcal{Y})$ and the value of a solution is related to the posterior probability of $T$ given by $P(T \mid \mathcal{Y}, \mathcal{X})$. We formalize these properties in Theorems 5 and 6.

Theorem 5.

Every solution graph on AND/OR graphs induces a unique binary decision tree. Furthermore, every decision tree can be represented as a unique solution graph under this correspondence. Thus, there is a natural bijection between solution graphs on $\mathcal{G}_{\mathcal{X},\mathcal{Y}}$ and binary decision trees.

Theorem 6.

Under the natural bijection described in Theorem 5, given a solution graph $\mathcal{S}$ and its corresponding tree $T$, we have that $\texttt{cost}(\mathcal{S}) = -\log P(T, \mathcal{Y} \mid \mathcal{X})$. Therefore the minimal cost solution over $\mathcal{G}_{\mathcal{X},\mathcal{Y}}$ corresponds with a maximum a posteriori tree.
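The step from the cost identity to the maximum a posteriori claim can be spelled out as follows. This is only a sketch of the reasoning, not the formal proof in Appendix A. It assumes that the tree prior factorizes into one $p_{\text{inner}}$ term per internal node and one $p_{\text{leaf}}$ term per leaf, and that the marginal likelihood factorizes into one $\ell_{\text{leaf}}$ term per leaf, which is how the edge costs in Definition 4 are laid out.

```latex
% Sketch only: the factorization of prior and likelihood is assumed.
% Edges from AND nodes to OR nodes contribute zero cost and are omitted.
\begin{align*}
\texttt{cost}(\mathcal{S})
  &= \sum_{(o_{\mathcal{I},d},\, a_{\mathcal{I},d,f}) \in \mathcal{S}} -\log p_{\text{inner}}(d)
   + \sum_{(o_{\mathcal{I},d},\, t_{\mathcal{I},d}) \in \mathcal{S}}
     \bigl[ -\log p_{\text{leaf}}(d, \mathcal{I})
            - \log \ell_{\text{leaf}}(c^1(\mathcal{I}), c^0(\mathcal{I})) \bigr] \\
  &= -\log P(T \mid \mathcal{X}) - \log P(\mathcal{Y} \mid \mathcal{X}, T) \\
  &= -\log P(T, \mathcal{Y} \mid \mathcal{X}), \\[4pt]
P(T \mid \mathcal{X}, \mathcal{Y})
  &= \frac{P(T, \mathcal{Y} \mid \mathcal{X})}{P(\mathcal{Y} \mid \mathcal{X})}
  \quad\Longrightarrow\quad
  \arg\min_{\mathcal{S}} \texttt{cost}(\mathcal{S})
  \;\text{corresponds with}\;
  \arg\max_{T} P(T \mid \mathcal{X}, \mathcal{Y}).
\end{align*}
```

The last implication holds because the normalizer $P(\mathcal{Y} \mid \mathcal{X})$ does not depend on $T$.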

The bijection constructed in Theorems 5 and 6 is depicted in Figure 3. Due to space constraints, we defer a formal description of this bijection to Appendix A.

Figure 3: Example map between an example solution of the AND/OR graph $\mathcal{G}_{\mathcal{X},\mathcal{Y}}$ depicted in Figure 2 and its corresponding binary classification tree. We see that the resulting tree is a stump which splits on feature $f_1$ at the root.
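To make the correspondence between solution graphs, trees, and costs concrete, the sketch below solves the minimum cost problem by exhaustive memoized recursion on a toy dataset. It is a brute-force illustration, not the MAPTree search. Several details are assumptions made for this example only: the split prior $\alpha (1 + d)^{-\beta}$ applied when at least one non-trivial split exists, a uniform choice among valid features for $p_{\text{inner}}$, a Beta(1, 1) leaf prior for $\ell_{\text{leaf}}$, and the hypothetical names `log_leaf` and `map_tree_exhaustive`.

```python
import math
from functools import lru_cache

ALPHA, BETA = 0.95, 0.5   # values reported in the paper's experiments
RHO = (1.0, 1.0)          # assumption: Beta(1, 1) prior on each leaf's label rate


def log_leaf(c1, c0):
    """Assumed Beta-Bernoulli log marginal likelihood of c1 ones and c0 zeros."""
    def log_beta(a, b):
        return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return log_beta(c1 + RHO[0], c0 + RHO[1]) - log_beta(*RHO)


def map_tree_exhaustive(X, y):
    """Return (cost, tree) of a minimum cost solution; tree is None for a leaf
    or (feature, subtree_for_0, subtree_for_1) for an internal node."""
    F = len(X[0])

    @lru_cache(maxsize=None)
    def solve(I, d):
        splits = []
        for f in range(F):
            I0 = frozenset(i for i in I if X[i][f] == 0)
            if I0 and I0 != I:
                splits.append((f, I0, I - I0))
        # Assumed priors: split with prob. alpha * (1 + d)^(-beta) when possible,
        # then choose uniformly among the valid features.
        p_split = ALPHA * (1 + d) ** (-BETA) if splits else 0.0
        c1 = sum(y[i] for i in I)
        # OR node, option 1: take the terminal edge (become a leaf).
        best = (-math.log(1 - p_split) - log_leaf(c1, len(I) - c1), None)
        # OR node, option 2: take one AND node, which pays for both children.
        for f, I0, I1 in splits:
            cost0, tree0 = solve(I0, d + 1)
            cost1, tree1 = solve(I1, d + 1)
            cost = -math.log(p_split / len(splits)) + cost0 + cost1
            if cost < best[0]:
                best = (cost, (f, tree0, tree1))
        return best

    return solve(frozenset(range(len(X))), 0)


cost, tree = map_tree_exhaustive([(0,), (0,), (1,), (1,)], [0, 0, 1, 1])
print(tree, round(cost, 4))  # (0, None, None) 2.2485
```

The returned nested tuple is the decision tree induced by the minimum cost solution graph: here a stump on feature 0, with cost $\log(9 / 0.95)$ under the assumed priors.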

5 MAPTree

Algorithm 1 MAPTree

Input: Root OR node $r$, cost function cost, and heuristic function $h$ for AND/OR graph $\mathcal{G}$
Output: Solution graph $\mathcal{S}$

1:  $\mathcal{G}' := \{r\}$
2:  $\mathcal{E} := \emptyset$
3:  $LB[r] := h(r)$
4:  $UB[r] := \infty$
5:  while $LB[r] < UB[r]$ and time remaining do
6:     $o :=$ findNodeToExpand($r$, cost, $\mathcal{E}$, $LB$, $UB$)
7:     Let $t$ be the terminal node child of $o$
8:     $\mathcal{E} := \mathcal{E} \cup \{o\}$
9:     $\mathcal{G}' := \mathcal{G}' \cup \{t\}$
10:     Let $\{a_1, \dots, a_F\}$ be the AND node children of $o$
11:     for all $a_f \in \{a_1, \dots, a_F\}$ do
12:        Let $\{o_{f=0}, o_{f=1}\}$ be the OR node children of $a_f$
13:        $LB[o_{f=0}] := h(o_{f=0})$
14:        $LB[o_{f=1}] := h(o_{f=1})$
15:        $v^{(lb)}_{f=0} = \texttt{cost}(a_f, o_{f=0}) + h(o_{f=0})$
16:        $v^{(lb)}_{f=1} = \texttt{cost}(a_f, o_{f=1}) + h(o_{f=1})$
17:        $LB[a_f] := v^{(lb)}_{f=0} + v^{(lb)}_{f=1}$
18:        $\mathcal{G}' := \mathcal{G}' \cup \{a_f, o_{f=0}, o_{f=1}\}$
19:     end for
20:     updateLowerBounds($o$, cost, $LB$)
21:     updateUpperBounds($o$, cost, $UB$)
22:  end while
23:  return getSolution($r$, cost, $UB$)

Algorithm 2 getSolution

Input: OR node $o$, cost function cost, and upper bounds $UB$
Output: Solution graph $\mathcal{S}$

1:  Let $\{a_1, \dots, a_F\}$ be the AND node children of $o$
2:  Let $t$ be the terminal node child of $o$
3:  $a_{f^*} := \arg\min_{c \in \{a_1, \dots, a_F\}} \left( \texttt{cost}(o, c) + UB[c] \right)$
4:  $\mathcal{S} := \{o\}$
5:  if $\texttt{cost}(o, t) + UB[t] \le \texttt{cost}(o, a_{f^*}) + UB[a_{f^*}]$ then
6:     $\mathcal{S} := \mathcal{S} \cup \{t\}$
7:  else
8:     Let $o_{f^*=0}, o_{f^*=1}$ be the children of $a_{f^*}$
9:     $\mathcal{S} := \mathcal{S} \cup \{a_{f^*}\}$
10:     $\mathcal{S}_0 :=$ getSolution($o_{f^*=0}$)
11:     $\mathcal{S}_1 :=$ getSolution($o_{f^*=1}$)
12:     $\mathcal{S} := \mathcal{S} \cup \mathcal{S}_0 \cup \mathcal{S}_1$
13:  end if
14:  return $\mathcal{S}$

Algorithm 3 findNodeToExpand

Input: Root node $r$, cost function cost, set of expanded nodes $\mathcal{E}$, lower bounds $LB$, and upper bounds $UB$
Output: OR node $o$

1:  $o := r$
2:  while $o \in \mathcal{E}$ do
3:     Let $\{a_1, \dots, a_F\}$ be the AND node children of $o$
4:     $a^* := \arg\min_{c \in \{a_1, \dots, a_F\}} \left( \texttt{cost}(o, c) + LB[c] \right)$
5:     Let $o^*_0, o^*_1$ be the children of $a^*$
6:     if $UB[o^*_0] - LB[o^*_0] > UB[o^*_1] - LB[o^*_1]$ then
7:        $o := o^*_0$
8:     else
9:        $o := o^*_1$
10:     end if
11:  end while
12:  return $o$

Algorithm 4 updateLowerBounds

Input: OR node $l$, cost function cost, lower bounds $LB$

1:  $\mathcal{V} = \{l\}$
2:  while $|\mathcal{V}| > 0$ do
3:     Remove a node $o$ from $\mathcal{V}$ with maximal depth
4:     Let $\{a_1, \dots, a_F\}$ be the AND node children of $o$
5:     Let $t$ be the terminal node child of $o$
6:     $v^{(lb)}_{\text{split}} = \min_{c \in \{a_1, \dots, a_F\}} \left( \texttt{cost}(o, c) + LB[c] \right)$
7:     $v^{(lb)} = \min\{v^{(lb)}_{\text{split}}, \texttt{cost}(o, t)\}$
8:     if $v^{(lb)} > LB[o]$ then
9:        $LB[o] := v^{(lb)}$
10:        Add all parents of $o$ to $\mathcal{V}$
11:     end if
12:  end while

Algorithm 5 updateUpperBounds

Input: OR node $l$, cost function cost, upper bounds $UB$

1:  $\mathcal{V} = \{l\}$
2:  while $|\mathcal{V}| > 0$ do
3:     Remove a node $o$ from $\mathcal{V}$ with maximal depth
4:     Let $\{a_1, \dots, a_F\}$ be the AND node children of $o$
5:     Let $t$ be the terminal node child of $o$
6:     $v^{(ub)}_{\text{split}} = \min_{c \in \{a_1, \dots, a_F\}} \left( \texttt{cost}(o, c) + UB[c] \right)$
7:     $v^{(ub)} = \min\{v^{(ub)}_{\text{split}}, \texttt{cost}(o, t)\}$
8:     if $v^{(ub)} < UB[o]$ then
9:        $UB[o] := v^{(ub)}$
10:        Add all parents of $o$ to $\mathcal{V}$
11:     end if
12:  end while

Theorems 5 and 6 imply that it is sufficient to find the minimum cost solution graph on $\mathcal{G}_{\mathcal{X},\mathcal{Y}}$ to recover the MAP tree under the BCART posterior. In this section, we introduce MAPTree, an AND/OR search algorithm that finds a minimal cost solution on $\mathcal{G}_{\mathcal{X},\mathcal{Y}}$. MAPTree is shown in Algorithm 1.

A key component of MAPTree is the Perfect Split Heuristic $h$ that guides the search, presented in Definition 7.

Definition 7 (Perfect Split Heuristic).

For OR node $o_{\mathcal{I},d}$ with terminal node child $t_{\mathcal{I},d}$, let

\begin{align}
h(o_{\mathcal{I},d}) &= -\max\{ \tag{10} \\
&\qquad \log \ell_{\text{leaf}}(c^1(\mathcal{I}), c^0(\mathcal{I})), \tag{11} \\
&\qquad \log p_{\text{split}}(d, \mathcal{I}) \tag{12} \\
&\qquad\quad + \log \ell_{\text{leaf}}(c^1(\mathcal{I}), 0) \tag{13} \\
&\qquad\quad + \log \ell_{\text{leaf}}(0, c^0(\mathcal{I})) \} \tag{14}
\end{align}

and for AND node $a_{\mathcal{I},d,f}$ with OR node children $o_{\mathcal{I}|_{f=0},d+1}$ and $o_{\mathcal{I}|_{f=1},d+1}$, let

\begin{align}
h(a_{\mathcal{I},d,f}) &= h(o_{\mathcal{I}|_{f=0},d+1}) + h(o_{\mathcal{I}|_{f=1},d+1}) \tag{15}
\end{align}

Intuitively, the Perfect Split Heuristic describes the negative log posterior probability of the best potential subtree rooted at the given OR node $o_{\mathcal{I},d}$: one that perfectly classifies the data in a single additional split. The heuristic guides the search away from subproblems that are too deep or for which the labels have already been poorly divided. We prove that this heuristic is a lower bound (admissible) and consistent in later sections.
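As a numerical illustration of Definition 7, the sketch below evaluates the heuristic for an OR node from its label counts and depth. The specific forms used here are assumptions for the example rather than definitions taken from this section: $\ell_{\text{leaf}}$ is taken to be a Beta-Bernoulli marginal likelihood with a Beta(1, 1) prior, and $p_{\text{split}}$ is taken to be $\alpha (1 + d)^{-\beta}$ when a split is available and zero otherwise. The function names are hypothetical.

```python
import math

ALPHA, BETA = 0.95, 0.5   # prior hyperparameters used in the paper's experiments
RHO = (1.0, 1.0)          # assumption: symmetric Beta(1, 1) prior on leaf label rates


def log_leaf(c1, c0):
    """Assumed Beta-Bernoulli log marginal likelihood of a leaf."""
    def log_beta(a, b):
        return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return log_beta(c1 + RHO[0], c0 + RHO[1]) - log_beta(*RHO)


def perfect_split_heuristic(c1, c0, depth, can_split=True):
    """h(o) for an OR node holding c1 positive and c0 negative samples."""
    p_split = ALPHA * (1 + depth) ** (-BETA) if can_split else 0.0
    stay_leaf = log_leaf(c1, c0)              # first argument of the max
    if p_split > 0:
        # One more split that separates the two classes perfectly.
        perfect = math.log(p_split) + log_leaf(c1, 0) + log_leaf(0, c0)
    else:
        perfect = -math.inf
    return -max(stay_leaf, perfect)


print(round(perfect_split_heuristic(5, 5, 0), 4))   # 3.6348
print(round(perfect_split_heuristic(5, 5, 3), 4))
print(round(perfect_split_heuristic(10, 0, 0), 4))  # 2.3979
```

Under these assumptions, a mixed node gets a larger heuristic value at depth 3 than at depth 0 because the split prior shrinks with depth, and an already pure node is scored by its own leaf likelihood since a further split cannot improve on it.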

5.1 Analysis of MAPTree

We now introduce several key properties of MAPTree. In particular, we show that (1) the Perfect Split Heuristic is consistent and therefore also admissible, (2) MAPTree finds the maximum a posteriori tree of the BCART posterior upon completion, and (3) upon early termination, MAPTree returns the minimum cost solution within the explored explicit graph $\mathcal{G}'$. Theorems 8-12 and Corollary 11 are proven in Appendix A.

Theorem 8 (Consistency of the Perfect Split Heuristic).

The Perfect Split Heuristic in Definition 7 is consistent, i.e., for any OR node $o$ with children $\{t, a_1, \dots, a_F\}$:

\begin{align}
h(o) &\le \min_{c \in \{t, a_1, \dots, a_F\}} \texttt{cost}(o, c) + h(c) \tag{16}
\end{align}

and for any AND node $a$ with children $\{o_0, o_1\}$:

\begin{align}
h(a) &\le \sum_{c \in \{o_0, o_1\}} \texttt{cost}(a, c) + h(c) \tag{17}
\end{align}

Theorem 9 (Finiteness of MAPTree).

Algorithm 1 always terminates.

Theorem 10 (Correctness of MAPTree).

When Algorithm 1 does not terminate early due to the time remaining condition, it always outputs a minimal cost solution on $\mathcal{G}_{\mathcal{X},\mathcal{Y}}$ upon completion.

Corollary 11.

Consider the tree induced by the output of Algorithm 1 under the natural bijection described in Section 4. By Theorems 5 and 6, this tree is the maximum a posteriori tree $\arg\max_T P(T \mid \mathcal{X}, \mathcal{Y})$.

Theorem 12 (Anytime optimality of MAPTree).

Upon early termination, Algorithm 1 outputs the minimal cost solution across the explicit subgraph $\mathcal{G}'$ of already explored nodes.
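The interplay of lower bounds, upper bounds, and node selection can be illustrated with a simplified best-first search. The sketch below is not Algorithm 1 as specified: instead of propagating bound updates to parents (Algorithms 4 and 5), it recomputes all bounds from the root after every expansion, which is slower but easier to check. The priors and leaf likelihood are assumed as stated in the comments, and all function names are hypothetical.

```python
import math

ALPHA, BETA, RHO = 0.95, 0.5, (1.0, 1.0)  # RHO: assumed Beta(1, 1) leaf prior


def log_leaf(c1, c0):
    def log_beta(a, b):
        return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return log_beta(c1 + RHO[0], c0 + RHO[1]) - log_beta(*RHO)


def map_tree_bounds(X, y, max_expansions=10_000):
    """Best-first AND/OR search; returns (lower, upper) bounds on the root cost."""
    F = len(X[0])

    def splits(I):  # valid (I0, I1) partitions of I, one per usable feature
        parts = [frozenset(i for i in I if X[i][f] == 0) for f in range(F)]
        return [(I0, I - I0) for I0 in parts if I0 and I0 != I]

    def p_split(I, d):  # assumed split prior
        return ALPHA * (1 + d) ** (-BETA) if splits(I) else 0.0

    def leaf_cost(I, d):  # cost(o, t)
        c1 = sum(y[i] for i in I)
        return -math.log(1 - p_split(I, d)) - log_leaf(c1, len(I) - c1)

    def h(I, d):  # Perfect Split Heuristic for an OR node
        c1 = sum(y[i] for i in I)
        c0, ps = len(I) - c1, p_split(I, d)
        perfect = math.log(ps) + log_leaf(c1, 0) + log_leaf(0, c0) if ps else -math.inf
        return -max(log_leaf(c1, c0), perfect)

    expanded = set()

    def and_value(I, d, I0, I1, lower, memo):  # cost(o, a_f) plus both child bounds
        inner = -math.log(p_split(I, d) / len(splits(I)))  # assumed uniform feature prior
        return inner + bound(I0, d + 1, lower, memo) + bound(I1, d + 1, lower, memo)

    def bound(I, d, lower, memo):
        if (I, d) not in memo:
            if (I, d) not in expanded:  # frontier: heuristic below, infinity above
                memo[I, d] = h(I, d) if lower else math.inf
            else:  # OR node: min over the terminal edge and all AND children
                memo[I, d] = min([leaf_cost(I, d)] + [
                    and_value(I, d, I0, I1, lower, memo) for I0, I1 in splits(I)])
        return memo[I, d]

    root = frozenset(range(len(X)))
    while True:
        lbs, ubs = {}, {}
        lo, hi = bound(root, 0, True, lbs), bound(root, 0, False, ubs)
        if not lo < hi or len(expanded) >= max_expansions:
            return lo, hi  # lo == hi certifies optimality
        I, d = root, 0
        while (I, d) in expanded:  # descend as in findNodeToExpand
            I0, I1 = min(splits(I),
                         key=lambda s: and_value(I, d, s[0], s[1], True, lbs))

            def gap(J):
                return bound(J, d + 1, False, ubs) - bound(J, d + 1, True, lbs)

            I = I0 if gap(I0) > gap(I1) else I1
            d += 1
        expanded.add((I, d))


lo, hi = map_tree_bounds([(0,), (0,), (1,), (1,)], [0, 0, 1, 1])
print(lo == hi, round(hi, 4))  # True 2.2485
```

When the loop stops because the two bounds meet, the upper bound is the cost of an actual solution and the lower bound shows that no cheaper one exists, which is the certificate of optimality discussed above. When it stops on the expansion budget instead, the returned upper bound is still the cost of the best solution found in the explored part of the graph.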

6 Experiments

We evaluate the performance of MAPTree in multiple settings. In all experiments in this section, we set $\alpha = 0.95$ and $\beta = 0.5$. We find that our results are not highly dependent on the choices of $\alpha$ and $\beta$; see Appendix B.

In the first setting, we compare the efficiency of MAPTree to the Sequential Monte Carlo (SMC) and Markov-Chain Monte Carlo (MCMC) baselines from Lakshminarayanan, Roy, and Teh (2013) and Chipman, George, and McCulloch (1998), respectively. In the second setting, we create a synthetic dataset in which the true labels are generated by a randomly generated tree and measure generalization performance with respect to training dataset size. In the third setting, we measure the generalization accuracy, log likelihood, and tree size of models generated by MAPTree and baseline algorithms across all 16 datasets from the CP4IM dataset repository (Guns, Nijssen, and De Raedt 2011).

6.1 Speed Comparisons against MCMC and SMC

We first compare the performance of MAPTree with the SMC and MCMC baselines from Lakshminarayanan, Roy, and Teh (2013) and Chipman, George, and McCulloch (1998), respectively, on all 16 binary classification datasets from the CP4IM dataset repository (Guns, Nijssen, and De Raedt 2011). We note that all three methods, given infinite exploration time, should recover the maximum a posteriori tree from the BCART posterior. However, it has been observed that the mixing times for Markov-Chain-based methods, such as the MCMC and SMC baselines, are exponential in the depth of the data-generating tree (Kim and Rockova 2023). Furthermore, the SMC and MCMC methods are unable to determine when they have converged, nor can they provide a certificate of optimality upon convergence.

In our experiments, we modify the hyperparameters of each algorithm and measure the training time and log posterior of the data under the output tree (Figure 4). In 12 of the 16 datasets in Figure 6, MAPTree outperforms SMC and MCMC and is able to find trees with higher log posterior faster than the baseline algorithms. Furthermore, in 5 of the 16 datasets, MAPTree converges to the provably optimal tree, i.e., the maximum a posteriori tree of the BCART posterior.

Figure 4: Comparison of MAPTree, SMC, and MCMC on 4 datasets (results for an additional 12 datasets are presented in the appendix). Curves are created by modifying the hyperparameters for each algorithm and measuring training time and log posterior of the data under the tree. Higher and further left is better, i.e., better log posteriors in less time. In 12 of the 16 datasets, MAPTree outperforms SMC and MCMC and is able to find trees with higher log posterior faster than the baseline algorithms. Furthermore, in 5 of the 16 datasets, MAPTree converges to the provably optimal tree, i.e., the MAP tree. 95% confidence intervals are derived by bootstrapping the results of 10 random seeds and time is averaged across the 10 seeds.
Figure 5: Box-and-whisker plot of stratified 10-fold relative test accuracy, relative per-sample log likelihood, and size of the output tree for MAPTree and various baseline algorithms for each of the 16 CP4IM datasets. Test accuracy and log likelihood are relative to that of CART (max depth 4). Higher is better for the left and center plots and lower is better for the right plot. Against all baseline algorithms, MAPTree either a) performs better in test accuracy or log likelihood, or b) performs comparably in test accuracy and log likelihood but produces smaller trees.

6.2 Fitting a Synthetic Dataset

We measure the generalization performance of MAPTree and various other baseline algorithms as a function of training dataset size on tree-generated data.

Synthetic Data: We construct a synthetic dataset where labels are generated by a randomly generated tree. We first construct a random binary tree structure as specified in Devroye and Kruszewski (1995) via recursive random divisions of the available internal nodes to the left or right subtree. Next, features are selected for each internal node uniformly at random such that no internal node splits on the same feature as its ancestors. Lastly, labels are assigned to the leaf nodes in alternating fashion so as to avoid compression of the underlying tree structure. Individual datapoints with 40 features are then sampled with each feature drawn i.i.d. from Ber$(1/2)$, and their labels are determined by following the generated tree to a leaf node. We repeat this process 20 times, generating 20 datasets for 20 random trees. We also randomly flip a fraction $\epsilon$ of the training data labels, with $\epsilon$ ranging from $0$ to $0.25$, to simulate label noise.
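The generation procedure above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used for the experiments: the function names, the dictionary-based tree representation, the default of 15 internal nodes, the uniform choice of how many internal nodes go to the left subtree, and the decision to flip exactly a fraction $\epsilon$ of the labels (rather than flipping each label independently) are all hypothetical choices made for the example.

```python
import random

def random_tree(n_internal, available, rng):
    """Build a random binary tree with `n_internal` internal nodes.

    `available` lists the features not yet used by any ancestor, so no
    node splits on the same feature as one of its ancestors.
    Assumes n_internal <= len(available).
    """
    if n_internal == 0:
        return {"leaf": True}
    feature = rng.choice(available)
    remaining = [f for f in available if f != feature]
    # Assumption: the remaining internal nodes are divided between the
    # left and right subtrees uniformly at random.
    n_left = rng.randint(0, n_internal - 1)
    return {
        "leaf": False,
        "feature": feature,
        "left": random_tree(n_left, remaining, rng),
        "right": random_tree(n_internal - 1 - n_left, remaining, rng),
    }

def assign_alternating_labels(tree, counter):
    """Label leaves 0, 1, 0, 1, ... from left to right, so that sibling
    leaves never share a label."""
    if tree["leaf"]:
        tree["label"] = counter[0] % 2
        counter[0] += 1
    else:
        assign_alternating_labels(tree["left"], counter)
        assign_alternating_labels(tree["right"], counter)

def predict(tree, x):
    """Follow the tree to a leaf: go right when the split feature is 1."""
    while not tree["leaf"]:
        tree = tree["right"] if x[tree["feature"]] else tree["left"]
    return tree["label"]

def make_dataset(n_samples, n_features=40, n_internal=15, eps=0.1, seed=0):
    """Sample Ber(1/2) features, label them with a random tree, and flip
    a fraction `eps` of the labels to simulate label noise."""
    rng = random.Random(seed)
    tree = random_tree(n_internal, list(range(n_features)), rng)
    assign_alternating_labels(tree, [0])
    X = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(n_samples)]
    y = [predict(tree, x) for x in X]
    for i in rng.sample(range(n_samples), int(round(eps * n_samples))):
        y[i] = 1 - y[i]
    return X, y, tree
```

Calling `make_dataset(200, eps=0.25, seed=1)` returns 200 datapoints with 40 binary features each, of which exactly 50 carry a label that disagrees with the generating tree.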

In our experiments, MAPTree generates trees which outperform both the greedy, top-down approaches and ODT methods in test accuracy for various training dataset sizes and values of label corruption proportion $\epsilon$; the results are presented in Figure 6 in Appendix B due to space constraints. We note that though some baseline algorithms demonstrate comparable performance at a single noise level, no baseline algorithm demonstrates test accuracy comparable to MAPTree across all noise levels. We also emphasize that MAPTree requires no hyperparameter tuning, whereas for the baseline algorithms whose performance was highly dependent on hyperparameter values (e.g., DL8.5 and GOSDT), we experimented with various hyperparameter values; see Appendix B.

6.3 Accuracy, Likelihood, and Size Comparisons on Real World Benchmarks

We also compare the accuracy, test log likelihood, and sizes of trees generated by MAPTree and baseline algorithms on 16 real world binary classification datasets from the CP4IM dataset repository (Guns, Nijssen, and De Raedt 2011) (Figure 5). Against all baseline algorithms, MAPTree either a) performs better in test accuracy or log likelihood, or b) performs comparably in test accuracy and log likelihood but produces smaller trees.

7 Discussion and Conclusions

We presented MAPTree, an algorithm which provably finds the maximum a posteriori tree of the BCART posterior for a given dataset. Our algorithm is inspired by best-first-search algorithms over AND/OR graphs and the observation that the search problem for trees can be framed as a search problem over an appropriately constructed AND/OR graph.

MAPTree outperforms thematically similar approaches such as SMC- and MCMC-based algorithms, finding higher log-posterior trees faster, and is able to determine when it has converged to the maximum a posteriori tree, unlike prior work. MAPTree also outperforms greedy, ODT, and ODST construction methods in test accuracy on the synthetic dataset constructed in Section 6. Furthermore, on many real world benchmark datasets, MAPTree either a) demonstrates better generalization performance, or b) demonstrates comparable generalization performance but with smaller trees.

A limitation of MAPTree is that it constructs a potentially large AND/OR graph, which consumes a significant amount of memory. We leave optimizations that may permit MAPTree to run on huge datasets to future work. Nonetheless, with the optimizations presented in Section 6, we find that MAPTree was performant enough to run on the CP4IM benchmark datasets used in evaluation of previous ODT benchmarks.

Acknowledgements

We would like to thank the anonymous reviewers and Area Chair for their reviews and helpful feedback.

M. T. was funded by a J.P. Morgan AI Fellowship, a Stanford Interdisciplinary Graduate Fellowship, a Stanford Data Science Scholarship, and an Oak Ridge Institute for Science and Engineering Fellowship.

References

  • Aglin, Nijssen, and Schaus (2020) Aglin, G.; Nijssen, S.; and Schaus, P. 2020. Learning Optimal Decision Trees Using Caching Branch-and-Bound Search. Proceedings of the AAAI Conference on Artificial Intelligence, 34(04): 3146–3153.
  • Bertsimas and Dunn (2017) Bertsimas, D.; and Dunn, J. 2017. Optimal classification trees. Machine Learning, 106.
  • Breiman (2001) Breiman, L. 2001. Random Forests. Machine Learning, 45(1): 5–32.
  • Breiman et al. (1984) Breiman, L.; Friedman, J. H.; Olshen, R. A.; and Stone, C. J. 1984. Classification and Regression Trees. 1 edition.
  • Chen and Guestrin (2016) Chen, T.; and Guestrin, C. 2016. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’16, 785–794. New York, NY, USA: Association for Computing Machinery. ISBN 978-1-4503-4232-2.
  • Chipman, George, and McCulloch (1998) Chipman, H. A.; George, E. I.; and McCulloch, R. E. 1998. Bayesian CART Model Search. Journal of the American Statistical Association, 93(443): 935–948. Publisher: Taylor & Francis.
  • Demirović et al. (2022) Demirović, E.; Lukina, A.; Hebrard, E.; Chan, J.; Bailey, J.; Leckie, C.; Ramamohanarao, K.; and Stuckey, P. J. 2022. MurTree: optimal decision trees via Dynamic programming and search. The Journal of Machine Learning Research, 23(1): 26:1169–26:1215.
  • Denison, Mallick, and Smith (1998) Denison, D. G. T.; Mallick, B. K.; and Smith, A. F. M. 1998. A Bayesian CART Algorithm. Biometrika, 85(3): 363–377.
  • Devroye and Kruszewski (1995) Devroye, L.; and Kruszewski, P. 1995. The Botanical Beauty of Random Binary Trees. In International Symposium Graph Drawing and Network Visualization.
  • Geels, Pratola, and Herbei (2022) Geels, V.; Pratola, M. T.; and Herbei, R. 2022. The Taxicab Sampler: MCMC for Discrete Spaces with Application to Tree Models. Journal of Statistical Computation and Simulation, 1–22. Publisher: Taylor & Francis.
  • Grinsztajn, Oyallon, and Varoquaux (2022) Grinsztajn, L.; Oyallon, E.; and Varoquaux, G. 2022. Why do Tree-Based Models Still Outperform Deep Learning on Tabular Data?
  • Guns, Nijssen, and De Raedt (2011) Guns, T.; Nijssen, S.; and De Raedt, L. 2011. Itemset mining: A constraint programming perspective. Artificial Intelligence, 175(12): 1951–1983.
  • Hu, Rudin, and Seltzer (2019) Hu, X.; Rudin, C.; and Seltzer, M. 2019. Optimal Sparse Decision Trees. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.
  • Hyafil and Rivest (1976) Hyafil, L.; and Rivest, R. L. 1976. Constructing optimal binary decision trees is NP-complete. Information Processing Letters, 5(1): 15–17.
  • Kim and Rockova (2023) Kim, J.; and Rockova, V. 2023. On Mixing Rates for Bayesian CART. ArXiv:2306.00126 [math, stat].
  • Kiossou et al. (2023) Kiossou, H.; Schaus, P.; Nijssen, S.; and Houndji, V. R. 2023. Time Constrained DL8.5 Using Limited Discrepancy Search. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2022, Grenoble, France, September 19–23, 2022, Proceedings, Part V, 443–459. Berlin, Heidelberg: Springer-Verlag. ISBN 978-3-031-26418-4.
  • Lakshminarayanan, Roy, and Teh (2013) Lakshminarayanan, B.; Roy, D. M.; and Teh, Y. W. 2013. Top-down particle filtering for Bayesian decision trees. In Proceedings of the 30th International Conference on Machine Learning - Volume 28, ICML’13, III–280–III–288. Atlanta, GA, USA: JMLR.org.
  • Lin et al. (2020) Lin, J.; Zhong, C.; Hu, D.; Rudin, C.; and Seltzer, M. 2020. Generalized and scalable optimal sparse decision trees. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of ICML’20, 6150–6160. JMLR.org.
  • Mahanti and Bagchi (1983) Mahanti, A.; and Bagchi, A. 1983. Admissible heuristic search in and/or graphs. Theoretical Computer Science, 24(2): 207–219. Publisher: Elsevier.
  • Mahanti and Bagchi (1985) Mahanti, A.; and Bagchi, A. 1985. AND/OR graph heuristic search methods. Journal of the ACM, 32(1): 28–51.
  • Nijssen (2008) Nijssen, S. 2008. Bayes optimal classification for decision trees. In Proceedings of the 25th international conference on Machine learning, ICML ’08, 696–703. New York, NY, USA: Association for Computing Machinery. ISBN 978-1-60558-205-4.
  • Nijssen and Fromont (2007) Nijssen, S.; and Fromont, E. 2007. Mining optimal decision trees from itemset lattices. In Knowledge discovery and data mining.
  • Pratola (2016) Pratola, M. T. 2016. Efficient Metropolis–Hastings Proposal Mechanisms for Bayesian Regression Tree Models. Bayesian Analysis, 11(3): 885–911. Publisher: International Society for Bayesian Analysis.
  • Quinlan (1986) Quinlan, J. R. 1986. Induction of Decision Trees. Machine Learning, 1(1): 81–106.
  • van der Linden, de Weerdt, and Demirović (2022) van der Linden, J.; de Weerdt, M.; and Demirović, E. 2022. Fair and optimal decision trees: A dynamic programming approach. In Koyejo, S.; Mohamed, S.; Agarwal, A.; Belgrave, D.; Cho, K.; and Oh, A., eds., Advances in neural information processing systems, volume 35, 38899–38911. Curran Associates, Inc.
  • Verhaeghe, Lecoutre, and Schaus (2018) Verhaeghe, H.; Lecoutre, C.; and Schaus, P. 2018. Compact-MDD: Efficiently Filtering (s)MDD Constraints with Reversible Sparse Bit-sets. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18, 1383–1389. International Joint Conferences on Artificial Intelligence Organization.
  • Verhaeghe et al. (2020) Verhaeghe, H.; Nijssen, S.; Pesant, G.; Quimper, C.-G.; and Schaus, P. 2020. Learning optimal decision trees using constraint programming. Constraints, 25(3): 226–250.
  • Verwer and Zhang (2019) Verwer, S.; and Zhang, Y. 2019. Learning optimal classification trees using a binary linear program formulation. In Proceedings of the thirty-third AAAI conference on artificial intelligence and thirty-first innovative applications of artificial intelligence conference and ninth AAAI symposium on educational advances in artificial intelligence, AAAI’19/IAAI’19/EAAI’19, article 200. Honolulu, Hawaii, USA: AAAI Press. ISBN 978-1-57735-809-1.

Appendix A Proofs of Theorems

In this section, we present the proofs of the theorems.

Proof of Theorem 1.

We note that this result was first presented in the original BCART paper for the more general classification setting (Chipman, George, and McCulloch 1998). We reproduce the proof below in the binary classification setting.

By the definition of a BDT $(T,\Theta)$, the tree $T$ partitions the data such that the sample subsets $\mathcal{I}(l_1), \mathcal{I}(l_2), \dots, \mathcal{I}(l_L)$ fall into leaves $l_1, l_2, \dots, l_L$ and each leaf contains an independent probability distribution Ber$(\theta_l)$ that governs the probability of a given label occurring in each respective leaf. Note that the probability of each label $y_i$ occurring is conditionally independent of the other elements of $\mathcal{Y}$ and the dataset $\mathcal{X}$ given its leaf $l$ (which is determined only from the tree structure $T$ and $x_i$) and the corresponding parameter $\theta_l$ (which is determined from the global parametrization $\Theta$ and $l$). Therefore

\begin{align}
P(\mathcal{Y}|\mathcal{X},T,\Theta) &= \prod_{j\in[N]} P(y_j|x_j,T,\Theta) \tag{18}\\
&= \prod_{l\in T_{\text{leaves}}}\prod_{i\in\mathcal{I}(l)} P(y_i|\theta_l) \tag{19}\\
&= \prod_{l\in T_{\text{leaves}}}\prod_{i\in\mathcal{I}(l)} \theta_l^{y_i}(1-\theta_l)^{1-y_i} \tag{20}\\
&= \prod_{l\in T_{\text{leaves}}} \theta_l^{c_l^1}(1-\theta_l)^{c_l^0} \tag{21}
\end{align}
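The step from the per-datapoint product in Eq. (20) to the per-leaf product in Eq. (21) only regroups factors by leaf, where $c_l^1$ and $c_l^0$ count the labels equal to 1 and 0 in leaf $l$. The short Python sketch below checks this numerically on a toy example; the function names, the `leaf_of` encoding (the leaf index reached by each datapoint), and the numbers are hypothetical and chosen only for illustration.

```python
import math

def likelihood_per_sample(leaf_of, y, theta):
    """Eq. (20): product over datapoints of theta_l^{y_i} (1 - theta_l)^{1 - y_i}."""
    p = 1.0
    for leaf, label in zip(leaf_of, y):
        p *= theta[leaf] if label == 1 else 1.0 - theta[leaf]
    return p

def likelihood_per_leaf(leaf_of, y, theta):
    """Eq. (21): product over leaves of theta_l^{c_l^1} (1 - theta_l)^{c_l^0}."""
    counts = {leaf: [0, 0] for leaf in theta}  # counts[leaf] = [c^0, c^1]
    for leaf, label in zip(leaf_of, y):
        counts[leaf][label] += 1
    p = 1.0
    for leaf, (c0, c1) in counts.items():
        p *= theta[leaf] ** c1 * (1.0 - theta[leaf]) ** c0
    return p

# Toy example: two leaves, five datapoints.
leaf_of = [0, 0, 1, 1, 1]       # leaf reached by each datapoint
y = [1, 0, 1, 1, 0]             # binary labels
theta = {0: 0.3, 1: 0.8}        # per-leaf Bernoulli parameters

assert math.isclose(
    likelihood_per_sample(leaf_of, y, theta),
    likelihood_per_leaf(leaf_of, y, theta),
)
```

In this example both functions evaluate $(0.3 \cdot 0.7) \cdot (0.8 \cdot 0.8 \cdot 0.2)$, so the likelihood depends on the data only through the per-leaf label counts.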

Proof of Theorem 2.

The likelihood of a tree $T$ generating labels $\mathcal{Y}$ given features $\mathcal{X}$ can be obtained by marginalizing over $\Theta$ using the prior $P(\Theta)$ given in Section 3:

\begin{align}
P(\mathcal{Y}|\mathcal{X},T) &= \int_{\Theta} P(\mathcal{Y}|\mathcal{X},T,\Theta)\,P(\Theta)\,d\Theta \tag{22}\\
&= \int_{\Theta}\prod_{l\in T_{\text{leaves}}}\left(\theta_l^{c_l^1}(1-\theta_l)^{c_l^0}\right)\times \tag{23}\\
&\qquad \left(\frac{\theta_l^{\rho^1-1}(1-\theta_l)^{\rho^0-1}}{B(\rho^1,\rho^0)}\right)d\Theta \tag{24}\\
&= \int_{\Theta}\prod_{l\in T_{\text{leaves}}}\frac{1}{B(\rho^1,\rho^0)}\times \tag{25}\\
&\qquad \left(\theta_l^{c_l^1+\rho^1-1}(1-\theta_l)^{c_l^0+\rho^0-1}\right)d\Theta \tag{26}\\
&= \prod_{l\in T_{\text{leaves}}}\frac{1}{B(\rho^1,\rho^0)}\times \tag{27}\\
&\qquad \int_{\theta_l}\left(\theta_l^{c_l^1+\rho^1-1}(1-\theta_l)^{c_l^0+\rho^0-1}\right)d\theta_l \tag{28}\\
&= \prod_{l\in T_{\text{leaves}}}\frac{B(c_l^1+\rho^1,\,c_l^0+\rho^0)}{B(\rho^1,\rho^0)} \tag{29}
\end{align}

where for the second equality we used Theorem 1 and the choice of prior $P(\Theta)$, and the definition of the Beta function $B(\rho^1,\rho^0)$ throughout. ∎
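The closed form in Eq. (29) is straightforward to evaluate in log space using the log-gamma function, since $B(a,b) = \Gamma(a)\Gamma(b)/\Gamma(a+b)$. The following Python sketch is one way to do so; the function names and the default prior parameters $\rho^1 = \rho^0 = 1$ are assumptions made for the example and are not taken from the paper.

```python
import math

def log_beta(a, b):
    """log B(a, b) computed via log-gamma for numerical stability."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def log_marginal_likelihood(leaf_counts, rho1=1.0, rho0=1.0):
    """Log of Eq. (29): sum over leaves of
    log B(c1 + rho1, c0 + rho0) - log B(rho1, rho0).

    `leaf_counts` is a list of (c1, c0) pairs, one per leaf, giving the
    number of datapoints with label 1 and label 0 in that leaf.
    """
    return sum(
        log_beta(c1 + rho1, c0 + rho0) - log_beta(rho1, rho0)
        for c1, c0 in leaf_counts
    )

# A single leaf holding one positive datapoint, under a uniform prior:
# B(2, 1) / B(1, 1) = 1/2.
single = math.exp(log_marginal_likelihood([(1, 0)]))

# Two leaves with counts (1, 1) and (2, 1):
# B(2, 2) * B(3, 2) = (1/6) * (1/12) = 1/72.
double = math.exp(log_marginal_likelihood([(1, 1), (2, 1)]))
```

Working with logarithms avoids underflow when leaves contain many datapoints, because each Beta function value shrinks rapidly as the counts grow.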

Proof of Theorem 5.

Our proof is by construction; we explicitly define the “natural” mapping both ways. We first show that any binary decision tree corresponds to a solution graph in $\mathcal{G}_{\mathcal{X},\mathcal{Y}}$ under a natural correspondence, which we explicitly construct. Given a binary decision tree $T$ with $m$ nodes (some of which may be internal and some of which are leaf nodes), we construct its solution graph $\mathcal{S} \subset \mathcal{G}_{\mathcal{X},\mathcal{Y}}$ explicitly.

Let $T_{\text{leaves}}$ and $T_{\text{internal}}$ denote the leaf and internal nodes of $T$, respectively. For node $n$ with depth $d(n)$ in $T$, denote by $\mathcal{I}(n)$ the datapoint indices for the datapoints which reach node $n$. Furthermore, if $n$ is an internal node of $T$, let $f(n)$ denote the feature on which $n$ is split into its children in $T$. Then let $\mathcal{S} = \{o_{\mathcal{I}(n),d(n)} : n \in T\} \cup \{t_{\mathcal{I}(n),d(n)} : n \in T_{\text{leaves}}\} \cup \{a_{\mathcal{I}(n),d(n),f(n)} : n \in T_{\text{internal}}\} \cup \{o_{\mathcal{I}(n)|_{f(n)=k},d(n)+1} : n \in T_{\text{internal}} \text{ and } k \in \{0,1\}\}$.

We will now show that this choice of $\mathcal{S}$ is a solution graph on $\mathcal{G}_{\mathcal{X},\mathcal{Y}}$. First note that the root node in $T$ has depth $0$ and must exist and contain the whole dataset, so $o_{[n],0} \in \mathcal{S}$. Now consider any AND node $a_{\mathcal{I}(n),d(n),f(n)} \in \mathcal{S}$. By construction, we must have that both $o_{\mathcal{I}(n)|_{f(n)=0},d(n)+1} \in \mathcal{S}$ and $o_{\mathcal{I}(n)|_{f(n)=1},d(n)+1} \in \mathcal{S}$, so all of the immediate children of $a_{\mathcal{I}(n),d(n),f(n)}$ must also be in $\mathcal{S}$. Finally, consider any OR node $o \in \mathcal{S}$. We must have one of three cases. Either:

  1. $o$ is of the form $o_{\mathcal{I}(n),d(n)}$ for $n \in T_{\text{leaves}}$, in which case $t_{\mathcal{I}(n),d(n)} \in \mathcal{S}$,

  2. $o$ is of the form $o_{\mathcal{I}(n),d(n)}$ for $n \in T_{\text{internal}}$, in which case $a_{\mathcal{I}(n),d(n),f(n)} \in \mathcal{S}$,

  3. $o$ is of the form $o_{\mathcal{I}(n)|_{f(n)=k},d(n)+1}$ for some node $n \in T_{\text{internal}}$ and some $k$. In this case, since $n \in T_{\text{internal}}$, $n$ must have children $n_k$ (for $k = 0, 1$) in $T$ obtained by splitting node $n$ on feature $f(n)$ and so we must have that $o$ is of the form $o_{\mathcal{I}(n_k),d(n_k)}$ for some $n_k \in T$. In this case, we can apply either Case 1 or Case 2 to $n_k$ to show that $o$ must have exactly one child in $\mathcal{S}$.

In all cases, each OR node $o \in \mathcal{S}$ must have exactly one child in $\mathcal{S}$. Thus, $\mathcal{S}$ is a solution graph on $\mathcal{G}_{\mathcal{X},\mathcal{Y}}$.

We will now show that every solution graph $\mathcal{S}$ defines a binary decision tree, and define this correspondence explicitly. For every OR node $o_{\mathcal{I},d} \in \mathcal{S}$, we create a corresponding node $n_o \in T$. Since the root node $r = o_{[n],0} \in \mathcal{S}$, we must have that $T$ is nonempty. Furthermore, by the definition of solution graph, we must have that for every OR node $o_{\mathcal{I},d} \in \mathcal{S}$, either $a_{\mathcal{I},d,f} \in \mathcal{S}$ for some value of $f$ or its corresponding terminal node $t_{\mathcal{I},d} \in \mathcal{S}$.
If $a_{\mathcal{I},d,f} \in \mathcal{S}$ for some value of $f$, then by the definition of solution graph over $\mathcal{G}_{\mathcal{X},\mathcal{Y}}$, we must have $o^L \coloneqq o_{\mathcal{I}|_{f=0},d+1} \in \mathcal{S}$ and $o^R \coloneqq o_{\mathcal{I}|_{f=1},d+1} \in \mathcal{S}$. In this case, we connect node $n_o$ to each of $n_{o^L}$ and $n_{o^R}$ in $T$ with a directed edge. (If $t_{\mathcal{I},d} \in \mathcal{S}$, then $o_{\mathcal{I},d} \in \mathcal{S}$ corresponds to a leaf node in $T$.)

We now show that this process gives rise to a binary decision tree, i.e., that $T$ is a binary decision tree. First, we note that by construction, any $n_o \in T$ that has an outgoing edge must have exactly two outgoing edges, say to $n_{o^L}$ and $n_{o^R}$. Furthermore, these edges exist if and only if the corresponding OR nodes in the solution graph $\mathcal{S}$ are connected through directed edges through an AND node $a_f$ for some $f$. In this case, the subsets of the data at $n_{o^L}$ and $n_{o^R}$ must correspond to the subset of the data at $n_o$ split by feature $f$. This implies that the subset of data that reaches $n_{o^L}$ and $n_{o^R}$ in $T$ must be a strict subset of the data that reaches $n_o$.
Furthermore, we note that every node in $T$ must be reachable from the root node of $T$, $n_r$ (which corresponds to the start node in $\mathcal{S}$), so $T$ is connected. Together, these observations imply that $T$ is a binary decision tree.

Finally, we note that these two constructions are inverses. Any binary decision tree $T$ can be used to induce a solution graph $\mathcal{S}$ over $\mathcal{G}_{\mathcal{X},\mathcal{Y}}$, and $\mathcal{S}$ in turn induces a binary decision tree equivalent to $T$. This proves the claim. ∎
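The round trip between trees and solution graphs can be illustrated with a minimal sketch. The representation below is an assumption made for illustration and is not taken from the paper's code: an OR node $o_{\mathcal{I},d}$ is keyed by the pair (index set, depth), and a solution graph is stored as a dictionary mapping each OR node to its chosen child, either `None` for the terminal node $t_{\mathcal{I},d}$ or a feature index `f` for the AND node $a_{\mathcal{I},d,f}$. The names `Tree`, `tree_to_solution`, and `solution_to_tree` are hypothetical, and the sketch assumes binary features and splits whose children reach distinct data subsets.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Tree:
    """Binary decision tree: a leaf if `feature` is None, otherwise a split."""
    feature: Optional[int] = None
    left: Optional["Tree"] = None    # subtree for feature value 0
    right: Optional["Tree"] = None   # subtree for feature value 1


def split(idx, X, f):
    """Partition the index set by the value of binary feature f."""
    lo = frozenset(i for i in idx if X[i][f] == 0)
    return lo, idx - lo


def tree_to_solution(tree, X, idx=None, depth=0, sol=None):
    """Record, for every OR node (idx, depth), the child chosen by the tree."""
    if idx is None:
        idx = frozenset(range(len(X)))
    if sol is None:
        sol = {}
    sol[(idx, depth)] = tree.feature  # None = terminal node, f = AND node
    if tree.feature is not None:
        lo, hi = split(idx, X, tree.feature)
        tree_to_solution(tree.left, X, lo, depth + 1, sol)
        tree_to_solution(tree.right, X, hi, depth + 1, sol)
    return sol


def solution_to_tree(sol, X, idx=None, depth=0):
    """Rebuild the tree by following the chosen child of each OR node."""
    if idx is None:
        idx = frozenset(range(len(X)))
    f = sol[(idx, depth)]
    if f is None:
        return Tree()
    lo, hi = split(idx, X, f)
    return Tree(f,
                solution_to_tree(sol, X, lo, depth + 1),
                solution_to_tree(sol, X, hi, depth + 1))


# Toy data with two binary features; split on feature 0, then on feature 1.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
tree = Tree(0, Tree(1, Tree(), Tree()), Tree())
solution = tree_to_solution(tree, X)
recovered = solution_to_tree(solution, X)
```

In this toy example `recovered == tree`, which mirrors the statement that the two constructions are inverses of each other.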

Proof of Theorem 6.

Let $L_O$ and $I_O$ denote the sets of OR nodes corresponding to leaf nodes and internal nodes, respectively, in the tree represented by solution $\mathcal{S}$. Then the cost of a solution graph $\mathcal{S}$ over $\mathcal{G}_{\mathcal{X},\mathcal{Y}}$ is

$$-\sum_{l \in L_O} \log p_{\text{stop}}(\mathcal{X}|_{l}, \mathcal{Y}|_{l}, d(l)) \;-\; \sum_{m \in I_O} \log p_{\text{node}}(d(m))$$

where for any node with depth $d$, we have

$$p_{\text{node}} \coloneqq \begin{cases} p_{\text{split}}(d), & \text{if node is an internal node} \\ p_{\text{stop}}(d, \mathcal{X}_o) & \text{if node is a leaf node} \end{cases} \tag{30}$$

by our choices of $p_{\text{split}}$ and $p_{\text{stop}}$ in Section 4. We can further simplify this to

\begin{align*}
&-\sum_{L_O} \log P(\mathcal{Y}|_{\text{leaf}} \mid \mathcal{X}|_{\text{leaf}}, T) \;-\; \sum_{I_O} \log P(T \mid \mathcal{X}) \\
&\qquad = -\log P(\mathcal{Y}, T \mid \mathcal{X})
\end{align*}

by the definition of BDT.

Lemma 13.

For the leaf likelihood given in Equation 4, we have that $\ell_{\text{leaf}}(a,0)\,\ell_{\text{leaf}}(0,b) \geq \ell_{\text{leaf}}(a,b)$ for integers $a, b \geq 0$.

Proof of Lemma 13.

If either $a$ or $b$ is $0$, then the claim is trivially true as $\ell_{\text{leaf}}(0,0) = 1$. Therefore, we assume both $a$ and $b$ are greater than $0$. Using the facts that $B(x,y) = \frac{\Gamma(x)\Gamma(y)}{\Gamma(x+y)}$, where $\Gamma$ is the gamma function, and $\Gamma(x+1) = x\Gamma(x)$, we have that:

\begin{align*}
& \ell_{\text{leaf}}(a,0)\,\ell_{\text{leaf}}(0,b) \stackrel{?}{\geq} \ell_{\text{leaf}}(a,b) \\
\iff\ & \frac{B(a+\rho_0,\rho_1)\,B(\rho_0,b+\rho_1)}{B(\rho_0,\rho_1)\,B(\rho_0,\rho_1)} \stackrel{?}{\geq} \frac{B(a+\rho_0,b+\rho_1)}{B(\rho_0,\rho_1)} \\
\iff\ & \frac{\left(\frac{\Gamma(a+\rho_0)\Gamma(\rho_1)}{\Gamma(a+\rho_0+\rho_1)}\right)\left(\frac{\Gamma(\rho_0)\Gamma(b+\rho_1)}{\Gamma(b+\rho_0+\rho_1)}\right)}{\left(\frac{\Gamma(\rho_0)\Gamma(\rho_1)}{\Gamma(\rho_0+\rho_1)}\right)\left(\frac{\Gamma(\rho_0)\Gamma(\rho_1)}{\Gamma(\rho_0+\rho_1)}\right)} \stackrel{?}{\geq} \frac{\frac{\Gamma(a+\rho_0)\Gamma(b+\rho_1)}{\Gamma(a+b+\rho_0+\rho_1)}}{\frac{\Gamma(\rho_0)\Gamma(\rho_1)}{\Gamma(\rho_0+\rho_1)}} \\
\iff\ & \Gamma(\rho_0+\rho_1)\,\Gamma(a+b+\rho_0+\rho_1) \stackrel{?}{\geq} \Gamma(a+\rho_0+\rho_1)\,\Gamma(b+\rho_0+\rho_1) \\
\iff\ & \left(\prod_{i=0}^{a+b-1}(i+\rho_0+\rho_1)\right)\Gamma(\rho_0+\rho_1)^2 \stackrel{?}{\geq} \left(\prod_{i=0}^{a-1}(i+\rho_0+\rho_1)\right)\left(\prod_{i=0}^{b-1}(i+\rho_0+\rho_1)\right)\Gamma(\rho_0+\rho_1)^2 \\
\iff\ & \prod_{i=a}^{a+b-1}(i+\rho_0+\rho_1) \stackrel{?}{\geq} \prod_{i=0}^{b-1}(i+\rho_0+\rho_1)
\end{align*}

Since $\rho_0, \rho_1 > 0$ and $a > 0$, every factor in both products is positive, and the $j$-th factor on the LHS, $(a + j + \rho_0 + \rho_1)$, is at least the $j$-th factor on the RHS, $(j + \rho_0 + \rho_1)$, for $j = 0, \dots, b-1$. Hence LHS $\geq$ RHS, which proves the claim. ∎
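As a sanity check, Lemma 13 can be verified numerically. The sketch below assumes the form $\ell_{\text{leaf}}(a,b) = B(a+\rho_0, b+\rho_1)/B(\rho_0,\rho_1)$ used in the derivation above and works in log space with `math.lgamma` to avoid overflow. The function names and the grid of prior parameters are illustrative choices, not values from the paper.

```python
import math


def log_beta(x, y):
    """log B(x, y) = log Gamma(x) + log Gamma(y) - log Gamma(x + y)."""
    return math.lgamma(x) + math.lgamma(y) - math.lgamma(x + y)


def log_leaf(a, b, rho0, rho1):
    """log of l_leaf(a, b) = B(a + rho0, b + rho1) / B(rho0, rho1)."""
    return log_beta(a + rho0, b + rho1) - log_beta(rho0, rho1)


def lemma13_holds(a, b, rho0, rho1, tol=1e-9):
    """Check l_leaf(a, 0) * l_leaf(0, b) >= l_leaf(a, b) in log space."""
    lhs = log_leaf(a, 0, rho0, rho1) + log_leaf(0, b, rho0, rho1)
    rhs = log_leaf(a, b, rho0, rho1)
    return lhs >= rhs - tol  # small tolerance for floating-point rounding


# Illustrative (hypothetical) prior parameters and count ranges.
priors = [(0.5, 0.5), (1.0, 1.0), (2.5, 1.0), (3.0, 7.0)]
lemma13_ok = all(
    lemma13_holds(a, b, rho0, rho1)
    for rho0, rho1 in priors
    for a in range(30)
    for b in range(30)
)
```

For the uniform prior $\rho_0 = \rho_1 = 1$ and $a = b = 1$, the two sides are $\ell_{\text{leaf}}(1,0)\,\ell_{\text{leaf}}(0,1) = \tfrac{1}{2}\cdot\tfrac{1}{2} = \tfrac{1}{4}$ and $\ell_{\text{leaf}}(1,1) = B(2,2) = \tfrac{1}{6}$, consistent with the lemma.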

Lemma 14.

For the leaf likelihood given in Definition 4, we have that $\ell_{\text{leaf}}(a+b,0) \geq \ell_{\text{leaf}}(a,0)\,\ell_{\text{leaf}}(b,0)$ and $\ell_{\text{leaf}}(0,a+b) \geq \ell_{\text{leaf}}(0,a)\,\ell_{\text{leaf}}(0,b)$ for integers $a, b \geq 0$.

Proof of Lemma 14.

If either $a$ or $b$ is $0$, then the claim is trivially true as $\ell_{\text{leaf}}(0,0) = 1$. Therefore, we assume both $a$ and $b$ are greater than $0$. Using the facts that $B(x,y) = \frac{\Gamma(x)\Gamma(y)}{\Gamma(x+y)}$ and $\Gamma(x+1) = x\Gamma(x)$, we have that:

\begin{align*}
& \ell_{\text{leaf}}(a+b,0) \stackrel{?}{\geq} \ell_{\text{leaf}}(a,0)\,\ell_{\text{leaf}}(b,0) \\
\iff\ & \frac{B(a+b+\rho_0,\rho_1)}{B(\rho_0,\rho_1)} \stackrel{?}{\geq} \frac{B(a+\rho_0,\rho_1)\,B(b+\rho_0,\rho_1)}{B(\rho_0,\rho_1)\,B(\rho_0,\rho_1)} \\
\iff\ & B(\rho_0,\rho_1)\,B(a+b+\rho_0,\rho_1) \stackrel{?}{\geq} B(a+\rho_0,\rho_1)\,B(b+\rho_0,\rho_1) \\
\iff\ & \frac{\Gamma(\rho_0)\Gamma(\rho_1)}{\Gamma(\rho_0+\rho_1)} \cdot \frac{\Gamma(a+b+\rho_0)\Gamma(\rho_1)}{\Gamma(a+b+\rho_0+\rho_1)} \stackrel{?}{\geq} \frac{\Gamma(a+\rho_0)\Gamma(\rho_1)}{\Gamma(a+\rho_0+\rho_1)} \cdot \frac{\Gamma(b+\rho_0)\Gamma(\rho_1)}{\Gamma(b+\rho_0+\rho_1)} \\
\iff\ & \Gamma(\rho_0)\,\Gamma(a+b+\rho_0)\,\Gamma(a+\rho_0+\rho_1)\,\Gamma(b+\rho_0+\rho_1) \stackrel{?}{\geq} \Gamma(\rho_0+\rho_1)\,\Gamma(a+b+\rho_0+\rho_1)\,\Gamma(a+\rho_0)\,\Gamma(b+\rho_0) \\
\iff\ & \frac{\Gamma(\rho_0)\,\Gamma(a+b+\rho_0)}{\Gamma(a+\rho_0)\,\Gamma(b+\rho_0)} \stackrel{?}{\geq} \frac{\Gamma(\rho_0+\rho_1)\,\Gamma(a+b+\rho_0+\rho_1)}{\Gamma(a+\rho_0+\rho_1)\,\Gamma(b+\rho_0+\rho_1)} \\
\iff\ & \frac{\prod_{i=0}^{a+b-1}(i+\rho_0)}{\left(\prod_{i=0}^{b-1}(i+\rho_0)\right)\left(\prod_{i=0}^{a-1}(i+\rho_0)\right)} \stackrel{?}{\geq} \frac{\prod_{i=0}^{a+b-1}(i+\rho_0+\rho_1)}{\left(\prod_{i=0}^{b-1}(i+\rho_0+\rho_1)\right)\left(\prod_{i=0}^{a-1}(i+\rho_0+\rho_1)\right)} \\
\iff\ & \prod_{i=0}^{b-1}\left(\frac{a+i+\rho_0}{i+\rho_0}\right) \stackrel{?}{\geq} \prod_{i=0}^{b-1}\left(\frac{a+i+\rho_0+\rho_1}{i+\rho_0+\rho_1}\right) \\
\iff\ & \prod_{i=0}^{b-1}\left(1+\frac{a}{i+\rho_0}\right) \stackrel{?}{\geq} \prod_{i=0}^{b-1}\left(1+\frac{a}{i+\rho_0+\rho_1}\right)
\end{align*}

Since $\rho_0, \rho_1 > 0$ and $a > 0$, each term in the product on the LHS is positive and greater than or equal to the corresponding term on the RHS, so we have that LHS $\geq$ RHS. This proves that $\ell_{\text{leaf}}(a+b,0) \geq \ell_{\text{leaf}}(a,0)\,\ell_{\text{leaf}}(b,0)$ for integers $a, b \geq 0$. The proof that $\ell_{\text{leaf}}(0,a+b) \geq \ell_{\text{leaf}}(0,a)\,\ell_{\text{leaf}}(0,b)$ for integers $a, b \geq 0$ follows from the symmetry of $\ell_{\text{leaf}}$ in its arguments. ∎
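Both inequalities of Lemma 14 can likewise be checked numerically. This is a minimal sketch under the same assumption as the derivation above, namely $\ell_{\text{leaf}}(a,b) = B(a+\rho_0, b+\rho_1)/B(\rho_0,\rho_1)$. The helper names and the tested parameter grid are hypothetical.

```python
import math


def log_beta(x, y):
    """log B(x, y) computed from log-gamma values."""
    return math.lgamma(x) + math.lgamma(y) - math.lgamma(x + y)


def log_leaf(a, b, rho0, rho1):
    """log of l_leaf(a, b) = B(a + rho0, b + rho1) / B(rho0, rho1)."""
    return log_beta(a + rho0, b + rho1) - log_beta(rho0, rho1)


def lemma14_holds(a, b, rho0, rho1, tol=1e-9):
    """Check both inequalities of Lemma 14 in log space."""
    # l_leaf(a + b, 0) >= l_leaf(a, 0) * l_leaf(b, 0)
    first = (log_leaf(a + b, 0, rho0, rho1)
             >= log_leaf(a, 0, rho0, rho1) + log_leaf(b, 0, rho0, rho1) - tol)
    # l_leaf(0, a + b) >= l_leaf(0, a) * l_leaf(0, b)
    second = (log_leaf(0, a + b, rho0, rho1)
              >= log_leaf(0, a, rho0, rho1) + log_leaf(0, b, rho0, rho1) - tol)
    return first and second


# Illustrative (hypothetical) prior parameters and count ranges.
priors = [(0.5, 0.5), (1.0, 1.0), (2.5, 1.0), (3.0, 7.0)]
lemma14_ok = all(
    lemma14_holds(a, b, rho0, rho1)
    for rho0, rho1 in priors
    for a in range(30)
    for b in range(30)
)
```

With $\rho_0 = \rho_1 = 1$ and $a = b = 1$, the first inequality reads $\ell_{\text{leaf}}(2,0) = B(3,1) = \tfrac{1}{3} \geq \ell_{\text{leaf}}(1,0)^2 = \tfrac{1}{4}$: a pure leaf holding all the points is at least as likely as two pure leaves that split them.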

Proof of Theorem 8.

First, we show that $h$ is consistent across any AND node $a_{\mathcal{I},d,f}$ with children $o_{\mathcal{I}|_{f=0},d+1}$ and $o_{\mathcal{I}|_{f=1},d+1}$. This follows directly from the definition of $h$:

\begin{align}
h(a_{\mathcal{I},d,f}) ={}& \tag{31} \\
& \texttt{cost}(a_{\mathcal{I},d,f}, o_{\mathcal{I}|_{f=0},d+1}) \tag{32} \\
& + h(o_{\mathcal{I}|_{f=0},d+1}) \tag{33} \\
& + \texttt{cost}(a_{\mathcal{I},d,f}, o_{\mathcal{I}|_{f=1},d+1}) \tag{34} \\
& + h(o_{\mathcal{I}|_{f=1},d+1}) \tag{35}
\end{align}

Next, we show that $h$ is consistent for any OR node $o_{\mathcal{I},d}$ with children $t_{\mathcal{I},d}, a_{\mathcal{I},d,1}, \dots, a_{\mathcal{I},d,F}$.
Case 1:

$$\ell_{\text{leaf}}(c^1(\mathcal{I}), c^0(\mathcal{I})) \geq p_{\text{split}}(d)\,\ell_{\text{leaf}}(c^1(\mathcal{I}), 0)\,\ell_{\text{leaf}}(0, c^0(\mathcal{I})) \tag{36}$$

In this case, we see that the heuristic is consistent for the terminal-node child $t_{\mathcal{I},d}$ of $o_{\mathcal{I},d}$:

\begin{align}
h(o_{\mathcal{I},d}) &= -\log \ell_{\text{leaf}}(c^1(\mathcal{I}), c^0(\mathcal{I})) \tag{37} \\
&= \texttt{cost}(o_{\mathcal{I},d}, t_{\mathcal{I},d}) \tag{38}
\end{align}

We also have the following:

\begin{align}
h(o_{\mathcal{I},d}) &\leq \tag{39} \\
&\quad -\log p_{\text{split}}(d) \tag{40} \\
&\quad -\log \ell_{\text{leaf}}(c^1(\mathcal{I}), 0) \tag{41} \\
&\quad -\log \ell_{\text{leaf}}(0, c^0(\mathcal{I})) \tag{42}
\end{align}

It remains to show that the heuristic is consistent for all AND node children of $o_{\mathcal{I},d}$: $a_{\mathcal{I},d,1}, \dots, a_{\mathcal{I},d,F}$.

Case 2:

\begin{align}
\ell_{\text{leaf}}(c^1(\mathcal{I}), c^0(\mathcal{I})) < p_{\text{split}}(d)\, \ell_{\text{leaf}}(c^1(\mathcal{I}), 0)\, \ell_{\text{leaf}}(0, c^0(\mathcal{I})) \tag{43}
\end{align}

In this case, we see that the heuristic is again consistent for the terminal node child $t_{\mathcal{I},d}$ of $o_{\mathcal{I},d}$:

\begin{align}
h(o_{\mathcal{I},d}) &= \tag{44} \\
&\quad -\log p_{\text{split}}(d) \tag{45} \\
&\quad -\log \ell_{\text{leaf}}(c^1(\mathcal{I}), 0) \tag{46} \\
&\quad -\log \ell_{\text{leaf}}(0, c^0(\mathcal{I})) \tag{47} \\
&\leq -\log \ell_{\text{leaf}}(c^1(\mathcal{I}), c^0(\mathcal{I})) \tag{48} \\
&= \texttt{cost}(o_{\mathcal{I},d}, t_{\mathcal{I},d}) \tag{49}
\end{align}
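As an illustration of the two cases (a sketch under stated assumptions, not the authors' implementation), the following Python snippet evaluates the heuristic as the smaller of the two negative-log quantities compared in Cases 1 and 2 and reports which case applies. The closed forms used here are assumptions, since the definitions lie outside this part of the proof: $\ell_{\text{leaf}}$ is taken to be a Beta-Binomial marginal likelihood with pseudo-counts `rho1` and `rho0`, and $p_{\text{split}}(d)$ is taken to be `alpha * (1 + d) ** (-beta)`. All function names and default values are hypothetical.

```python
import math


def log_leaf(c1, c0, rho1=2.5, rho0=2.5):
    """log l_leaf(c1, c0); ASSUMED form B(c1 + rho1, c0 + rho0) / B(rho1, rho0)."""
    def log_beta(a, b):
        return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return log_beta(c1 + rho1, c0 + rho0) - log_beta(rho1, rho0)


def log_p_split(d, alpha=0.95, beta=0.5):
    """log p_split(d); ASSUMED form alpha * (1 + d) ** (-beta)."""
    return math.log(alpha) - beta * math.log(1 + d)


def heuristic_with_case(c1, c0, d):
    """Return (h, stop_cost, case) for an OR node with label counts c1, c0 at depth d."""
    stop_cost = -log_leaf(c1, c0)  # cost(o, t), as in Eqs. (38) and (49)
    perfect_split = -log_p_split(d) - log_leaf(c1, 0) - log_leaf(0, c0)
    # Case 1 holds when l_leaf(c1, c0) >= p_split(d) * l_leaf(c1, 0) * l_leaf(0, c0).
    case = 1 if stop_cost <= perfect_split else 2
    return min(stop_cost, perfect_split), stop_cost, case


for c1, c0 in [(1, 0), (3, 2), (20, 20)]:
    h, stop_cost, case = heuristic_with_case(c1, c0, d=0)
    print(f"c1={c1:2d} c0={c0:2d} case={case} h={h:.3f} cost(o,t)={stop_cost:.3f}")
```

In both cases the returned `h` never exceeds `stop_cost`, which is the consistency condition for the terminal node child.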

For the AND node children, we begin with the following:

\begin{align}
h(o_{\mathcal{I},d}) &\leq \tag{50} \\
&\quad -\log p_{\text{split}}(d) \tag{51} \\
&\quad -\log \ell_{\text{leaf}}(c^1(\mathcal{I}), 0) \tag{52} \\
&\quad -\log \ell_{\text{leaf}}(0, c^0(\mathcal{I})) \tag{53}
\end{align}

As in Case 1, it remains to show that the heuristic is consistent for all AND node children of $o_{\mathcal{I},d}$: $a_{\mathcal{I},d,1}, \dots, a_{\mathcal{I},d,F}$.

We will now show, for both Cases 1 and 2, that from this inequality, it follows that the heuristic is consistent across all AND node children of $o_{\mathcal{I},d}$: $a_{\mathcal{I},d,1}, \dots, a_{\mathcal{I},d,F}$.

Applying Lemmas 13 and 14 and the symmetry of $\ell_{\text{leaf}}$, we have:

\begin{align}
h(o_{\mathcal{I},d}) &\leq \tag{54} \\
&\quad -\log p_{\text{split}}(d) \tag{55} \\
&\quad -\log \ell_{\text{leaf}}(c^1(\mathcal{I}), 0) \tag{56} \\
&\quad -\log \ell_{\text{leaf}}(0, c^0(\mathcal{I})) \tag{57} \\
&\leq \tag{58} \\
&\quad -\log p_{\text{split}}(d) \tag{59} \\
&\quad -\log \left[ \ell_{\text{leaf}}(c^1(\mathcal{I}|_{f=0}), 0)\, \ell_{\text{leaf}}(c^1(\mathcal{I}|_{f=1}), 0) \right] \tag{60} \\
&\quad -\log \left[ \ell_{\text{leaf}}(0, c^0(\mathcal{I}|_{f=0}))\, \ell_{\text{leaf}}(0, c^0(\mathcal{I}|_{f=1})) \right] \tag{61} \\
&= \tag{62} \\
&\quad -\log p_{\text{split}}(d) \tag{63} \\
&\quad -\log \left[ \ell_{\text{leaf}}(c^1(\mathcal{I}|_{f=0}), 0)\, \ell_{\text{leaf}}(0, c^0(\mathcal{I}|_{f=0})) \right] \tag{64} \\
&\quad -\log \left[ \ell_{\text{leaf}}(c^1(\mathcal{I}|_{f=1}), 0)\, \ell_{\text{leaf}}(0, c^0(\mathcal{I}|_{f=1})) \right] \tag{65} \\
&\leq \tag{66} \\
&\quad -\log p_{\text{split}}(d) \tag{67} \\
&\quad -\log \ell_{\text{leaf}}(c^1(\mathcal{I}|_{f=0}), c^0(\mathcal{I}|_{f=0})) \tag{68} \\
&\quad -\log \ell_{\text{leaf}}(c^1(\mathcal{I}|_{f=1}), c^0(\mathcal{I}|_{f=1})) \tag{69}
\end{align}

Applying Lemma 14, we have:

\begin{align}
h(o_{\mathcal{I},d}) &\leq \tag{70} \\
&\quad -\log p_{\text{split}}(d) \tag{71} \\
&\quad -\log \ell_{\text{leaf}}(c^1(\mathcal{I}), 0) \tag{72} \\
&\quad -\log \ell_{\text{leaf}}(0, c^0(\mathcal{I})) \tag{73} \\
&\leq \tag{74} \\
&\quad -\log p_{\text{split}}(d) \tag{75} \\
&\quad -\log \left[ \ell_{\text{leaf}}(c^1(\mathcal{I}|_{f=0}), 0)\, \ell_{\text{leaf}}(c^1(\mathcal{I}|_{f=1}), 0) \right] \tag{76} \\
&\quad -\log \left[ \ell_{\text{leaf}}(0, c^0(\mathcal{I}|_{f=0}))\, \ell_{\text{leaf}}(0, c^0(\mathcal{I}|_{f=1})) \right] \tag{77} \\
&\leq \tag{78} \\
&\quad -\log p_{\text{split}}(d) \tag{79} \\
&\quad -2 \log p_{\text{split}}(d+1) \tag{80} \\
&\quad -\log \ell_{\text{leaf}}(c^1(\mathcal{I}|_{f=0}), 0) \tag{81} \\
&\quad -\log \ell_{\text{leaf}}(0, c^0(\mathcal{I}|_{f=0})) \tag{82} \\
&\quad -\log \ell_{\text{leaf}}(c^1(\mathcal{I}|_{f=1}), 0) \tag{83} \\
&\quad -\log \ell_{\text{leaf}}(0, c^0(\mathcal{I}|_{f=1})) \tag{84}
\end{align}

The last step expands the logarithms of the products and adds the term $-2 \log p_{\text{split}}(d+1)$, which is nonnegative because $p_{\text{split}}(d+1) \leq 1$.

From the above two inequalities, we have that:

\begin{align}
h(o_{\mathcal{I},d}) \leq \texttt{cost}(o_{\mathcal{I},d}, a_{\mathcal{I},d,f}) + h(a_{\mathcal{I},d,f}) \tag{85}
\end{align}

as required.
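The chain in Eqs. (54) to (69) can be checked numerically. The sketch below does so on randomly sampled splits; it is only a sanity check, not a substitute for Lemmas 13 and 14. It assumes that $\ell_{\text{leaf}}$ is a Beta-Binomial marginal likelihood with pseudo-counts `rho1` and `rho0`, a definition that lies outside this section. The function names and the value used for $\log p_{\text{split}}(d)$ are hypothetical.

```python
import math
import random


def log_leaf(c1, c0, rho1=2.5, rho0=2.5):
    """log l_leaf(c1, c0); ASSUMED form B(c1 + rho1, c0 + rho0) / B(rho1, rho0)."""
    def log_beta(a, b):
        return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return log_beta(c1 + rho1, c0 + rho0) - log_beta(rho1, rho0)


def chain(c1_left, c0_left, c1_right, c0_right, log_p):
    """Return the three distinct bounds of Eqs. (55)-(69), in the order they appear."""
    c1, c0 = c1_left + c1_right, c0_left + c0_right
    # Eqs. (55)-(57): perfect split of the parent's points.
    pure_parent = -log_p - log_leaf(c1, 0) - log_leaf(0, c0)
    # Eqs. (59)-(65): perfect split applied separately within each child.
    pure_children = (-log_p
                     - log_leaf(c1_left, 0) - log_leaf(c1_right, 0)
                     - log_leaf(0, c0_left) - log_leaf(0, c0_right))
    # Eqs. (67)-(69): each child becomes a leaf with its actual label counts.
    leaf_children = (-log_p
                     - log_leaf(c1_left, c0_left)
                     - log_leaf(c1_right, c0_right))
    return pure_parent, pure_children, leaf_children


rng = random.Random(0)
log_p = math.log(0.95)  # hypothetical value of log p_split(d)
for _ in range(1000):
    counts = [rng.randint(0, 30) for _ in range(4)]
    a, b, c = chain(*counts, log_p)
    assert a <= b + 1e-9 and b <= c + 1e-9
print("all sampled splits satisfy the chain of inequalities")
```

The small tolerance only absorbs floating-point rounding in `math.lgamma`.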

Corollary 15 (Admissibility of Perfect Split Heuristic).

The Perfect Split Heuristic $h$ defined in Definition 7 is admissible, i.e., given the true value of an OR node $f(o)$, we have that $h(o) \leq f(o)$, and given the true value of an AND node $f(a)$, we have that $h(a) \leq f(a)$.

Lemma 16.

Across all iterations of MAPTree, $UB[o]$ represents the minimal cost of any partial solution rooted at OR node $o$ in $\mathcal{G}'$.

Proof of Lemma 16.

We prove this via induction on the iteration. After the first iteration, there is only one terminal node in $\mathcal{G}'$, and only one valid solution exists in $\mathcal{G}'$: $\mathcal{S}_0 = \{o_{[N],0}, t_{[N],0}\}$. Thus, no nodes other than $r = o_{[N],0}$ have valid partial solutions. At this point, $UB[r] = \texttt{cost}(o_{[N],0}, t_{[N],0}) = \texttt{cost}(\mathcal{S}_0)$ and $UB$ is undefined on all other nodes, as required.

In each subsequent iteration, at most one terminal node $t^*$ is added to $\mathcal{G}'$. We will show that for any OR node $o_{\mathcal{I},d}$, $UB[o_{\mathcal{I},d}]$ represents the minimal cost of any partial solution rooted at $o_{\mathcal{I},d}$. We prove this by induction on the size of $\mathcal{I}$ for $o_{\mathcal{I},d} \in \mathcal{G}'$.

When $|\mathcal{I}| = 1$, any split will lead to an empty subtree, meaning that if $t_{\mathcal{I},d} \in \mathcal{G}'$ then $\{o_{\mathcal{I},d}, t_{\mathcal{I},d}\}$ is the only partial solution rooted at $o_{\mathcal{I},d}$, and otherwise no such partial solution exists. If $t_{\mathcal{I},d} \notin \mathcal{G}'$, this implies that $o_{\mathcal{I},d} \notin \mathcal{E}$, meaning $UB[o_{\mathcal{I},d}]$ is undefined. If $t_{\mathcal{I},d} \in \mathcal{G}'$ and $|\mathcal{I}| = 1$, then $UB[o_{\mathcal{I},d}]$ is $\texttt{cost}(o_{\mathcal{I},d}, t_{\mathcal{I},d})$, as required.

When $|\mathcal{I}| > 1$, we have two cases:

Case 1: There exists a minimal cost partial solution rooted at $o_{\mathcal{I},d}$ which does not contain $t^*$.
In this case, the cost of this partial solution is still the minimal cost across any partial solution rooted at $o_{\mathcal{I},d}$. $UB[o_{\mathcal{I},d}]$ remains unchanged in this case. (Otherwise, a child’s $UB$ must have been updated to a value such that there is now a partial solution rooted at $o_{\mathcal{I},d}$ which contains that child and its new minimal cost partial solution. However, since the only terminal node added to $\mathcal{G}'$ in this iteration was $t^*$, this implies that the child’s new minimal cost solution must contain $t^*$, which is a contradiction.) Since $UB[o_{\mathcal{I},d}]$ was not changed, our inductive hypothesis over iterations states that $UB[o_{\mathcal{I},d}]$ still represents the minimal cost of any partial solution rooted at OR node $o_{\mathcal{I},d}$ in $\mathcal{G}'$.

Case 2: All minimal cost partial solutions rooted at $o_{\mathcal{I},d}$ contain $t^*$.
Consider a child $c^*$ that is part of some such minimal cost partial solution. In this case, our inductive hypothesis over $|\mathcal{I}|$ gives us that $UB$ is correctly updated in this iteration. It follows that $o_{\mathcal{I},d}$ will be added to the queue in updateUpperBounds after this child because it must have lower depth. As a result, $UB[o_{\mathcal{I},d}]$ will be set to the minimal cost of any partial solution rooted at $o_{\mathcal{I},d}$, as required.

We conclude that, in either case, $UB[o]$ represents the minimal cost of any partial solution rooted at OR node $o$ in $\mathcal{G}'$. ∎

Lemma 17.

$\texttt{getSolution}(o_{\mathcal{I},d})$ outputs a minimal cost partial solution of $\mathcal{G}'$ rooted at OR node $o_{\mathcal{I},d}$.

Proof of Lemma 17.

We will show that $\texttt{getSolution}(o_{\mathcal{I},d})$ outputs a minimal cost partial solution of $\mathcal{G}'$ rooted at OR node $o_{\mathcal{I},d}$ via induction on $|\mathcal{I}|$. When $|\mathcal{I}| = 1$, any split will lead to an empty subtree, so $\texttt{getSolution}(o_{\mathcal{I},d})$ must return $\{o_{\mathcal{I},d}, t_{\mathcal{I},d}\}$. When $|\mathcal{I}| > 1$, $\texttt{getSolution}(o_{\mathcal{I},d})$ will either stop, when stopping attains the minimal cost, or recurse on a split that yields the minimal $UB$ value. Lemma 16 shows that the $UB$ values of $o_{\mathcal{I},d}$ and its children are equal to the minimal cost across all partial solutions rooted at each of these respective nodes. As a result, if getSolution stops, then $\{o_{\mathcal{I},d}, t_{\mathcal{I},d}\}$ is a minimal cost partial solution.
Otherwise, if getSolution splits on feature $f$, $\{o_{\mathcal{I},d}, a_{\mathcal{I},d,f}\} \cup \texttt{getSolution}(o_{\mathcal{I}|_{f=0},d+1}) \cup \texttt{getSolution}(o_{\mathcal{I}|_{f=1},d+1})$ is also a minimal cost partial solution by the inductive hypothesis. ∎
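The recursion argued about in Lemma 17 can be sketched on a toy explored subgraph. This is a minimal illustration under stated assumptions, not the MAPTree implementation. The dictionary encoding of $\mathcal{G}'$, the node labels, and the function names `upper_bound` and `get_solution` are all hypothetical. For brevity, the sketch recomputes $UB$ recursively instead of maintaining it incrementally as updateUpperBounds does, and it folds the AND-to-OR edge costs into a single split cost.

```python
from typing import Dict, List, Tuple

# Hypothetical toy encoding of the explored subgraph G': each OR node maps to
# (cost(o, t), {feature: (split cost, left OR child, right OR child)}).
# The split cost is assumed to bundle cost(o, a_f) with the AND-to-OR edge costs.
Graph = Dict[str, Tuple[float, Dict[int, Tuple[float, str, str]]]]


def upper_bound(graph: Graph, o: str) -> float:
    """Minimal cost of any partial solution rooted at o (the role UB[o] plays in Lemma 16)."""
    stop_cost, splits = graph[o]
    return min(
        [stop_cost]
        + [c + upper_bound(graph, left) + upper_bound(graph, right)
           for c, left, right in splits.values()]
    )


def get_solution(graph: Graph, o: str) -> List[str]:
    """Stop at o if that attains UB[o]; otherwise recurse on a split attaining it."""
    stop_cost, splits = graph[o]
    best_feature, best_cost = None, stop_cost
    for f, (c, left, right) in splits.items():
        cost = c + upper_bound(graph, left) + upper_bound(graph, right)
        if cost < best_cost:
            best_feature, best_cost = f, cost
    if best_feature is None:
        return [o, f"t[{o}]"]  # terminal node child of o
    _, left, right = splits[best_feature]
    return ([o, f"a[{o},{best_feature}]"]
            + get_solution(graph, left)
            + get_solution(graph, right))


toy = {
    "root": (10.0, {0: (1.0, "left", "right")}),
    "left": (2.0, {}),
    "right": (3.0, {}),
}
print(upper_bound(toy, "root"))   # 6.0 = 1.0 + 2.0 + 3.0, cheaper than stopping at 10.0
print(get_solution(toy, "root"))
```

Here splitting the root on feature 0 is cheaper than stopping, so the returned solution contains the AND node for that split followed by the terminal nodes of both children.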

Proof of Theorem 12.

We will show that upon early termination, MAPTree always returns a minimal cost solution within the explicit subgraph $\mathcal{G}' \subset G$ explored by MAPTree. From Lemma 17, we have that a minimal cost solution of $\mathcal{G}'$ is output by MAPTree, even upon early termination. ∎

Lemma 18.

The lower bounds $LB$ represent correct lower bounds on the true value of a node in every iteration. For any OR node $o$ with true value $f(o)$, we have that $LB[o] \leq f(o)$, and for any AND node $a$ with true value $f(a)$, we have that $LB[a] \leq f(a)$.

Proof of Lemma 18.

We will show that for any OR node o,dsubscript𝑜𝑑o_{\mathcal{I},d}italic_o start_POSTSUBSCRIPT caligraphic_I , italic_d end_POSTSUBSCRIPT, we have that LB[o,d]f(o,d)𝐿𝐵delimited-[]subscript𝑜𝑑𝑓subscript𝑜𝑑LB[o_{\mathcal{I},d}]\leq f(o_{\mathcal{I},d})italic_L italic_B [ italic_o start_POSTSUBSCRIPT caligraphic_I , italic_d end_POSTSUBSCRIPT ] ≤ italic_f ( italic_o start_POSTSUBSCRIPT caligraphic_I , italic_d end_POSTSUBSCRIPT ). This follows from Corollary 15. Throughout MAPTree, LB[o,d]𝐿𝐵delimited-[]subscript𝑜𝑑LB[o_{\mathcal{I},d}]italic_L italic_B [ italic_o start_POSTSUBSCRIPT caligraphic_I , italic_d end_POSTSUBSCRIPT ] is set to either:

  1. 1.

    h(o,d)subscript𝑜𝑑h(o_{\mathcal{I},d})italic_h ( italic_o start_POSTSUBSCRIPT caligraphic_I , italic_d end_POSTSUBSCRIPT )

  2. 2.

    minc{t,d,a,d,1,,a,d,F}𝚌𝚘𝚜𝚝(o,d,c)+LB[c]subscript𝑐subscript𝑡𝑑subscript𝑎𝑑1subscript𝑎𝑑𝐹𝚌𝚘𝚜𝚝subscript𝑜𝑑𝑐𝐿𝐵delimited-[]𝑐\min_{c\in\{t_{\mathcal{I},d},a_{\mathcal{I},d,1},\dots,a_{\mathcal{I},d,F}\}}% \texttt{cost}(o_{\mathcal{I},d},c)+LB[c]roman_min start_POSTSUBSCRIPT italic_c ∈ { italic_t start_POSTSUBSCRIPT caligraphic_I , italic_d end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT caligraphic_I , italic_d , 1 end_POSTSUBSCRIPT , … , italic_a start_POSTSUBSCRIPT caligraphic_I , italic_d , italic_F end_POSTSUBSCRIPT } end_POSTSUBSCRIPT cost ( italic_o start_POSTSUBSCRIPT caligraphic_I , italic_d end_POSTSUBSCRIPT , italic_c ) + italic_L italic_B [ italic_c ]

In the first case, Corollary 15 gives us that $h(o_{\mathcal{I},d}) \leq f(o_{\mathcal{I},d})$. In the second case, we induct on iteration. First, though MAPTree does not query $LB$ for nodes on which a value has not yet been assigned, we will assume for the purpose of this proof that $LB$ defaults to $0$. Thus, before the first iteration, $LB$ is $0$ across all nodes. Our cost function $\texttt{cost}$ is nonnegative, so $LB[o_{\mathcal{I},d}] = 0 \leq f(o_{\mathcal{I},d})$ must hold. For future iterations then, we have the following for any node $o_{\mathcal{I},d}$ on which $LB$ is defined:

\begin{align}
LB[o_{\mathcal{I},d}] &:= \min_{c \in \{t_{\mathcal{I},d}, a_{\mathcal{I},d,1}, \dots, a_{\mathcal{I},d,F}\}} \texttt{cost}(o_{\mathcal{I},d}, c) + LB[c] \tag{86} \\
&\leq \min_{c \in \{t_{\mathcal{I},d}, a_{\mathcal{I},d,1}, \dots, a_{\mathcal{I},d,F}\}} \texttt{cost}(o_{\mathcal{I},d}, c) + f(c) \tag{87} \\
&\leq f(o_{\mathcal{I},d}) \tag{88}
\end{align}

Thus, $LB$ is a true lower bound, as required. ∎
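The update in (86) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary-based `lb` table, the `cost` callable, and the node names are hypothetical stand-ins rather than MAPTree's actual data structures, and nodes without an assigned value default to $0$, as assumed in the proof above.

```python
from typing import Callable, Dict, Hashable, Iterable


def update_lower_bound(
    node: Hashable,
    children: Iterable[Hashable],
    cost: Callable[[Hashable, Hashable], float],
    lb: Dict[Hashable, float],
) -> float:
    """Apply LB[o] := min over children c of cost(o, c) + LB[c]."""
    # Children with no assigned value default to 0 (the proof's convention).
    new_value = min(cost(node, c) + lb.get(c, 0.0) for c in children)
    lb[node] = new_value
    return new_value


# Toy example with made-up costs: one terminal child "t" and two AND children.
toy_costs = {("o", "t"): 5.0, ("o", "a1"): 1.0, ("o", "a2"): 2.0}
toy_lb = {"a1": 3.5}  # "t" and "a2" have no value yet, so they default to 0
result = update_lower_bound(
    "o", ["t", "a1", "a2"], lambda o, c: toy_costs[(o, c)], toy_lb
)
```

Because every term `cost(o, c) + LB[c]` is at most `cost(o, c) + f(c)` under the inductive hypothesis, the minimum computed here can never exceed the minimum in (87).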

Lemma 19.

For any OR node $o$, if $LB[o] < UB[o]$, then $o$ must have some OR node descendant $o'$ such that $o' \notin \mathcal{E}$.

Proof of Lemma 19.

We prove the contrapositive. Consider any OR node $o_{\mathcal{I},d}$. We show that if every OR node descendant of $o_{\mathcal{I},d}$ is in $\mathcal{E}$, then $LB[o_{\mathcal{I},d}] = UB[o_{\mathcal{I},d}]$. We show this via induction on $|\mathcal{I}|$. When $|\mathcal{I}| = 1$, splitting further incurs infinite cost, so $LB[o_{\mathcal{I},d}] = UB[o_{\mathcal{I},d}] = \texttt{cost}(o_{\mathcal{I},d}, t_{\mathcal{I},d})$. When $|\mathcal{I}| > 1$, if every OR node descendant of $o_{\mathcal{I},d}$ is in $\mathcal{E}$, then $o_{\mathcal{I},d} \in \mathcal{E}$ must also hold. Since the OR node descendants of the children of $o_{\mathcal{I},d}$ must be in $\mathcal{E}$ as well, they must all have matching $UB$ and $LB$ by the inductive hypothesis, meaning $LB[o_{\mathcal{I},d}] = UB[o_{\mathcal{I},d}]$. ∎

Lemma 20.

If $LB[r] < UB[r]$, then findNodeToExpand returns an unexpanded OR node $o \notin \mathcal{E}$.

Proof.

This follows from Lemma 19 and the fact that findNodeToExpand selects an AND node with the lowest lower bound and the child of this AND node with the largest gap between $LB$ and $UB$, meaning a nonzero gap is chosen when one exists. ∎

Proof of Theorem 9.

Lemma 19 gives us that $LB[r] = UB[r]$ must hold upon exhaustive exploration of the search space. From Lemma 20, we have that MAPTree will always expand a new node in every iteration in which it has not yet completed. Since $\mathcal{G}$ is finite, as discussed after Definition 4, it follows that MAPTree must eventually complete. ∎

Proof of Theorem 10.

This follows directly from Theorem 9, Lemma 16, and Lemma 18. ∎

Appendix B Experiment Details and Additional Experiments

B.1 Experiment Details

In Section 6, we compared the performance of MAPTree against various state-of-the-art baselines. In this subsection, we describe those baselines and the experiments in more detail.

Speed Comparisons against MCMC and SMC

In this set of experiments in Section 6, we compared against the Sequential Monte Carlo (SMC) and Markov-Chain Monte Carlo (MCMC) methods (Lakshminarayanan, Roy, and Teh 2013), which sample from the BCART posterior of Chipman, George, and McCulloch (1998). We used the posterior distribution hyperparameters for each algorithm specified in Lakshminarayanan, Roy, and Teh (2013). To gather results for each baseline with varying times, we set the number of islands in SMC to 10, 30, 100, 300, and 1000 and the number of iterations for MCMC to 10, 30, 100, 300, and 1000. For each of these settings we performed 10 runs with different random seeds and recorded the average time across runs, the mean log posterior of the tree with the highest log posterior discovered in each run, and a 95% bootstrapped confidence interval of the average of the highest log posteriors discovered across all runs. MAPTree was run with the number of expansions limited to 10, 30, 100, 300, 1000, 3000, 10000, 30000, and 100000. We also performed one additional run of MAPTree with a 10 minute time limit. Since MAPTree is a deterministic algorithm, we did a single run for each of these settings, recording the time and log posterior of the returned tree. Runs which ran out of memory on our computing cluster with 16 GB of RAM within the time limit were discarded.
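As an illustration of the confidence intervals described above, the following sketch computes a percentile-bootstrap interval for the mean of the best log posteriors across seeds. The percentile method, the number of resamples, and the example values are assumptions made for this sketch; the text does not specify which bootstrap variant was used.

```python
import random
import statistics
from typing import List, Sequence, Tuple


def bootstrap_mean_ci(
    values: Sequence[float],
    n_resamples: int = 10_000,  # assumed; not stated in the text
    confidence: float = 0.95,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile-bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    means: List[float] = []
    for _ in range(n_resamples):
        # Resample the runs with replacement and record the resampled mean.
        resample = [values[rng.randrange(n)] for _ in range(n)]
        means.append(statistics.fmean(resample))
    means.sort()
    tail = (1.0 - confidence) / 2.0
    lo_idx = int(tail * n_resamples)
    hi_idx = min(n_resamples - 1, int((1.0 - tail) * n_resamples))
    return means[lo_idx], means[hi_idx]


# Made-up best log posteriors from 10 seeds of one sampler setting.
best_log_posteriors = [-120.4, -118.9, -121.7, -119.3, -120.0,
                       -122.1, -118.5, -119.8, -120.9, -121.2]
low, high = bootstrap_mean_ci(best_log_posteriors)
```

A deterministic algorithm such as MAPTree needs no such interval, which is why a single run per setting suffices for it.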

Note that, given that MAPTree, the SMC baseline, and MCMC baseline all explore the same posterior, measuring their relative generalization performance would not be meaningful; it is more meaningful to measure how quickly they can explore the posterior $P(T \mid \mathcal{X}, \mathcal{Y})$ to discover the maximum a posteriori tree. Given infinite runtime, all three algorithms (MAPTree, SMC, and MCMC) should recover the maximum a posteriori tree from the BCART posterior. However, recent work has proven that algorithms such as SMC and MCMC experience long mixing times on the BCART posterior (Kim and Rockova 2023). Our experiments are consistent with these observations; we find that MAPTree is able to find higher likelihood trees faster than the SMC and MCMC baselines in most datasets. Furthermore, in 5 of the 16 datasets, MAPTree is able to recover a maximum a posteriori tree and provide a certificate of optimality.

Fitting a Synthetic Dataset

In the remaining two sets of experiments, we compare MAPTree’s generalization performance and model size to baseline state-of-the-art ODT and OSDT algorithms. ODT algorithms search for trees which minimize misclassification error given a maximum depth. OSDT algorithms search for trees which minimize the same objective but with an added per-leaf sparsity penalty in lieu of a hard depth constraint. We use DL8.5 (Aglin, Nijssen, and Schaus 2020) as our baseline ODT algorithm and GOSDT (Lin et al. 2020) as our baseline OSDT algorithm. Note that these algorithms optimize different objectives than MAPTree and do not explicitly explore the posterior $P(T \mid \mathcal{X}, \mathcal{Y})$, so we are primarily interested in the different methods’ generalization performance. We also use CART with constrained depth (Breiman et al. 1984) as the baseline representative of greedy, top-down approaches. All of these baselines are sensitive to their choices of hyperparameters, in particular the maximum depth for CART and DL8.5, and the sparsity penalty for GOSDT. We experimented with these hyperparameters to find the best-performing ones for each baseline algorithm, and presented the results for representative settings in Section 6. In particular, we chose maximum depth $4$ for CART, maximum depth $4$-$5$ for DL8.5, and sparsity penalties $\frac{10}{32}$ and $\frac{1}{32}$ for GOSDT. These two sparsity penalties were taken from Lin et al. (2020): the former was used in the evaluation of GOSDT’s speed and the latter was used in the evaluation of its accuracy. Our experiments set a time limit of 1 minute across all algorithms; the best tree discovered by each algorithm within this time limit was recorded.
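For concreteness, the two baseline objectives described above can be written side by side. The symbols $D$ (maximum depth), $\lambda$ (per-leaf sparsity penalty), and the normalization of the error by the number of training samples $N$ are notational assumptions introduced here for illustration; they are not taken from the baselines' papers verbatim.

```latex
\begin{align*}
\text{ODT:}\quad  & \min_{T \,:\, \mathrm{depth}(T) \le D} \ \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left[ T(x_i) \ne y_i \right] \\
\text{OSDT:}\quad & \min_{T} \ \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left[ T(x_i) \ne y_i \right] + \lambda \cdot \#\mathrm{leaves}(T)
\end{align*}
```

Under this reading, a penalty of $\frac{1}{32}$ means that an extra leaf is only worthwhile if it reduces the training error rate by more than $\frac{1}{32}$.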

The synthetic dataset was created via the process described in Section 6.

Accuracy, Likelihood and Size Comparisons on Real World Benchmarks

In this set of experiments, we compared MAPTree to the baseline algorithms as described in Appendix B.1 on the CP4IM datasets, and the hyperparameters for all algorithms were set to the same values. Again, our experiments set a time limit of 1 minute across all algorithms; the best tree discovered by each algorithm within this time limit was recorded.

The metrics we measure are the per-sample test log likelihood and test accuracy (relative to the performance of CART), and the total number of nodes in the trained tree. We performed stratified 10-fold cross validation on each dataset and recorded the average value across folds for each metric. The average per-sample test log likelihood and test accuracy of CART with maximum depth 4 was subtracted from all baselines on each dataset to get the relative per-sample test log likelihood and test accuracy. The plots in Figure 5 are box-and-whisker plots of the metric values across all 16 datasets, where each box represents the 25th to 75th percentile, whiskers extend out to at most $1.5\times$ the size of the box body, and the remaining points are marked as outliers.
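The post-processing described above (subtracting the CART reference and summarizing the per-dataset values as a box-and-whisker plot) can be sketched as follows. The function names, the choice of the "inclusive" quartile method, and the example numbers are assumptions for illustration; plotting libraries may compute quartiles slightly differently.

```python
import statistics
from typing import Dict, Sequence


def relative_to_reference(
    metric: Dict[str, float], reference: Dict[str, float]
) -> Dict[str, float]:
    """Subtract the reference value (e.g. CART with depth 4) per dataset."""
    return {name: metric[name] - reference[name] for name in metric}


def box_stats(values: Sequence[float]) -> Dict[str, object]:
    """Quartiles, whisker ends, and outliers using the 1.5 x IQR rule."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lo_fence, hi_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in values if lo_fence <= v <= hi_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        # Whiskers stop at the most extreme points still inside the fences.
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": sorted(v for v in values if v < lo_fence or v > hi_fence),
    }


# Made-up relative accuracies (in percentage points) for eight datasets.
stats = box_stats([-2.0, -1.0, 0.0, 0.0, 1.0, 1.0, 2.0, 15.0])
```

In this toy example the value 15.0 lies beyond the upper fence and would be drawn as an outlier point rather than extending the whisker.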

B.2 Additional Experiments

In this subsection, we present additional experimental results that were omitted from the main paper due to space constraints.

Figure 6 shows that MAPTree generates trees which outperform both the greedy, top-down approaches and ODT methods in test accuracy for various training dataset sizes and values of label corruption proportion $\epsilon$.

Figure 6: Test accuracy of MAPTree and various baseline algorithms as a function of training dataset size on the synthetic dataset, for different values of noise, $\epsilon$. MAPTree generates trees which outperform both the greedy, top-down approaches and ODT methods in test accuracy for various training dataset sizes and values of label corruption proportion $\epsilon$. 95% confidence intervals are obtained via bootstrapping the results of 20 synthetic datasets generated independently at random.

Speed Comparisons against MCMC and SMC (Full)

In this subsection, we include the results of the speed comparisons of MAPTree with SMC and MCMC on the remaining 12 datasets of the CP4IM dataset that were not presented in Section 6 due to space constraints. Figure 7 demonstrates a similar trend on the additional datasets as was demonstrated by Figure 4, namely that MAPTree generally outperforms SMC and MCMC and is able to find trees with higher log posterior faster than the baseline algorithms.

Figure 7: Comparison of MAPTree, SMC, and MCMC on 12 datasets. Curves are created by modifying the hyperparameters for each algorithm and measuring training time and log posterior of the data under the tree. Higher and further left is better, i.e., better log posteriors in less time. In 12 of the 16 datasets, MAPTree outperforms SMC and MCMC and is able to find trees with higher log posterior faster than the baseline algorithms. Furthermore, in 5 of the 16 datasets, MAPTree converges to the provably optimal tree, i.e., the maximum a posteriori tree. 95% confidence intervals are obtained by bootstrapping the results of 10 random seeds and time is averaged across the 10 seeds.

Fitting a Synthetic Dataset (Full)

In this subsection, we include the results of the synthetic data experiment against benchmarks with a more exhaustive list of hyperparameters, which we omitted in Section 6 due to space constraints. Figures 8, 9, and 10 demonstrate similar trends to those in Figure 6, namely that MAPTree generally outperforms the baseline algorithms with less training data and is more robust to label noise than the baselines.

Figure 8: Test accuracy of MAPTree and DL8.5, for various hyperparameter settings of DL8.5, as a function of training dataset size on the synthetic dataset, for different values of noise, $\epsilon$. MAPTree generates trees which outperform DL8.5 for various training dataset sizes and values of label corruption proportion $\epsilon$. 95% confidence intervals are derived by bootstrapping the results across 20 synthetic datasets generated independently at random.

Figure 9: Test accuracy of MAPTree and GOSDT, for various hyperparameter settings of GOSDT, as a function of training dataset size on the synthetic dataset, for different values of noise, $\epsilon$. MAPTree generates trees which outperform GOSDT for various training dataset sizes and values of label corruption proportion $\epsilon$. 95% confidence intervals are derived by bootstrapping the results across 20 synthetic datasets generated independently at random.

Figure 10: Test accuracy of MAPTree and CART, for various hyperparameter settings of CART, as a function of training dataset size on the synthetic dataset, for different values of noise, $\epsilon$. MAPTree generates trees which outperform CART for various training dataset sizes and values of label corruption proportion $\epsilon$. 95% confidence intervals are derived by bootstrapping the results across 20 synthetic datasets generated independently at random.

Accuracy, Likelihood and Size Comparison on Real World Benchmarks (Full)

In this subsection, we include the results of accuracy, likelihood, and size comparisons against benchmarks with a more exhaustive list of hyperparameter settings, which we omitted in Section 6 due to space constraints. Figure 11 demonstrates a similar trend to that in Figure 5, namely that MAPTree generally either a) outperforms the baseline algorithms in generalization performance, or b) performs comparably but with smaller trees. Further, we observe that CART and DL8.5 are sensitive to their hyperparameter settings: at higher maximum depths, both algorithms output much larger trees that do not perform any better than their shallower counterparts.

Figure 11: We run MAPTree and benchmarks with a more exhaustive list of hyperparameter settings on the 16 real world datasets from the CP4IM dataset (Guns, Nijssen, and De Raedt 2011). Higher is better for the left and center subplots, and lower is better for the right subplot.

Hyperparameters of MAPTree

In this subsection, we demonstrate that MAPTree is not sensitive to the choice of hyperparameters $\alpha$ and $\beta$. We run MAPTree on all 16 benchmark datasets from CP4IM (Guns, Nijssen, and De Raedt 2011) with seven different hyperparameter settings of the prior distribution used in MAPTree:

  0. $\alpha = 0.999, \beta = 0.1$

  1. $\alpha = 0.99, \beta = 0.2$

  2. $\alpha = 0.95, \beta = 0.5$

  3. $\alpha = 0.9, \beta = 1.0$

  4. $\alpha = 0.8, \beta = 2.0$

  5. $\alpha = 0.5, \beta = 4.0$

  6. $\alpha = 0.2, \beta = 8.0$

These prior specifications were chosen to cover a range of the hyperparameters $\alpha$ and $\beta$ that induce MAPTree to have different priors over the probability of splitting. We expect the size of trees generated to be higher for earlier priors and lower for later priors. Relative to the first prior specification, the last prior specification assumes an over $1000\times$ lower probability of splitting at depth $1$ a priori.
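The "over $1000\times$" figure can be checked with a short calculation. The sketch below assumes the standard BCART split prior of Chipman, George, and McCulloch (1998), $p_{\text{split}}(d) = \alpha (1 + d)^{-\beta}$, with the root at depth $0$; this functional form and depth convention are assumptions of the sketch, and the function name is hypothetical.

```python
def split_probability(alpha: float, beta: float, depth: int) -> float:
    """Assumed BCART-style prior probability that a node at `depth` splits."""
    return alpha * (1.0 + depth) ** (-beta)


# The seven (alpha, beta) settings listed above, in order.
PRIORS = [
    (0.999, 0.1), (0.99, 0.2), (0.95, 0.5), (0.9, 1.0),
    (0.8, 2.0), (0.5, 4.0), (0.2, 8.0),
]

depth_one = [split_probability(a, b, depth=1) for a, b in PRIORS]
# Ratio of the first to the last setting's split probability at depth 1.
ratio = depth_one[0] / depth_one[-1]
```

Under this assumed form the ratio comes out at roughly $1.2 \times 10^{3}$, and the split probabilities at depth $1$ decrease monotonically from the first setting to the last, consistent with the expectation that earlier priors favor larger trees.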

We measure the test accuracy relative to CART, the per-sample test log likelihood relative to CART, and the model size of the trees generated by MAPTree with the different hyperparameter settings. Figure 12 demonstrates our results. MAPTree does not show significant sensitivity to its hyperparameters across any metric.

Figure 12: We run MAPTree on the 16 real world datasets from CP4IM (Guns, Nijssen, and De Raedt 2011) for various hyperparameter settings. We find that the trees generated by MAPTree are not sensitive to the exact values of hyperparameters.

Appendix C Implementation Details

C.1 Reversible Sparse Bitset

In order to efficiently explore the search space $\mathcal{G}_{\mathcal{X},\mathcal{Y}}$, described in Definition 4, MAPTree must be able to compactly represent subproblems and move between them efficiently. Subproblems in $\mathcal{G}_{\mathcal{X},\mathcal{Y}}$ correspond with a subset $\mathcal{I}$ of samples from the dataset $\mathcal{X}, \mathcal{Y}$ as well as a given depth. We represent these subproblems using a reversible sparse bitset, a data structure also used in DL8.5 (Aglin, Nijssen, and Schaus 2020). Reversible sparse bitsets represent $\mathcal{I}$ as a list of indexed bitstring blocks that correspond with nonempty subarrays of the current bitset. These blocks also record a history of their previous values, allowing us to efficiently “reverse” a bitset to its previous value before branching into another region of the search space. For more details on reversible sparse bitsets, we direct the reader to Verhaeghe, Lecoutre, and Schaus (2018).
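To make the idea concrete, the following is a deliberately simplified sketch of a reversible sparse bitset. It is not MAPTree's or DL8.5's implementation: the class and method names are hypothetical, and the undo log is a plain Python list rather than the trailing mechanism described by Verhaeghe, Lecoutre, and Schaus (2018). It only illustrates the two properties named above, namely tracking nonzero blocks and restoring earlier values.

```python
from typing import Iterable, List, Tuple

WORD_BITS = 64


def mask_from_indices(indices: Iterable[int], size: int) -> List[int]:
    """Build a list of 64-bit words with the given bit positions set."""
    words = [0] * ((size + WORD_BITS - 1) // WORD_BITS)
    for i in indices:
        words[i // WORD_BITS] |= 1 << (i % WORD_BITS)
    return words


class ReversibleSparseBitset:
    """Bitset that tracks only nonzero words and can undo intersections."""

    def __init__(self, size: int) -> None:
        # Start with every sample index in the subset.
        self.words: List[int] = mask_from_indices(range(size), size)
        self.nonzero: List[int] = [i for i, w in enumerate(self.words) if w]
        # Each trail entry stores the previous nonzero list and changed words.
        self.trail: List[Tuple[List[int], List[Tuple[int, int]]]] = []

    def intersect(self, mask: List[int]) -> None:
        """Keep only bits also set in `mask`, logging what changed."""
        changed: List[Tuple[int, int]] = []
        still_nonzero: List[int] = []
        for idx in self.nonzero:  # zero words are skipped entirely
            old = self.words[idx]
            new = old & mask[idx]
            if new != old:
                changed.append((idx, old))
                self.words[idx] = new
            if new:
                still_nonzero.append(idx)
        self.trail.append((self.nonzero, changed))
        self.nonzero = still_nonzero

    def undo(self) -> None:
        """Reverse the most recent intersection."""
        old_nonzero, changed = self.trail.pop()
        for idx, old in changed:
            self.words[idx] = old
        self.nonzero = old_nonzero

    def count(self) -> int:
        return sum(bin(self.words[i]).count("1") for i in self.nonzero)


bits = ReversibleSparseBitset(130)
bits.intersect(mask_from_indices(range(0, 130, 2), 130))  # even indices
bits.intersect(mask_from_indices(range(64), 130))         # first block only
```

After the two intersections only the first block is nonzero, so later operations touch one word instead of three; calling `undo()` twice restores the full set, which is what allows a search to back out of one branch before exploring another.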

C.2 Caching Subproblems

It is also necessary to identify equivalent subproblems in our search. To do this, we cache each explored subproblem $o_{\mathcal{I},d}$. Previous ODT algorithms cache $(\mathcal{I}, d)$ explicitly or the path of splits used to reach the subproblem: loosely, $\text{path}(o_{\mathcal{I},d})$ (Demirović et al. 2022; Aglin, Nijssen, and Schaus 2020). The former approach takes up $O(N)$ memory per subproblem whereas the latter uses less memory but does not identify all equivalences between subproblems (i.e., two paths may result in the same subset of datapoints), meaning it is slower. In MAPTree, we provide a probably correct cache which takes $O(1)$ memory per subproblem and always identifies equivalent subproblems. The cache stores the depth $d$ and constructs a 128-bit hash of $\mathcal{I}$ for each subproblem $o_{\mathcal{I},d}$. We describe the 128-bit hashing function in Algorithm 6.

Algorithm 6 subsetHash

Input: Subset of sample indices $\mathcal{I}$
Output: 128-bit hash value $h$

  Let bitset be a bitset of size $N$
  for all $i \in [N]$ do
     $\texttt{bitset}[i] := i \in \mathcal{I}$
  end for
  Let $h_1, h_2$ be 64-bit integers.
  for all $b \in [N/64]$ do
     Let block be $\texttt{bitset}[64b : 64(b+1)]$
     $h_1 := h_1 + \texttt{block} \times (377424577268497867)^{b}$
     $h_2 := h_2 + \texttt{block} \times (285989758769553131)^{b}$
  end for
  return $(h_1, h_2)$
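A possible Python rendering of Algorithm 6, together with a cache keyed on the hash and depth, is sketched below. Several details are assumptions rather than facts from the text: $h_1$ and $h_2$ start at zero, all arithmetic wraps modulo $2^{64}$ (as it would for unsigned 64-bit integers), blocks are indexed from zero, and the block count is rounded up so that a partial final block is included. The `lookup_or_create` helper and the record fields it stores are hypothetical.

```python
from typing import Dict, Iterable, Tuple

MASK64 = (1 << 64) - 1
BASE_1 = 377424577268497867
BASE_2 = 285989758769553131


def subset_hash(indices: Iterable[int], n: int) -> Tuple[int, int]:
    """Hash a subset of {0, ..., n-1} into two 64-bit integers."""
    n_blocks = (n + 63) // 64  # assumption: include a partial last block
    blocks = [0] * n_blocks
    for i in indices:
        blocks[i // 64] |= 1 << (i % 64)
    h1 = h2 = 0  # assumption: both hashes start at zero
    p1 = p2 = 1  # running powers BASE^b, reduced modulo 2^64
    for block in blocks:
        h1 = (h1 + block * p1) & MASK64
        h2 = (h2 + block * p2) & MASK64
        p1 = (p1 * BASE_1) & MASK64
        p2 = (p2 * BASE_2) & MASK64
    return h1, h2


# Cache keyed by (128-bit hash, depth); memory per entry is independent of n.
cache: Dict[Tuple[Tuple[int, int], int], dict] = {}


def lookup_or_create(indices: Iterable[int], n: int, depth: int) -> dict:
    """Return the cached record for a subproblem, creating it if needed."""
    key = (subset_hash(indices, n), depth)
    # The stored fields are placeholders for whatever the search tracks.
    return cache.setdefault(key, {"lb": 0.0, "ub": float("inf")})


first = lookup_or_create([1, 2, 3], n=10, depth=2)
second = lookup_or_create([3, 2, 1], n=10, depth=2)  # same subset, other order
deeper = lookup_or_create([1, 2, 3], n=10, depth=3)
```

Because the key depends only on the subset and the depth, two different split paths that arrive at the same datapoints share one cache entry. Two distinct subsets could in principle collide on the 128-bit hash, which is why the cache is described as probably, rather than provably, correct.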