MAPTree: Beating “Optimal” Decision Trees with Bayesian Decision Trees

Colin Sullivan*¹, Mo Tiwari*¹, Sebastian Thrun¹ (*equal contribution)

Abstract

Decision trees remain one of the most popular machine learning models today, largely due to their out-of-the-box performance and interpretability. In this work, we present a Bayesian approach to decision tree induction via maximum a posteriori inference of a posterior distribution over trees. We first demonstrate a connection between maximum a posteriori inference of decision trees and AND/OR search. Using this connection, we propose an AND/OR search algorithm, dubbed MAPTree, which is able to recover the maximum a posteriori tree. Lastly, we demonstrate the empirical performance of the maximum a posteriori tree both on synthetic data and in real world settings. On 16 real world datasets, MAPTree either outperforms baselines or demonstrates comparable performance but with much smaller trees. On a synthetic dataset, MAPTree also demonstrates greater robustness to noise and better generalization than existing approaches. Finally, MAPTree recovers the maximum a posteriori tree faster than existing sampling approaches and, in contrast with those algorithms, is able to provide a certificate of optimality. The code for our experiments is available at https://github.com/ThrunGroup/maptree.

1 Introduction

Decision trees are amongst the most widely used machine learning models today due to their empirical performance, generality, and interpretability. A decision tree is a binary tree in which each internal node corresponds to an if/then/else comparison on a feature value; a label for a datapoint is produced by determining the corresponding leaf node into which it falls. The predicted label is usually the majority vote (respectively, mean) of the label of training datapoints at the leaf node in classification (respectively, regression).

Despite recent advances in neural networks, decision trees remain a popular choice amongst machine learning practitioners. Decision trees form the backbone of more complex ensemble models such as Random Forest (Breiman 2001) and XGBoost (Chen and Guestrin 2016), which have been the leading models in many machine learning competitions and often outperform neural networks on tabular data (Grinsztajn, Oyallon, and Varoquaux 2022). Decision trees naturally work with complex data where the features can be of mixed data types, e.g., binary, categorical, or continuous. Furthermore, decision trees are highly interpretable and the prediction-generating process can be inspected, which can be a necessity in domains such as law and healthcare. In addition, inference in decision trees is highly efficient as it relies only on feature value comparisons. Given decision trees’ popularity, an improvement upon existing decision tree approaches would have widespread impact.

Contributions: In this work, we:

• Formalize a connection between maximum a posteriori inference of Bayesian Classification and Regression Trees (BCART) and AND/OR search problems,
• Propose an algorithm, dubbed MAPTree, for search on AND/OR graphs that recovers the maximum a posteriori tree of the BCART posterior over decision trees,
• Demonstrate that MAPTree is significantly faster than previous sampling-based approaches,
• Demonstrate that the tree recovered by MAPTree either a) outperforms current state-of-the-art algorithms in performance, or b) demonstrates comparable performance but with smaller trees, and
• Provide a heavily optimized C++ implementation that is also callable from Python for practitioners.

2 Related Work

In this work, we focus on the construction of individual decision trees. We compare our proposed algorithm with four main classes of prior algorithms: greedy algorithms, “Optimal” Decision Trees (ODTs), “Optimal” Sparse Decision Trees (OSDTs), and sampling-based approaches.

The most popular method for constructing decision trees is a greedy approach that recursively splits nodes based on a heuristic such as Gini impurity or entropy (in classification) or mean-squared error (in regression) (Quinlan 1986). However, individual decision trees constructed in this manner often overfit the training data; ensemble methods such as Random Forest and XGBoost attempt to ameliorate overfitting but are significantly more complex than a single decision tree (Breiman 2001; Chen and Guestrin 2016).

So-called “optimal” decision trees reformulate the problem of decision tree induction as a global optimization problem, i.e., to find the tree that maximizes a global objective function, such as training accuracy, among trees of a given maximum depth (Bertsimas and Dunn 2017; Verwer and Zhang 2019; Verhaeghe et al. 2020; Aglin, Nijssen, and Schaus 2020). Though this problem is NP-Hard in general (Hyafil and Rivest 1976), existing approaches can find the global optimum of shallow trees (depth $\leq 5$) on medium-sized datasets with thousands of datapoints and tens of features. The original ODT approaches were based on mixed integer programming or binary linear program formulations (Verhaeghe et al. 2020; Nijssen and Fromont 2007; Bertsimas and Dunn 2017; Verwer and Zhang 2019). Other work attempts to improve upon these methods using caching branch-and-bound search (Aglin, Nijssen, and Schaus 2020), constraint programming with AND/OR search (Verhaeghe et al. 2020), or dynamic programming with bounds (van der Linden, de Weerdt, and Demirović 2022). ODTs have been shown to outperform their greedily constructed counterparts with smaller trees (Verhaeghe et al. 2020; Verwer and Zhang 2019) but still suffer from several drawbacks. First, choosing the maximum depth hyperparameter is nontrivial, even with cross-validation, and the maximum depth cannot be set too large as the runtime of these algorithms scales exponentially with depth. Furthermore, ODTs often suffer from overfitting, especially when the maximum depth is set too large. Amongst ODT approaches, Verhaeghe et al. (2020) formulates the search for an optimal decision tree in terms of an AND/OR graph and is most similar to ours, but still suffers from the aforementioned drawbacks. Additionally, many ODT algorithms exhibit poor anytime behavior (Kiossou et al. 2023).

Optimal sparse decision trees attempt to adapt ODT approaches to train smaller and sparser trees by incorporating a sparsity penalty in their objectives. As a result, OSDTs are smaller and less prone to overfitting than ODTs (Hu, Rudin, and Seltzer 2019; Lin et al. 2020). These approaches, however, often underfit the data (Hu, Rudin, and Seltzer 2019; Lin et al. 2020).

Another class of approaches, called Bayesian Classification and Regression Trees (BCART), introduces a posterior over tree structures given the data and samples trees from this posterior. Initially, BCART methods were observed to generate better trees than greedy methods (Denison, Mallick, and Smith 1998). Many variations of the BCART methodology were developed using sampling methods based on Markov-Chain Monte Carlo (MCMC), such as Metropolis-Hastings (Pratola 2016) and others (Geels, Pratola, and Herbei 2022; Lakshminarayanan, Roy, and Teh 2013). These methods, however, often suffer from exponentially long mixing times in practice and become stuck in local minima (Kim and Rockova 2023). In one study, the posterior over trees was represented as a lattice over itemsets (Nijssen 2008). This approach discovered the maximum a posteriori tree within the hypothesis space of decision trees. However, this approach required enumerating and storing the entire space of decision trees and therefore placed stringent constraints on the search space of possible trees, based on leaf node support and maximum depth. Our method uses the same posterior over tree structures introduced by BCART. In contrast with prior work, however, we are able to recover the provably maximum a posteriori tree from this posterior in the unconstrained setting.

3 Preliminaries and Notation

In this paper, we focus on the binary classification task, though our techniques extend to multi-class classification and regression. We also focus on binary datasets, as is common in the decision tree literature (Verhaeghe et al. 2020; Nijssen 2008; Nijssen and Fromont 2007) since many datasets can be binarized via bucketing, one-hot encoding, and other techniques.

General notation: We assume we are given a binary dataset $\mathcal{X}\in\{0,1\}^{N\times F}$ with $N$ samples, $F$ features, and associated binary labels $\mathcal{Y}\in\{0,1\}^{N}$. We let $[u]\coloneqq\{1,\ldots,u\}$, let $\mathcal{I}\subseteq[N]$ denote the indices of a subsample of the dataset, and let $(x_{i},y_{i})$ denote the $i$th sample and its label. We define $\mathcal{X}|_{\mathcal{I}}\coloneqq\{x_{i}:i\in\mathcal{I}\}\subset\mathcal{X}$, $\mathcal{Y}|_{\mathcal{I}}\coloneqq\{y_{i}:i\in\mathcal{I}\}\subset\mathcal{Y}$, and $\mathcal{I}|_{f=k}\coloneqq\{i:i\in\mathcal{I}\text{ and }(x_{i})_{f}=k\}$, for $k\in\{0,1\}$. Finally, we let $c^{k}(\mathcal{I})$ be the count of points in $\mathcal{I}$ with label $k\in\{0,1\}$, i.e., $c^{k}(\mathcal{I})=|\{i:i\in\mathcal{I}\text{ and }y_{i}=k\}|$, and let $\mathcal{V}(\mathcal{I})$ be the set of nontrivial feature splits of the samples in $\mathcal{I}$, i.e., the set of features $f$ such that neither $\mathcal{I}|_{f=0}$ nor $\mathcal{I}|_{f=1}$ is empty.
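As a small illustration of this notation, the sketch below computes $\mathcal{I}|_{f=k}$, $c^{k}(\mathcal{I})$, and $\mathcal{V}(\mathcal{I})$ on a toy dataset. The function names (`restrict`, `count`, `valid_splits`) and the data are hypothetical and chosen only for this example; indices are 0-based here, whereas the text uses 1-based indices.

```python
# Toy binary dataset: 4 samples, 3 features (hypothetical values).
X = [[0, 1, 1],
     [1, 1, 0],
     [1, 1, 1],
     [0, 1, 0]]
Y = [0, 1, 1, 0]


def restrict(X, I, f, k):
    """I|_{f=k}: the indices in I whose feature f takes value k."""
    return {i for i in I if X[i][f] == k}


def count(Y, I, k):
    """c^k(I): the number of samples in I with label k."""
    return sum(1 for i in I if Y[i] == k)


def valid_splits(X, I):
    """V(I): features whose split leaves both children nonempty."""
    num_features = len(X[0])
    return {f for f in range(num_features)
            if restrict(X, I, f, 0) and restrict(X, I, f, 1)}


I = {0, 1, 2, 3}
print(restrict(X, I, 0, 1))   # samples with feature 0 equal to 1
print(count(Y, I, 1))         # number of positive labels
print(valid_splits(X, I))     # feature 1 is constant, so it is excluded
```

Feature 1 is constant across all samples in this toy dataset, so it never yields a nontrivial split.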

Tree notation: We let $T=\{n_{1},n_{2},\dots,n_{M+L}\}$ be a binary classification tree represented as a collection of its nodes and use $n$ to refer to a node in $T$, $m$ to refer to one of the $M$ internal nodes in $T$, and $l$ to refer to one of the $L$ leaf nodes in $T$. Furthermore, we use $\mathcal{I}(n)$ to denote the indices of the samples in $\mathcal{X}$ that reach node $n$ in $T$, namely $\{i:x_{i}\in\text{space}(n)\}$, where $\text{space}(n)$ is the subset of feature space that reaches node $n$ in $T$. We also use $c^{k}_{l}$ to denote the count of points assigned to leaf $l$ with label $k\in\{0,1\}$ (i.e., $c^{k}_{l}=c^{k}(\mathcal{I}(l))$), $T_{\text{internal}}=\{m_{1},m_{2},\dots,m_{M}\}\subset T$ to denote the set of internal nodes in tree $T$, and $T_{\text{leaves}}=\{l_{1},l_{2},\dots,l_{L}\}\subset T$ to denote the set of all leaf nodes in tree $T$. Finally, we use $d(n)$ to denote the depth of node $n$ in $T$.

3.1 AND/OR Graph Search

We briefly recapitulate the concept of AND/OR graphs and a search algorithm for AND/OR graphs, AO*. AND/OR graph search can be viewed as a generalization of the shortest path problem that allows nodes consisting of independent subproblems to be decomposed and solved separately. Thus, a solution of an AND/OR graph is not a path but rather a subgraph $\mathcal{S}$ with cost, denoted cost $(\mathcal{S})$ , equal to the sum across the costs of its edges. AND/OR graphs contain two types of nodes: terminal nodes and nonterminal nodes. Nonterminal nodes can be further subdivided into AND nodes and OR nodes, with a special OR node designated as the root or start node $r$ . For a given AND/OR graph $\mathcal{G}$ , a solution graph $\mathcal{S}$ on an AND/OR graph is a connected subset of nodes of $\mathcal{G}$ in which:

1. $r\in\mathcal{S}$,
2. for every AND node $a\in\mathcal{S}$, all the immediate children of $a$ are also in $\mathcal{S}$, and
3. for every non-terminal OR node $o\in\mathcal{S}$, exactly one of $o$’s children is also in $\mathcal{S}$.

Intuitively, the children of an AND node $a$ represent subtasks that must all be solved for $a$ to be satisfied (e.g., simultaneous prerequisites), and the children of an OR node $o$ represent mutually exclusive satisfying choices.

Figure 1: An example (general) AND/OR graph, with AND nodes drawn as squares, OR nodes drawn as solid circles, and terminal nodes drawn as dashed circles. The minimal cost solution is highlighted in red and has cost $0+0+3+4+1+2=10$. This diagram demonstrates an AND/OR graph where the root node $r$ is an AND node; in MAPTree, the root node is an OR node.

One of the most popular AND/OR graph search algorithms is AO* (Mahanti and Bagchi 1985, 1983). The AO* algorithm explores potential paths in an AND/OR graph in a best-first fashion, guided by a heuristic. When a new node is explored, its children are revealed and the cost for that node and all of its ancestors is updated; the search then continues. This process is repeated until the root node is marked as solved, indicating that no immediately accessible nodes could lead to an increase in heuristic value. The AO* algorithm is guaranteed to find the minimal cost solution if the heuristic is admissible, i.e., the heuristic estimate of cost is always less than or equal to the actual cost of a node. For more details on the AO* algorithm, we refer the reader to Mahanti and Bagchi (1985). An example AND/OR graph is given in Figure 1 with its minimal cost solution shown in red.
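To make the notion of a minimal cost solution graph concrete, the following sketch evaluates a small, hypothetical tree-shaped AND/OR graph by plain recursion: an AND node pays for all of its children, an OR node pays only for its cheapest child. This is an exhaustive evaluation, not AO* (there is no heuristic and no best-first expansion), and the graph encoding and node names are assumptions made for this example.

```python
from typing import Dict, List, Tuple

# node -> (kind, [(child, edge_cost), ...]); kind is "AND", "OR", or "TERMINAL".
Graph = Dict[str, Tuple[str, List[Tuple[str, float]]]]


def min_solution_cost(graph: Graph, node: str) -> float:
    """Cost of the cheapest solution subgraph rooted at `node`.

    Assumes the graph is a finite tree, so that no edge is shared
    between two branches of the same solution.
    """
    kind, edges = graph[node]
    if kind == "TERMINAL":
        return 0.0
    child_costs = [w + min_solution_cost(graph, child) for child, w in edges]
    # AND: every child must be solved. OR: exactly one child is chosen.
    return sum(child_costs) if kind == "AND" else min(child_costs)


graph: Graph = {
    "r":  ("OR",  [("a", 1.0), ("t1", 6.0)]),
    "a":  ("AND", [("o1", 0.0), ("o2", 0.0)]),
    "o1": ("OR",  [("t2", 2.0)]),
    "o2": ("OR",  [("t3", 3.0), ("t4", 1.5)]),
    "t1": ("TERMINAL", []),
    "t2": ("TERMINAL", []),
    "t3": ("TERMINAL", []),
    "t4": ("TERMINAL", []),
}

print(min_solution_cost(graph, "r"))
```

Here the root prefers the AND branch (cost 1.0 + 2.0 + 1.5) over the direct terminal edge (cost 6.0).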

Additional AND/OR graph notation: In addition to the notation defined above, we use $t$ to refer to a terminal node. When searching over an AND/OR graph, we use $\mathcal{G}$ to refer to the implicit (entire) AND/OR graph and $\mathcal{G}^{\prime}\subset\mathcal{G}$ to refer to the explicit (explored) AND/OR graph, as in prior work.

3.2 Bayesian Classification and Regression Trees (BCART)

Bayesian Decision Trees are a family of statistical models of decision trees introduced in Chipman, George, and McCulloch (1998) and Denison, Mallick, and Smith (1998). A Bayesian Decision Tree (BDT) is a pair $(T,\Theta)$ where $T$ is a tree and $\Theta=(\theta_{l_{1}},\theta_{l_{2}},\dots,\theta_{l_{L}})$ parameterizes the independent probability distributions over labels in the leaf nodes of tree $T$ . We are interested in the binary classification setting, where each $\theta_{l}$ parameterizes a Bernoulli distribution $\text{Ber}(\theta_{l})$ with $\theta_{l}\in[0,1]$ . We denote by $\text{Beta}(\rho^{1},\rho^{0})$ the Beta distribution with parameters $\rho^{1},\rho^{0}\in\mathbb{R}^{+}$ and by $B(c^{1},c^{0})$ the Beta function.

We note that a BDT’s tree $T$ partitions the data such that the sample subsets $\mathcal{I}(l_{1}),\mathcal{I}(l_{2}),\dots,\mathcal{I}(l_{L})$ fall into leaves $l_{1},l_{2},\dots,l_{L}$. Furthermore, a BDT defines a probability distribution over the respective labels occurring in its leaves: each label in leaf $l$ is sampled from $\text{Ber}(\theta_{l})$. Every BDT therefore induces a likelihood function, given in Theorem 1.

Theorem 1.

The likelihood of a BDT $(T,\Theta)$ generating labels $\mathcal{Y}$ given features $\mathcal{X}$ is

$$P(\mathcal{Y}|\mathcal{X},T,\Theta)=\prod_{l\in T_{\text{leaves}}}\prod_{i\in\mathcal{I}(l)}\theta_{l}^{y_{i}}\left(1-\theta_{l}\right)^{1-y_{i}}\tag{1}$$
$$=\prod_{l\in T_{\text{leaves}}}\theta_{l}^{c^{1}_{l}}\left(1-\theta_{l}\right)^{c^{0}_{l}}\tag{2}$$

The specific formulation of BCART also assumes a prior distribution over $\Theta$ , i.e., that $\theta\sim\text{Beta}(\rho^{1},\rho^{0})$ for each $\theta\in\Theta$ . With this assumption, we can derive the likelihood function $P(\mathcal{Y}|\mathcal{X},T)$ ; see Theorem 2.

Theorem 2.

Assume that $\theta\sim\text{Beta}(\rho^{1},\rho^{0})$ for each $\theta\in\Theta$. Then the likelihood of a tree $T$ generating labels $\mathcal{Y}$ given features $\mathcal{X}$ is

$$P(\mathcal{Y}|\mathcal{X},T)=\prod_{l\in T_{\text{leaves}}}\frac{B(c^{1}_{l}+\rho^{1},c^{0}_{l}+\rho^{0})}{B(\rho^{1},\rho^{0})}\tag{3}$$

Theorems 1 and 2 are proven in the appendices; we note they have been observed in different forms in prior work (Chipman, George, and McCulloch 1998).
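As a brief sketch of why Theorem 2 follows from Theorem 1 (this is the standard Beta-Bernoulli conjugacy argument, not a reproduction of the appendix proof): since the leaf parameters are independent under the prior, integrating the likelihood in Theorem 1 over $\Theta$ factorizes across leaves, and the factor for a leaf $l$ is

$$\int_{0}^{1}\theta_{l}^{c^{1}_{l}}\left(1-\theta_{l}\right)^{c^{0}_{l}}\,\frac{\theta_{l}^{\rho^{1}-1}\left(1-\theta_{l}\right)^{\rho^{0}-1}}{B(\rho^{1},\rho^{0})}\,d\theta_{l}=\frac{B(c^{1}_{l}+\rho^{1},c^{0}_{l}+\rho^{0})}{B(\rho^{1},\rho^{0})},$$

where the equality uses the integral definition of the Beta function, $B(a,b)=\int_{0}^{1}\theta^{a-1}(1-\theta)^{b-1}\,d\theta$. Taking the product over all leaves yields the expression in Theorem 2.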

For notational convenience, we define a leaf count likelihood function $\ell_{\text{leaf}}(c^{1},c^{0})$ for integers $c^{1}$ and $c^{0}$ :

$$\ell_{\text{leaf}}(c^{1},c^{0})\coloneqq\frac{B(c^{1}+\rho^{1},c^{0}+\rho^{0})}{B(\rho^{1},\rho^{0})}\tag{4}$$

and we can rewrite Equation 3 as

$$P(\mathcal{Y}|\mathcal{X},T)=\prod_{l\in T_{\text{leaves}}}\ell_{\text{leaf}}(c^{1}_{l},c^{0}_{l})\tag{5}$$
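The leaf count likelihood is convenient to evaluate in log space. The sketch below is one way to do so using the log-gamma function from the Python standard library; the function names are hypothetical, and the default $\rho^{1}=\rho^{0}=1$ (a uniform prior on each $\theta_{l}$) is an assumption made for illustration, not a value taken from the text.

```python
import math


def log_beta(a: float, b: float) -> float:
    """log B(a, b), computed via log-gamma for numerical stability."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def log_leaf(c1: int, c0: int, rho1: float = 1.0, rho0: float = 1.0) -> float:
    """log of the leaf count likelihood in Equation 4."""
    return log_beta(c1 + rho1, c0 + rho0) - log_beta(rho1, rho0)


def log_likelihood(leaf_counts, rho1: float = 1.0, rho0: float = 1.0) -> float:
    """log P(Y | X, T) as in Equation 5, given (c1, c0) for each leaf of T."""
    return sum(log_leaf(c1, c0, rho1, rho0) for c1, c0 in leaf_counts)


# Two pure leaves versus one mixed leaf holding the same four labels.
print(log_likelihood([(2, 0), (0, 2)]))
print(log_likelihood([(2, 2)]))
```

Under these assumed prior parameters, the tree with two pure leaves receives a higher likelihood than the single mixed leaf; the prior over trees in Definition 3 is what penalizes the additional split.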

In this work, we utilize the original prior over trees from (Chipman, George, and McCulloch 1998), given in Definition 3.

Definition 3.

The original BCART prior distribution over trees is

$$P(T|\mathcal{X})=\left(\prod_{l\in T_{\text{leaves}}}p_{\text{leaf}}(d(l),\mathcal{I}(l))\right)\times\left(\prod_{m\in T_{\text{internal}}}p_{\text{inner}}(d(m),\mathcal{I}(m))\right)$$

where

$$p_{\text{leaf}}(d,\mathcal{I})=\begin{cases}1,&\mathcal{V}(\mathcal{I})=\emptyset\\ 1-p_{\text{split}}(d),&\mathcal{V}(\mathcal{I})\neq\emptyset\end{cases}\tag{6}$$

$$p_{\text{inner}}(d,\mathcal{I})=\begin{cases}0,&\mathcal{V}(\mathcal{I})=\emptyset\\ p_{\text{split}}(d)/|\mathcal{V}(\mathcal{I})|,&\mathcal{V}(\mathcal{I})\neq\emptyset\end{cases}\tag{7}$$

and

$$p_{\text{split}}(d)=\alpha(1+d)^{-\beta}\tag{8}$$

Intuitively, $p_{\text{split}}(d)$ is the prior probability of any node splitting and is allocated equally amongst valid splits. This choice of prior, $P(T|\mathcal{X})$, combined with the likelihood function in Equation 5 induces the posterior distribution over trees $P(T|\mathcal{Y},\mathcal{X})$:

$$P(T|\mathcal{Y},\mathcal{X})\propto P(\mathcal{Y}|\mathcal{X},T)P(T|\mathcal{X})\tag{9}$$

Throughout our analysis, we treat the dataset $(\mathcal{X},\mathcal{Y})$ as fixed.
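The prior in Definition 3 can be sketched in a few lines. In the code below, the function names and the representation of a tree as lists of `(depth, number_of_valid_splits)` pairs are hypothetical choices for this example; the defaults $\alpha=0.95$ and $\beta=0.5$ are the values used in the experiments of Section 6.

```python
import math


def p_split(d: int, alpha: float = 0.95, beta: float = 0.5) -> float:
    """Prior probability that a node at depth d splits (Equation 8)."""
    return alpha * (1 + d) ** (-beta)


def p_leaf(d: int, num_valid: int) -> float:
    """Prior probability that a node at depth d is a leaf (Equation 6)."""
    return 1.0 if num_valid == 0 else 1.0 - p_split(d)


def p_inner(d: int, num_valid: int) -> float:
    """Prior probability of one specific valid split at depth d (Equation 7)."""
    return 0.0 if num_valid == 0 else p_split(d) / num_valid


def log_prior(internal, leaves) -> float:
    """log P(T | X) for a tree given as (depth, |V(I)|) pairs per node."""
    return (sum(math.log(p_inner(d, v)) for d, v in internal)
            + sum(math.log(p_leaf(d, v)) for d, v in leaves))


# A stump: the root has 3 valid splits; its leaves have 2 and 0 valid splits.
print(log_prior(internal=[(0, 3)], leaves=[(1, 2), (1, 0)]))
```

When a node has at least one valid split, the probability of being a leaf plus the probabilities of all valid splits sums to one, which is why the prior is a proper distribution over the local choices at each node.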

4 Connecting BCART with AND/OR Graphs

Given a dataset $(\mathcal{X},\mathcal{Y})$ , we will now construct a special AND/OR graph $\mathcal{G}_{\mathcal{X},\mathcal{Y}}$ . We will then show that a minimal cost solution graph on $\mathcal{G}_{\mathcal{X},\mathcal{Y}}$ corresponds directly with the maximum a posteriori tree given our choice of prior distributions $P(T|\mathcal{X})$ and $P(\Theta)$ . Using this construction, the problem of finding the maximum a posteriori tree of our posterior is reduced to that of finding the minimum cost solution graph on $\mathcal{G}_{\mathcal{X},\mathcal{Y}}$ .

Definition 4 (BCART AND/OR graph $\mathcal{G}_{\mathcal{X},\mathcal{Y}}$ ).

Given a dataset $(\mathcal{X},\mathcal{Y})$ , construct the AND/OR graph $\mathcal{G}_{\mathcal{X},\mathcal{Y}}$ as follows:

1. For every possible subset $\mathcal{I}\subset[N]$ and depth $d\in\{0,\dots,F\}$, create an OR node $o_{\mathcal{I},d}$.
2. For every OR node $o_{\mathcal{I},d}$ created in Step 1, create a terminal node $t_{\mathcal{I},d}$ and draw an edge from $o_{\mathcal{I},d}$ to $t_{\mathcal{I},d}$ with cost $\texttt{cost}(o_{\mathcal{I},d},t_{\mathcal{I},d})=-\log p_{\text{leaf}}(d,\mathcal{I})-\log\ell_{\text{leaf}}(c^{1}(\mathcal{I}),c^{0}(\mathcal{I}))$.
3. For every OR node $o_{\mathcal{I},d}$ created in Step 1, create $F$ AND nodes $a_{\mathcal{I},d,1},\ldots,a_{\mathcal{I},d,F}$ and draw an edge from $o_{\mathcal{I},d}$ to each $a_{\mathcal{I},d,f}$ with cost $\texttt{cost}(o_{\mathcal{I},d},a_{\mathcal{I},d,f})=-\log p_{\text{inner}}(d,\mathcal{I})$.
4. For every pair $a_{\mathcal{I},d,f}$ and $o_{\mathcal{I}^{\prime},d+1}$ where $\mathcal{I}|_{f=k}=\mathcal{I}^{\prime}$ for some $f\in[F]$ and $k\in\{0,1\}$, draw an edge from $a_{\mathcal{I},d,f}$ to $o_{\mathcal{I}^{\prime},d+1}$ with cost $\texttt{cost}(a_{\mathcal{I},d,f},o_{\mathcal{I}^{\prime},d+1})=0$.
5. Let $o_{[N],0}$, the OR node representing all sample indices, be the unique root node $r$ of $\mathcal{G}_{\mathcal{X},\mathcal{Y}}$.
6. Remove all OR nodes representing empty subsets and their neighbors.
7. Remove all nodes not connected to the root node $r$.

We note that $\mathcal{G}_{\mathcal{X},\mathcal{Y}}$ contains $F\times 2^{N}$ OR nodes, $F\times 2^{N}$ terminal nodes (one for each OR node), and $F^{2}\times 2^{N}$ AND nodes ($F$ for each OR node) and so is finite.

Intuitively, each OR node $o_{\mathcal{I},d}$ in $\mathcal{G}_{\mathcal{X},\mathcal{Y}}$ corresponds with the subproblem of discovering a maximum a posteriori subtree starting from depth $d$ and over the subset of samples $\mathcal{I}$ from dataset $(\mathcal{X},\mathcal{Y})$. Each AND node $a_{\mathcal{I},d,f}$ then represents the same subproblem but given that a decision was already made to split on feature $f$ at the root node of this subtree. A valid solution graph on $\mathcal{G}_{\mathcal{X},\mathcal{Y}}$ corresponds with a binary classification tree $T$ on the dataset $(\mathcal{X},\mathcal{Y})$, and the cost of a solution is related to the posterior probability of $T$ given by $P(T|\mathcal{Y},\mathcal{X})$. We formalize these properties in Theorems 5 and 6.

Theorem 5.

Every solution graph on $\mathcal{G}_{\mathcal{X},\mathcal{Y}}$ induces a unique binary decision tree. Furthermore, every decision tree can be represented as a unique solution graph under this correspondence. Thus, there is a natural bijection between solution graphs on $\mathcal{G}_{\mathcal{X},\mathcal{Y}}$ and binary decision trees.

Theorem 6.

Under the natural bijection described in Theorem 5, given a solution graph $\mathcal{S}$ and its corresponding tree $T$ , we have that $\texttt{cost}(\mathcal{S})=-\log P(T,\mathcal{Y}|\mathcal{X})$ . Therefore the minimal cost solution over $\mathcal{G}_{\mathcal{X},\mathcal{Y}}$ corresponds with a maximum a posteriori tree.

The bijection constructed in Theorems 5 and 6 is depicted in Figure 3. Due to space constraints, we defer a formal description of this bijection to Appendix A.
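The reduction can be checked by brute force on a tiny dataset. The sketch below computes the minimal solution cost on $\mathcal{G}_{\mathcal{X},\mathcal{Y}}$ by memoized recursion over OR nodes $(\mathcal{I},d)$, using the edge costs from Definition 4. It is an exhaustive dynamic program rather than the best-first search of MAPTree, and the toy data, function names, and the choice $\rho^{1}=\rho^{0}=1$ are assumptions made for this example.

```python
import math
from functools import lru_cache

ALPHA, BETA = 0.95, 0.5  # prior hyperparameters (values used in Section 6)
RHO1, RHO0 = 1.0, 1.0    # assumed Beta prior parameters

# Hypothetical toy dataset: 4 samples, 2 binary features; label copies feature 0.
X = ((0, 0), (0, 1), (1, 0), (1, 1))
Y = (0, 0, 1, 1)
F = len(X[0])


def log_beta(a, b):
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def log_leaf(c1, c0):
    return log_beta(c1 + RHO1, c0 + RHO0) - log_beta(RHO1, RHO0)


def p_split(d):
    return ALPHA * (1 + d) ** (-BETA)


def leaf_cost(I, d, num_valid):
    """Cost of the edge from OR node o_{I,d} to its terminal node."""
    c1 = sum(Y[i] for i in I)
    c0 = len(I) - c1
    log_p_leaf = 0.0 if num_valid == 0 else math.log(1.0 - p_split(d))
    return -log_p_leaf - log_leaf(c1, c0)


@lru_cache(maxsize=None)
def min_cost(I, d):
    """Minimal cost of a solution subgraph rooted at OR node o_{I,d}."""
    # Only valid splits survive Step 6 of Definition 4 (no empty children).
    valid = [f for f in range(F) if 0 < sum(X[i][f] for i in I) < len(I)]
    best = leaf_cost(I, d, len(valid))
    for f in valid:
        left = frozenset(i for i in I if X[i][f] == 0)
        right = I - left
        edge = -math.log(p_split(d) / len(valid))  # -log p_inner(d, I)
        best = min(best, edge + min_cost(left, d + 1) + min_cost(right, d + 1))
    return best


root = frozenset(range(len(X)))
map_cost = min_cost(root, 0)
print(map_cost, leaf_cost(root, 0, 2))
```

By Theorem 6, `map_cost` equals $-\log P(T,\mathcal{Y}|\mathcal{X})$ for a maximum a posteriori tree $T$ on this toy dataset. The recursion visits every reachable OR node, which is exactly the exponential enumeration that a heuristic-guided search aims to avoid.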

5 MAPTree

Algorithm 1 MAPTree

Input: Root OR Node $r$, cost function cost, and heuristic function $h$ for AND/OR graph $\mathcal{G}$
Output: Solution graph $\mathcal{S}$

$\mathcal{G}^{\prime}:=\{r\}$
$\mathcal{E}:=\emptyset$
$LB[r]:=h(r)$
$UB[r]:=\infty$
while $LB[r]<UB[r]$ and time remaining do
    $o:=$ findNodeToExpand($r$, cost, $\mathcal{E}$, $LB$, $UB$)
    Let $t$ be the terminal node child of $o$
    $\mathcal{E}:=\mathcal{E}\cup\{o\}$
    $\mathcal{G}^{\prime}:=\mathcal{G}^{\prime}\cup\{t\}$
    Let $\{a_{1},\dots,a_{F}\}$ be the AND node children of $o$
    for all $a_{f}\in\{a_{1},\dots,a_{F}\}$ do
        Let $\{o_{f=0},o_{f=1}\}$ be the OR node children of $a_{f}$
        $LB[o_{f=0}]:=h(o_{f=0})$
        $LB[o_{f=1}]:=h(o_{f=1})$
        $v^{(lb)}_{f=0}:=\texttt{cost}(a_{f},o_{f=0})+h(o_{f=0})$
        $v^{(lb)}_{f=1}:=\texttt{cost}(a_{f},o_{f=1})+h(o_{f=1})$
        $LB[a_{f}]:=v^{(lb)}_{f=0}+v^{(lb)}_{f=1}$
        $\mathcal{G}^{\prime}:=\mathcal{G}^{\prime}\cup\{a_{f},o_{f=0},o_{f=1}\}$
    end for
    updateLowerBounds($o$, cost, $LB$)
    updateUpperBounds($o$, cost, $UB$)
end while
return getSolution($r$, cost, $UB$)

Algorithm 2 getSolution

Input: OR node $o$, cost function cost, and upper bounds $UB$
Output: Solution graph $\mathcal{S}$

Let $\{a_{1},\dots,a_{F}\}$ be the AND node children of $o$
Let $t$ be the terminal node child of $o$
$a_{f^{*}}:=\arg\min_{c\in\{a_{1},\dots,a_{F}\}}\left(\texttt{cost}(o,c)+UB[c]\right)$
$\mathcal{S}:=\{o\}$
if $\texttt{cost}(o,t)+UB[t]\leq\texttt{cost}(o,a_{f^{*}})+UB[a_{f^{*}}]$ then
    $\mathcal{S}:=\mathcal{S}\cup\{t\}$
else
    Let $o_{f^{*}=0},o_{f^{*}=1}$ be the children of $a_{f^{*}}$
    $\mathcal{S}:=\mathcal{S}\cup\{a_{f^{*}}\}$
    $\mathcal{S}_{0}:=$ getSolution($o_{f^{*}=0}$, cost, $UB$)
    $\mathcal{S}_{1}:=$ getSolution($o_{f^{*}=1}$, cost, $UB$)
    $\mathcal{S}:=\mathcal{S}\cup\mathcal{S}_{0}\cup\mathcal{S}_{1}$
end if
return $\mathcal{S}$

Algorithm 3 findNodeToExpand

Input: Root node $r$, cost function cost, set of expanded nodes $\mathcal{E}$, lower bounds $LB$, and upper bounds $UB$
Output: OR node $o$

$o:=r$
while $o\in\mathcal{E}$ do
    Let $\{a_{1},\dots,a_{F}\}$ be the AND node children of $o$
    $a^{*}:=\arg\min_{c\in\{a_{1},\dots,a_{F}\}}\left(\texttt{cost}(o,c)+LB[c]\right)$
    Let $o^{*}_{0},o^{*}_{1}$ be the children of $a^{*}$
    if $UB[o^{*}_{0}]-LB[o^{*}_{0}]>UB[o^{*}_{1}]-LB[o^{*}_{1}]$ then
        $o:=o^{*}_{0}$
    else
        $o:=o^{*}_{1}$
    end if
end while
return $o$

Algorithm 4 updateLowerBounds

Input: OR node $l$, cost function cost, lower bounds $LB$

$\mathcal{V}:=\{l\}$
while $|\mathcal{V}|>0$ do
    Remove a node $o$ from $\mathcal{V}$ with maximal depth
    Let $\{a_{1},\dots,a_{F}\}$ be the AND node children of $o$
    Let $t$ be the terminal node child of $o$
    $v^{(lb)}_{\text{split}}:=\min_{c\in\{a_{1},\dots,a_{F}\}}\left(\texttt{cost}(o,c)+LB[c]\right)$
    $v^{(lb)}:=\min\{v^{(lb)}_{\text{split}},\texttt{cost}(o,t)\}$
    if $v^{(lb)}>LB[o]$ then
        $LB[o]:=v^{(lb)}$
        Add all parents of $o$ to $\mathcal{V}$
    end if
end while

Algorithm 5 updateUpperBounds

Input: OR node $l$, cost function cost, upper bounds $UB$

$\mathcal{V}:=\{l\}$
while $|\mathcal{V}|>0$ do
    Remove a node $o$ from $\mathcal{V}$ with maximal depth
    Let $\{a_{1},\dots,a_{F}\}$ be the AND node children of $o$
    Let $t$ be the terminal node child of $o$
    $v^{(ub)}_{\text{split}}:=\min_{c\in\{a_{1},\dots,a_{F}\}}\left(\texttt{cost}(o,c)+UB[c]\right)$
    $v^{(ub)}:=\min\{v^{(ub)}_{\text{split}},\texttt{cost}(o,t)\}$
    if $v^{(ub)}<UB[o]$ then
        $UB[o]:=v^{(ub)}$
        Add all parents of $o$ to $\mathcal{V}$
    end if
end while

Theorems 5 and 6 imply that it is sufficient to find the minimum cost solution graph on $\mathcal{G}_{\mathcal{X},\mathcal{Y}}$ to recover the MAP tree under the BCART posterior. In this section, we introduce MAPTree, an AND/OR search algorithm that finds a minimal cost solution on $\mathcal{G}_{\mathcal{X},\mathcal{Y}}$ . MAPTree is shown in Algorithm 1.

A key component of MAPTree is the Perfect Split Heuristic $h$ that guides the search, presented in Definition 7.

Definition 7 (Perfect Split Heuristic).

For OR node $o_{\mathcal{I},d}$ with terminal node child $t_{\mathcal{I},d}$ , let

$$h(o_{\mathcal{I},d})=-\max\left\{\log\ell_{\text{leaf}}(c^{1}(\mathcal{I}),c^{0}(\mathcal{I})),\ \log p_{\text{split}}(d,\mathcal{I})+\log\ell_{\text{leaf}}(c^{1}(\mathcal{I}),0)+\log\ell_{\text{leaf}}(0,c^{0}(\mathcal{I}))\right\}\tag{10--14}$$

and for AND node $a_{\mathcal{I},d,f}$ with OR node children $o_{\mathcal{I}|_{f=0},d+1}$ and $o_{\mathcal{I}|_{f=1},d+1}$ , let

$$h(a_{\mathcal{I},d,f})=h(o_{\mathcal{I}|_{f=0},d+1})+h(o_{\mathcal{I}|_{f=1},d+1})\tag{15}$$

Intuitively, the Perfect Split Heuristic describes the negative log posterior probability of the best potential subtree rooted at the given OR node $o_{\mathcal{I},d}$ : one that perfectly classifies the data in a single additional split. The heuristic guides the search away from subproblems that are too deep or for which the labels have already been poorly divided. We prove that this heuristic is a lower bound (admissible) and consistent in later sections.
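A minimal sketch of the Perfect Split Heuristic for an OR node follows. It takes the label counts and depth of the node as inputs. Two assumptions are made for illustration: $\rho^{1}=\rho^{0}=1$, and the term $p_{\text{split}}(d,\mathcal{I})$ is evaluated as $p_{\text{split}}(d)$ from Equation 8, ignoring any dependence on $\mathcal{I}$. The function names are hypothetical.

```python
import math


def log_beta(a, b):
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def log_leaf(c1, c0, rho1=1.0, rho0=1.0):
    """log of the leaf count likelihood (Equation 4), assumed rho values."""
    return log_beta(c1 + rho1, c0 + rho0) - log_beta(rho1, rho0)


def p_split(d, alpha=0.95, beta=0.5):
    return alpha * (1 + d) ** (-beta)


def perfect_split_heuristic(c1, c0, d):
    """h(o_{I,d}) from Definition 7, given label counts c1, c0 and depth d."""
    # Option 1: stop here and keep the node as a single leaf.
    as_leaf = log_leaf(c1, c0)
    # Option 2: one further split that separates the two labels perfectly.
    perfect = math.log(p_split(d)) + log_leaf(c1, 0) + log_leaf(0, c0)
    return -max(as_leaf, perfect)


print(perfect_split_heuristic(2, 2, 0))  # mixed node: the perfect split wins
print(perfect_split_heuristic(0, 4, 1))  # pure node: staying a leaf wins
```

For a pure node the heuristic reduces to the negative log likelihood of a single leaf, since no split can improve on it; for a mixed node the optimistic single-split term is typically the smaller cost.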

5.1 Analysis of MAPTree

We now introduce several key properties of MAPTree. In particular, we show that (1) the Perfect Split Heuristic is consistent and therefore also admissible, (2) MAPTree finds the maximum a posteriori tree of the BCART posterior upon completion, and (3) upon early termination, MAPTree returns the minimum cost solution within the explored explicit graph $\mathcal{G}^{\prime}$. Theorems 8, 9, 10, and 12 and Corollary 11 are proven in Appendix A.

Theorem 8 (Consistency of the Perfect Split Heuristic).

The Perfect Split Heuristic in Definition 7 is consistent, i.e., for any OR node $o$ with children $\{t,a_{1},\dots,a_{F}\}$ :

$$h(o)\leq\min_{c\in\{t,a_{1},\dots,a_{F}\}}\left(\texttt{cost}(o,c)+h(c)\right)\tag{16}$$

and for any AND node $a$ with children $\{o_{0},o_{1}\}$ :

$$h(a)\leq\sum_{c\in\{o_{0},o_{1}\}}\left(\texttt{cost}(a,c)+h(c)\right)\tag{17}$$

Theorem 9 (Finiteness of MAPTree).

Algorithm 1 always terminates.

Theorem 10 (Correctness of MAPTree).

When Algorithm 1 does not terminate early due to the time remaining condition, it always outputs a minimal cost solution on $\mathcal{G}_{\mathcal{X},\mathcal{Y}}$ upon completion.

Corollary 11.

Consider the tree induced by the output of Algorithm 1 under the natural bijection described in Section 4. By Theorems 5 and 6, this tree is the maximum a posteriori tree $\arg\max_{T}P(T|\mathcal{X},\mathcal{Y})$ .

Theorem 12 (Anytime optimality of MAPTree).

Upon early termination, Algorithm 1 outputs the minimal cost solution across the explicit subgraph $\mathcal{G}^{\prime}$ of already explored nodes.

6 Experiments

We evaluate the performance of MAPTree in multiple settings. In all experiments in this section, we set $\alpha=0.95$ and $\beta=0.5$. We find that our results are not highly dependent on the choices of $\alpha$ and $\beta$; see Appendix B.

In the first setting, we compare the efficiency of MAPTree to the Sequential Monte Carlo (SMC) and Markov-Chain Monte Carlo (MCMC) baselines from Lakshminarayanan, Roy, and Teh (2013) and Chipman, George, and McCulloch (1998), respectively. In the second setting, we create a synthetic dataset in which the true labels are generated by a randomly generated tree and measure generalization performance with respect to training dataset size. In the third setting, we measure the generalization accuracy, log likelihood, and tree size of models generated by MAPTree and baseline algorithms across all 16 datasets from the CP4IM dataset repository (Guns, Nijssen, and De Raedt 2011).

6.1 Speed Comparisons against MCMC and SMC

We first compare the performance of MAPTree with the SMC and MCMC baselines from Lakshminarayanan, Roy, and Teh (2013) and Chipman, George, and McCulloch (1998), respectively, on all 16 binary classification datasets from the CP4IM dataset repository (Guns, Nijssen, and De Raedt 2011). We note that all three methods, given infinite exploration time, should recover the maximum a posteriori tree from the BCART posterior. However, it has been observed that the mixing times for Markov-Chain-based methods, such as the MCMC and SMC baselines, are exponential in the depth of the data-generating tree (Kim and Rockova 2023). Furthermore, the SMC and MCMC methods are unable to determine when they have converged, nor can they provide a certificate of optimality upon convergence.

In our experiments, we modify the hyperparameters of each algorithm and measure the training time and log posterior of the data under the output tree (Figure 4). In 12 of the 16 datasets in Figure 4, MAPTree outperforms SMC and MCMC and is able to find trees with higher log posterior faster than the baseline algorithms. Furthermore, in 5 of the 16 datasets, MAPTree converges to the provably optimal tree, i.e., the maximum a posteriori tree of the BCART posterior.

6.2 Fitting a Synthetic Dataset

We measure the generalization performance of MAPTree and various other baseline algorithms as a function of training dataset size on tree-generated data.

Synthetic Data: We construct a synthetic dataset where labels are generated by a randomly generated tree. We first construct a random binary tree structure as specified in Devroye and Kruszewski (1995) via recursive random divisions of the available internal nodes to the left or right subtree. Next, features are selected for each internal node uniformly at random such that no internal node splits on the same feature as its ancestors. Lastly, labels are assigned to the leaf nodes in alternating fashion so as to avoid compression of the underlying tree structure. Individual datapoints with 40 features are then sampled with each feature drawn i.i.d. from $\text{Ber}(1/2)$, and their labels are determined by following the generated tree to a leaf node. We repeat this process 20 times, generating 20 datasets for 20 random trees. We also randomly flip a fraction $\epsilon$ of the training data labels, with $\epsilon$ ranging from $0$ to $0.25$, to simulate label noise.
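The sampling and label-noise steps can be sketched as follows. This is a simplified illustration, not the experimental code: the ground-truth tree is a fixed, hypothetical depth-2 tree rather than one drawn by the procedure of Devroye and Kruszewski (1995), the function names are invented for the example, and label noise is implemented by flipping exactly `round(eps * n)` labels, which is one possible reading of flipping a fraction $\epsilon$ of the labels.

```python
import random

# Hypothetical fixed tree: internal nodes are (feature, left_subtree, right_subtree),
# leaves are integer labels, and no node reuses a feature of its ancestors.
TREE = (3, (7, 0, 1), (12, 0, 1))


def predict(tree, x):
    """Follow the tree from the root to a leaf and return the leaf label."""
    node = tree
    while isinstance(node, tuple):
        feature, left, right = node
        node = right if x[feature] == 1 else left
    return node


def sample_dataset(tree, n, num_features=40, eps=0.0, seed=0):
    """Sample n points with i.i.d. Ber(1/2) features and noisy tree labels."""
    rng = random.Random(seed)
    X = [[rng.randint(0, 1) for _ in range(num_features)] for _ in range(n)]
    y = [predict(tree, x) for x in X]
    # Flip the labels of a uniformly random subset of size round(eps * n).
    flipped = set(rng.sample(range(n), round(eps * n)))
    for i in flipped:
        y[i] = 1 - y[i]
    return X, y, flipped


X, y, flipped = sample_dataset(TREE, n=200, eps=0.25)
print(len(X), len(X[0]), len(flipped))
```

Returning the set of flipped indices makes it easy to separate clean and corrupted samples when inspecting a fitted tree.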

In our experiments, MAPTree generates trees which outperform both the greedy, top-down approaches and ODT methods in test accuracy for various training dataset sizes and values of label corruption proportion $\epsilon$ ; the results are presented in Figure 6 in Appendix B due to space constraints. We note that though some baseline algorithms demonstrate comparable performance at a single noise level, no baseline algorithm demonstrates test accuracy comparable to MAPTree across all noise levels. We also emphasize that MAPTree requires no hyperparameter tuning, whereas we experimented with various values of hyperparameters for the baseline algorithms in which performance was highly dependent on hyperparameter values (e.g., DL8.5 and GOSDT); see Appendix B.

6.3 Accuracy, Likelihood, and Size Comparisons on Real World Benchmarks

We also compare the accuracy, test log likelihood, and sizes of trees generated by MAPTree and baseline algorithms on 16 real world binary classification datasets from the CP4IM dataset repository (Guns, Nijssen, and De Raedt 2011) (Figure 5). Against all baseline algorithms, MAPTree either a) performs better in test accuracy or log likelihood, or b) performs comparably in test accuracy and log likelihood but produces smaller trees.

7 Discussion and Conclusions

We presented MAPTree, an algorithm which provably finds the maximum a posteriori tree of the BCART posterior for a given dataset. Our algorithm is inspired by best-first-search algorithms over AND/OR graphs and the observation that the search problem for trees can be framed as a search problem over an appropriately constructed AND/OR graph.

MAPTree outperforms thematically similar approaches such as SMC- and MCMC-based algorithms, finding higher log-posterior trees faster, and is able to determine when it has converged to the maximum a posteriori tree, unlike prior work. MAPTree also outperforms greedy, ODT, and ODST construction methods in test accuracy on the synthetic dataset constructed in Section 6. Furthermore, on many real world benchmark datasets, MAPTree either a) demonstrates better generalization performance, or b) demonstrates comparable generalization performance but with smaller trees.

A limitation of MAPTree is that it constructs a potentially large AND/OR graph, which consumes a significant amount of memory. We leave optimizations that may permit MAPTree to run on huge datasets to future work. Nonetheless, with the optimizations presented in Section 6, we find that MAPTree was performant enough to run on the CP4IM benchmark datasets used in evaluation of previous ODT benchmarks.

Acknowledgements

We would like to thank the anonymous reviewers and Area Chair for their reviews and helpful feedback.

M. T. was funded by a J.P. Morgan AI Fellowship, a Stanford Interdisciplinary Graduate Fellowship, a Stanford Data Science Scholarship, and an Oak Ridge Institute for Science and Engineering Fellowship.

References

Aglin, Nijssen, and Schaus (2020) Aglin, G.; Nijssen, S.; and Schaus, P. 2020. Learning Optimal Decision Trees Using Caching Branch-and-Bound Search. Proceedings of the AAAI Conference on Artificial Intelligence, 34(04): 3146–3153.
Bertsimas and Dunn (2017) Bertsimas, D.; and Dunn, J. 2017. Optimal classification trees. Machine Learning, 106.
Breiman (2001) Breiman, L. 2001. Random Forests. Machine Learning, 45(1): 5–32.
Breiman et al. (1984) Breiman, L.; Friedman, J. H.; Olshen, R. A.; and Stone, C. J. 1984. Classification and Regression Trees. 1 edition.
Chen and Guestrin (2016) Chen, T.; and Guestrin, C. 2016. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’16, 785–794. New York, NY, USA: Association for Computing Machinery. ISBN 978-1-4503-4232-2.
Chipman, George, and McCulloch (1998) Chipman, H. A.; George, E. I.; and McCulloch, R. E. 1998. Bayesian CART Model Search. Journal of the American Statistical Association, 93(443): 935–948.
Demirović et al. (2022) Demirović, E.; Lukina, A.; Hebrard, E.; Chan, J.; Bailey, J.; Leckie, C.; Ramamohanarao, K.; and Stuckey, P. J. 2022. MurTree: optimal decision trees via Dynamic programming and search. The Journal of Machine Learning Research, 23(1): 26:1169–26:1215.
Denison, Mallick, and Smith (1998) Denison, D. G. T.; Mallick, B. K.; and Smith, A. F. M. 1998. A Bayesian CART Algorithm. Biometrika, 85(3): 363–377.
Devroye and Kruszewski (1995) Devroye, L.; and Kruszewski, P. 1995. The Botanical Beauty of Random Binary Trees. In International Symposium Graph Drawing and Network Visualization.
Geels, Pratola, and Herbei (2022) Geels, V.; Pratola, M. T.; and Herbei, R. 2022. The Taxicab Sampler: MCMC for Discrete Spaces with Application to Tree Models. Journal of Statistical Computation and Simulation, 1–22.
Grinsztajn, Oyallon, and Varoquaux (2022) Grinsztajn, L.; Oyallon, E.; and Varoquaux, G. 2022. Why do Tree-Based Models Still Outperform Deep Learning on Tabular Data?
Guns, Nijssen, and De Raedt (2011) Guns, T.; Nijssen, S.; and De Raedt, L. 2011. Itemset mining: A constraint programming perspective. Artificial Intelligence, 175(12): 1951–1983.
Hu, Rudin, and Seltzer (2019) Hu, X.; Rudin, C.; and Seltzer, M. 2019. Optimal Sparse Decision Trees. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.
Hyafil and Rivest (1976) Hyafil, L.; and Rivest, R. L. 1976. Constructing optimal binary decision trees is NP-complete. Information Processing Letters, 5(1): 15–17.
Kim and Rockova (2023) Kim, J.; and Rockova, V. 2023. On Mixing Rates for Bayesian CART. ArXiv:2306.00126 [math, stat].
Kiossou et al. (2023) Kiossou, H.; Schaus, P.; Nijssen, S.; and Houndji, V. R. 2023. Time Constrained DL8.5 Using Limited Discrepancy Search. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2022, Grenoble, France, September 19–23, 2022, Proceedings, Part V, 443–459. Berlin, Heidelberg: Springer-Verlag. ISBN 978-3-031-26418-4.
Lakshminarayanan, Roy, and Teh (2013) Lakshminarayanan, B.; Roy, D. M.; and Teh, Y. W. 2013. Top-down particle filtering for Bayesian decision trees. In Proceedings of the 30th International Conference on Machine Learning - Volume 28, ICML’13, III–280–III–288. Atlanta, GA, USA: JMLR.org.
Lin et al. (2020) Lin, J.; Zhong, C.; Hu, D.; Rudin, C.; and Seltzer, M. 2020. Generalized and scalable optimal sparse decision trees. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of ICML’20, 6150–6160. JMLR.org.
Mahanti and Bagchi (1983) Mahanti, A.; and Bagchi, A. 1983. Admissible heuristic search in AND/OR graphs. Theoretical Computer Science, 24(2): 207–219.
Mahanti and Bagchi (1985) Mahanti, A.; and Bagchi, A. 1985. AND/OR graph heuristic search methods. Journal of the ACM, 32(1): 28–51.
Nijssen (2008) Nijssen, S. 2008. Bayes optimal classification for decision trees. In Proceedings of the 25th international conference on Machine learning, ICML ’08, 696–703. New York, NY, USA: Association for Computing Machinery. ISBN 978-1-60558-205-4.
Nijssen and Fromont (2007) Nijssen, S.; and Fromont, E. 2007. Mining optimal decision trees from itemset lattices. In Knowledge discovery and data mining.
Pratola (2016) Pratola, M. T. 2016. Efficient Metropolis–Hastings Proposal Mechanisms for Bayesian Regression Tree Models. Bayesian Analysis, 11(3): 885–911.
Quinlan (1986) Quinlan, J. R. 1986. Induction of Decision Trees. Machine Learning, 1(1): 81–106.
van der Linden, de Weerdt, and Demirović (2022) van der Linden, J.; de Weerdt, M.; and Demirović, E. 2022. Fair and optimal decision trees: A dynamic programming approach. In Koyejo, S.; Mohamed, S.; Agarwal, A.; Belgrave, D.; Cho, K.; and Oh, A., eds., Advances in neural information processing systems, volume 35, 38899–38911. Curran Associates, Inc.
Verhaeghe, Lecoutre, and Schaus (2018) Verhaeghe, H.; Lecoutre, C.; and Schaus, P. 2018. Compact-MDD: Efficiently Filtering (s)MDD Constraints with Reversible Sparse Bit-sets. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18, 1383–1389. International Joint Conferences on Artificial Intelligence Organization.
Verhaeghe et al. (2020) Verhaeghe, H.; Nijssen, S.; Pesant, G.; Quimper, C.-G.; and Schaus, P. 2020. Learning optimal decision trees using constraint programming. Constraints, 25(3): 226–250.
Verwer and Zhang (2019) Verwer, S.; and Zhang, Y. 2019. Learning optimal classification trees using a binary linear program formulation. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence and Thirty-First Innovative Applications of Artificial Intelligence Conference and Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, AAAI’19/IAAI’19/EAAI’19. Honolulu, Hawaii, USA: AAAI Press. ISBN 978-1-57735-809-1.

Appendix A Proofs of Theorems

In this section, we present the proofs of the theorems.

Proof of Theorem 1.

We note that this result was first presented in the original BCART paper for the more general classification setting (Chipman, George, and McCulloch 1998). We reproduce the proof below in the binary classification setting.

By the definition of a BDT $(T,\Theta)$, the tree $T$ partitions the data such that the sample subsets $\mathcal{I}(l_{1}),\mathcal{I}(l_{2}),\dots,\mathcal{I}(l_{L})$ fall into leaves $l_{1},l_{2},\dots,l_{L}$, and each leaf contains an independent probability distribution $\text{Ber}(\theta_{l})$ that governs the probability of a given label occurring in each respective leaf. Note that the probability of each label $y_{i}$ occurring is conditionally independent of the other elements of $\mathcal{Y}$ and the dataset $\mathcal{X}$ given its leaf $l$ (which is determined only from the tree structure $T$ and $x_{i}$) and the corresponding parameter $\theta_{l}$ (which is determined from the global parametrization $\Theta$ and $l$). Therefore

\begin{align}
P(\mathcal{Y}\mid\mathcal{X},T,\Theta) &= \prod_{j\in[N]}P(y_{j}\mid x_{j},T,\Theta) \tag{18}\\
&= \prod_{l\in T_{\text{leaves}}}\prod_{i\in\mathcal{I}(l)}P(y_{i}\mid\theta_{l}) \tag{19}\\
&= \prod_{l\in T_{\text{leaves}}}\prod_{i\in\mathcal{I}(l)}\theta_{l}^{y_{i}}(1-\theta_{l})^{1-y_{i}} \tag{20}\\
&= \prod_{l\in T_{\text{leaves}}}\theta_{l}^{c_{l}^{1}}(1-\theta_{l})^{c_{l}^{0}} \tag{21}
\end{align}

∎
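As a small numerical illustration of the last step, the following Python sketch (with hypothetical names and toy values) checks that the pointwise product of Bernoulli terms agrees with the per-leaf count form $\theta_{l}^{c_{l}^{1}}(1-\theta_{l})^{c_{l}^{0}}$.

```python
import math

def likelihood_pointwise(leaf_of, y, theta):
    """Product over datapoints of theta_l^{y_i} (1 - theta_l)^{1 - y_i}."""
    p = 1.0
    for leaf, yi in zip(leaf_of, y):
        p *= theta[leaf] if yi == 1 else 1.0 - theta[leaf]
    return p

def likelihood_counts(leaf_of, y, theta):
    """Product over leaves of theta_l^{c1_l} (1 - theta_l)^{c0_l}."""
    p = 1.0
    for leaf, th in theta.items():
        c1 = sum(1 for l, yi in zip(leaf_of, y) if l == leaf and yi == 1)
        c0 = sum(1 for l, yi in zip(leaf_of, y) if l == leaf and yi == 0)
        p *= th ** c1 * (1.0 - th) ** c0
    return p

# Toy example: five datapoints falling into two leaves "a" and "b".
leaf_of = ["a", "a", "b", "b", "b"]
y = [1, 0, 1, 1, 0]
theta = {"a": 0.3, "b": 0.8}
```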

Proof of Theorem 2.

The likelihood of a tree $T$ generating labels $\mathcal{Y}$ given features $\mathcal{X}$ can be obtained by marginalizing over $\Theta$ using the prior $P(\Theta)$ given in Section 3:

\begin{align}
P(\mathcal{Y}\mid\mathcal{X},T) &= \int_{\Theta}P(\mathcal{Y}\mid\mathcal{X},T,\Theta)P(\Theta)\,d\Theta \tag{22}\\
&= \int_{\Theta}\prod_{l\in T_{\text{leaves}}}\left(\theta_{l}^{c^{1}_{l}}\left(1-\theta_{l}\right)^{c^{0}_{l}}\right)\times \tag{23}\\
&\qquad\left(\frac{\theta_{l}^{\rho^{1}-1}(1-\theta_{l})^{\rho^{0}-1}}{B(\rho^{1},\rho^{0})}\right)d\Theta \tag{24}\\
&= \int_{\Theta}\prod_{l\in T_{\text{leaves}}}\frac{1}{B(\rho^{1},\rho^{0})}\times \tag{25}\\
&\qquad\left(\theta_{l}^{c^{1}_{l}+\rho^{1}-1}\left(1-\theta_{l}\right)^{c^{0}_{l}+\rho^{0}-1}\right)d\Theta \tag{26}\\
&= \prod_{l\in T_{\text{leaves}}}\frac{1}{B(\rho^{1},\rho^{0})}\times \tag{27}\\
&\qquad\int_{\theta_{l}}\left(\theta_{l}^{c^{1}_{l}+\rho^{1}-1}\left(1-\theta_{l}\right)^{c^{0}_{l}+\rho^{0}-1}\right)d\theta_{l} \tag{28}\\
&= \prod_{l\in T_{\text{leaves}}}\frac{B(c^{1}_{l}+\rho^{1},c^{0}_{l}+\rho^{0})}{B(\rho^{1},\rho^{0})} \tag{29}
\end{align}

where for the second equality we used Theorem 1 and the choice of prior $P(\Theta)$ , and the definition of the Beta function $B(\rho^{1},\rho^{0})$ throughout. ∎
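The closed form can be checked numerically. The sketch below assumes that each $\theta_{l}$ has the standard $\text{Beta}(\rho^{1},\rho^{0})$ density; the function names are hypothetical. It evaluates the per-leaf ratio of Beta functions in log space and compares it against a midpoint-rule approximation of the integral.

```python
import math

def log_beta(a, b):
    """log B(a, b) via the log-gamma function."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def log_leaf_likelihood(c1, c0, rho1=1.0, rho0=1.0):
    """log of B(c1 + rho1, c0 + rho0) / B(rho1, rho0) for one leaf."""
    return log_beta(c1 + rho1, c0 + rho0) - log_beta(rho1, rho0)

def log_tree_likelihood(leaf_counts, rho1=1.0, rho0=1.0):
    """Sum of per-leaf log likelihoods; leaf_counts holds (c1, c0) pairs."""
    return sum(log_leaf_likelihood(c1, c0, rho1, rho0) for c1, c0 in leaf_counts)

def leaf_likelihood_numeric(c1, c0, rho1, rho0, steps=20000):
    """Midpoint-rule integral of the Bernoulli likelihood times the Beta density."""
    h = 1.0 / steps
    total = 0.0
    for k in range(steps):
        t = (k + 0.5) * h
        total += t ** (c1 + rho1 - 1) * (1.0 - t) ** (c0 + rho0 - 1)
    return total * h / math.exp(log_beta(rho1, rho0))
```

For example, with $c^{1}=3$, $c^{0}=1$, $\rho^{1}=2$, and $\rho^{0}=3$, both routes give $B(5,4)/B(2,3)=3/70$.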

Proof of Theorem 5.

Our proof is by construction; we explicitly define the “natural” mapping both ways. We first show that any binary decision tree corresponds to a solution graph in $\mathcal{G}_{\mathcal{X},\mathcal{Y}}$ under a natural correspondence, which we explicitly construct. Given a binary decision tree $T$ with $m$ nodes (some of which may be internal and some of which are leaf nodes), we construct its solution graph $\mathcal{S}\subset\mathcal{G}_{\mathcal{X},\mathcal{Y}}$ explicitly.

Let $T_{\text{leaves}}$ and $T_{\text{internal}}$ denote the leaf and internal nodes of $T$, respectively. For node $n$ with depth $d(n)$ in $T$, denote by $\mathcal{I}(n)$ the indices of the datapoints which reach node $n$. Furthermore, if $n$ is an internal node of $T$, let $f(n)$ denote the feature on which $n$ is split into its children in $T$. Then let $\mathcal{S}=\{o_{\mathcal{I}(n),d(n)}:n\in T\}\cup\{t_{\mathcal{I}(n),d(n)}:n\in T_{\text{leaves}}\}\cup\{a_{\mathcal{I}(n),d(n),f(n)}:n\in T_{\text{internal}}\}\cup\{o_{\mathcal{I}(n)|_{f(n)=k},d(n)+1}:n\in T_{\text{internal}}\text{ and }k\in\{0,1\}\}$.

We will now show that this choice of $\mathcal{S}$ is a solution graph on $\mathcal{G}_{\mathcal{X},\mathcal{Y}}$. First note that the root node in $T$ has depth $0$ and must exist and contain the whole dataset, so $o_{[N],0}\in\mathcal{S}$. Now consider any AND node $a_{\mathcal{I}(n),d(n),f(n)}\in\mathcal{S}$. By construction, we must have that both $o_{\mathcal{I}(n)|_{f(n)=0},d(n)+1}\in\mathcal{S}$ and $o_{\mathcal{I}(n)|_{f(n)=1},d(n)+1}\in\mathcal{S}$, so all of the immediate children of $a_{\mathcal{I}(n),d(n),f(n)}$ must also be in $\mathcal{S}$. Finally, consider any OR node $o\in\mathcal{S}$. We must have one of three cases. Either:

1. $o$ is of the form $o_{\mathcal{I}(n),d(n)}$ for $n\in T_{\text{leaves}}$, in which case $t_{\mathcal{I}(n),d(n)}\in\mathcal{S}$,
2. $o$ is of the form $o_{\mathcal{I}(n),d(n)}$ for $n\in T_{\text{internal}}$, in which case $a_{\mathcal{I}(n),d(n),f(n)}\in\mathcal{S}$,
3. $o$ is of the form $o_{\mathcal{I}(n)|_{f(n)=k},d(n)+1}$ for some node $n\in T_{\text{internal}}$ and some $k$. In this case, since $n\in T_{\text{internal}}$, $n$ must have children $n_{k}$ (for $k=0,1$) in $T$ obtained by splitting node $n$ on feature $f(n)$, and so we must have that $o$ is of the form $o_{\mathcal{I}(n_{k}),d(n_{k})}$ for some $n_{k}\in T$. In this case, we can apply either Case 1 or Case 2 to $n_{k}$ to show that $o$ must have exactly one child in $\mathcal{S}$.

In all cases, each OR node $o\in\mathcal{S}$ must have exactly one child in $\mathcal{S}$ . Thus, $\mathcal{S}$ is a solution graph on $\mathcal{G}_{\mathcal{X},\mathcal{Y}}$ .

We will now show that every solution graph $\mathcal{S}$ defines a binary decision tree, and define this correspondence explicitly. For every OR node $o_{\mathcal{I},d}\in\mathcal{S}$, we create a corresponding node $n_{o}\in T$. Since the root node $r=o_{[N],0}\in\mathcal{S}$, we must have that $T$ is nonempty. Furthermore, by the definition of solution graph, we must have that for every OR node $o_{\mathcal{I},d}\in\mathcal{S}$, either $a_{\mathcal{I},d,f}\in\mathcal{S}$ for some value of $f$ or its corresponding terminal node $t_{\mathcal{I},d}\in\mathcal{S}$. If $a_{\mathcal{I},d,f}\in\mathcal{S}$ for some value of $f$, then by the definition of solution graph over $\mathcal{G}_{\mathcal{X},\mathcal{Y}}$, we must have $o^{L}\coloneqq o_{\mathcal{I}|_{f=0},d+1}\in\mathcal{S}$ and $o^{R}\coloneqq o_{\mathcal{I}|_{f=1},d+1}\in\mathcal{S}$. In this case, we connect node $n_{o}$ to each of $n_{o^{L}}$ and $n_{o^{R}}$ in $T$ with a directed edge. (If $t_{\mathcal{I},d}\in\mathcal{S}$, then $o_{\mathcal{I},d}\in\mathcal{S}$ corresponds to a leaf node in $T$.)

We now show that this process gives rise to a binary decision tree, i.e., that $T$ is a binary decision tree. First, we note that by construction, any $n_{o}\in T$ that has an outgoing edge must have exactly two outgoing edges, say to $n_{o^{L}}$ and $n_{o^{R}}$ . Furthermore, these edges exist if and only if the corresponding OR nodes in the solution graph $\mathcal{S}$ are connected through directed edges through an AND node $a_{f}$ for some $f$ . In this case, the subsets of the data at $n_{o^{L}}$ and $n_{o^{R}}$ must correspond to the subset of the data at $n_{o}$ split by feature $f$ . This implies that the subset of data that reaches $n_{o^{L}}$ and $n_{o^{R}}$ in $T$ must be a strict subset of the data that reaches $n_{o}$ . Furthermore, we note that every node in $T$ must be reachable from the root node of $T$ , $n_{r}$ (which corresponds to the start node in $\mathcal{S}$ ), so $T$ is connected. Together, these observations imply that $T$ is a binary decision tree.

Finally, we note that these two constructions are inverses. Any binary decision tree $T$ can be used to induce a solution graph $\mathcal{S}$ over $\mathcal{G}_{\mathcal{X},\mathcal{Y}}$, and $\mathcal{S}$ in turn induces a binary decision tree equivalent to $T$. This proves the claim. ∎
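The forward direction of the construction (tree to solution graph) can be sketched as follows. The tuple encodings of OR nodes `("o", I, d)`, terminal nodes `("t", I, d)`, and AND nodes `("a", I, d, f)` are hypothetical conventions chosen for illustration, as is the nested-tuple tree format; the sketch assumes binary features and that no split produces an empty child.

```python
def solution_graph(tree, X, idx=None, depth=0):
    """Build the node set S for a tree given as None (leaf) or (f, left, right)."""
    if idx is None:
        idx = frozenset(range(len(X)))
    nodes = {("o", idx, depth)}
    if tree is None:
        nodes.add(("t", idx, depth))      # leaf: OR node chooses its terminal
        return nodes
    f, left, right = tree
    nodes.add(("a", idx, depth, f))       # internal: OR node chooses AND node a_f
    idx0 = frozenset(i for i in idx if X[i][f] == 0)
    idx1 = idx - idx0
    nodes |= solution_graph(left, X, idx0, depth + 1)
    nodes |= solution_graph(right, X, idx1, depth + 1)
    return nodes

def or_children(nodes, o):
    """Terminal/AND nodes of S that hang off OR node o (same indices and depth)."""
    _, idx, d = o
    return [n for n in nodes if n[0] in ("t", "a") and n[1] == idx and n[2] == d]

# Toy example: four datapoints, two binary features, a tree with two splits.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
tree = (0, (1, None, None), None)
S = solution_graph(tree, X)
```

In this example $\mathcal{S}$ contains five OR nodes, two AND nodes, and three terminal nodes, and every OR node has exactly one child in $\mathcal{S}$.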

Proof of Theorem 6.

Let $L_{O}$ and $I_{O}$ denote the sets of OR nodes corresponding with leaf nodes and internal nodes, respectively, in the tree represented by solution $\mathcal{S}$ . Then the cost of a solution graph $\mathcal{S}$ over $\mathcal{G}_{\mathcal{X},\mathcal{Y}}$ is

\begin{align*}
&-\sum_{l\in L_{O}}\log p_{\text{stop}}(\mathcal{X}|_{l},\mathcal{Y}|_{l},d(l))\\
&\qquad-\sum_{m\in I_{O}}\log p_{\text{node}}(d(m))
\end{align*}

where for any node with depth $d$, we have

\begin{equation}
p_{\text{node}}\coloneqq\begin{cases}p_{\text{split}}(d), & \text{if node is an internal node}\\ p_{\text{stop}}(d,\mathcal{X}_{o}), & \text{if node is a leaf node}\end{cases} \tag{30}
\end{equation}

by our choices of $p_{\text{split}}$ and $p_{\text{stop}}$ in Section 4. We can further simplify this to

\begin{align*}
-\sum_{L_{O}}\log P(\mathcal{Y}|_{\text{leaf}}\mid\mathcal{X}|_{\text{leaf}},T)-\sum_{I_{O}}\log P(T\mid\mathcal{X}) &= -\log P(\mathcal{Y},T\mid\mathcal{X})
\end{align*}

by the definition of BDT.

∎

Lemma 13.

For the leaf likelihood given in Equation 4, we have that $\ell_{\text{leaf}}(a,0)\ell_{\text{leaf}}(0,b)\geq\ell_{\text{leaf}}(a,b)$ for integers $a,b\geq 0$ .

Proof of Lemma 13.

If either $a$ or $b$ is $0$ , then the claim is trivially true as $\ell_{\text{leaf}}(0,0)=1$ . Therefore, we assume both $a$ and $b$ are greater than $0$ . Using the facts that $B(x,y)=\frac{\Gamma(x)\Gamma(y)}{\Gamma(x+y)}$ , where $\Gamma$ is the gamma function, and $\Gamma(x+1)=x\Gamma(x)$ we have that:

\begin{align*}
& \ell_{\text{leaf}}(a,0)\,\ell_{\text{leaf}}(0,b)\stackrel{?}{\geq}\ell_{\text{leaf}}(a,b)\\
\iff{}& \frac{B(a+\rho_{0},\rho_{1})B(\rho_{0},b+\rho_{1})}{B(\rho_{0},\rho_{1})B(\rho_{0},\rho_{1})}\stackrel{?}{\geq}\frac{B(a+\rho_{0},b+\rho_{1})}{B(\rho_{0},\rho_{1})}\\
\iff{}& \frac{\left(\frac{\Gamma(a+\rho_{0})\Gamma(\rho_{1})}{\Gamma(a+\rho_{0}+\rho_{1})}\right)\left(\frac{\Gamma(\rho_{0})\Gamma(b+\rho_{1})}{\Gamma(b+\rho_{0}+\rho_{1})}\right)}{\left(\frac{\Gamma(\rho_{0})\Gamma(\rho_{1})}{\Gamma(\rho_{0}+\rho_{1})}\right)\left(\frac{\Gamma(\rho_{0})\Gamma(\rho_{1})}{\Gamma(\rho_{0}+\rho_{1})}\right)}\stackrel{?}{\geq}\frac{\frac{\Gamma(a+\rho_{0})\Gamma(b+\rho_{1})}{\Gamma(a+b+\rho_{0}+\rho_{1})}}{\frac{\Gamma(\rho_{0})\Gamma(\rho_{1})}{\Gamma(\rho_{0}+\rho_{1})}}\\
\iff{}& \Gamma(\rho_{0}+\rho_{1})\Gamma(a+b+\rho_{0}+\rho_{1})\stackrel{?}{\geq}\Gamma(a+\rho_{0}+\rho_{1})\Gamma(b+\rho_{0}+\rho_{1})\\
\iff{}& \left(\prod_{i=0}^{a+b-1}(i+\rho_{0}+\rho_{1})\right)\Gamma(\rho_{0}+\rho_{1})^{2}\stackrel{?}{\geq}\left(\prod_{i=0}^{a-1}(i+\rho_{0}+\rho_{1})\right)\left(\prod_{i=0}^{b-1}(i+\rho_{0}+\rho_{1})\right)\Gamma(\rho_{0}+\rho_{1})^{2}\\
\iff{}& \prod_{i=a}^{a+b-1}(i+\rho_{0}+\rho_{1})\stackrel{?}{\geq}\prod_{i=0}^{b-1}(i+\rho_{0}+\rho_{1})
\end{align*}

Since $\rho_{0},\rho_{1}>0$ and $a>0$, each factor on the LHS is at least as large as the corresponding factor on the RHS, so LHS $\geq$ RHS, which proves the claim. ∎

Lemma 14.

For the leaf likelihood given in Definition 4, we have that $\ell_{\text{leaf}}(a+b,0)\geq\ell_{\text{leaf}}(a,0)\ell_{\text{leaf}}(b,0)$ and $\ell_{\text{leaf}}(0,a+b)\geq\ell_{\text{leaf}}(0,a)\ell_{\text{leaf}}(0,b)$ for integers $a,b\geq 0$ .

Proof of Lemma 14.

\begin{align*}
& \ell_{\text{leaf}}(a+b,0)\stackrel{?}{\geq}\ell_{\text{leaf}}(a,0)\,\ell_{\text{leaf}}(b,0)\\
\iff{}& \frac{B(a+b+\rho_{0},\rho_{1})}{B(\rho_{0},\rho_{1})}\stackrel{?}{\geq}\frac{B(a+\rho_{0},\rho_{1})B(b+\rho_{0},\rho_{1})}{B(\rho_{0},\rho_{1})B(\rho_{0},\rho_{1})}\\
\iff{}& B(\rho_{0},\rho_{1})B(a+b+\rho_{0},\rho_{1})\stackrel{?}{\geq}B(a+\rho_{0},\rho_{1})B(b+\rho_{0},\rho_{1})\\
\iff{}& \frac{\Gamma(\rho_{0})\Gamma(\rho_{1})}{\Gamma(\rho_{0}+\rho_{1})}\frac{\Gamma(a+b+\rho_{0})\Gamma(\rho_{1})}{\Gamma(a+b+\rho_{0}+\rho_{1})}\stackrel{?}{\geq}\frac{\Gamma(a+\rho_{0})\Gamma(\rho_{1})}{\Gamma(a+\rho_{0}+\rho_{1})}\frac{\Gamma(b+\rho_{0})\Gamma(\rho_{1})}{\Gamma(b+\rho_{0}+\rho_{1})}\\
\iff{}& \Gamma(\rho_{0})\Gamma(a+b+\rho_{0})\Gamma(a+\rho_{0}+\rho_{1})\Gamma(b+\rho_{0}+\rho_{1})\stackrel{?}{\geq}\Gamma(\rho_{0}+\rho_{1})\Gamma(a+b+\rho_{0}+\rho_{1})\Gamma(a+\rho_{0})\Gamma(b+\rho_{0})\\
\iff{}& \frac{\Gamma(\rho_{0})\Gamma(a+b+\rho_{0})}{\Gamma(a+\rho_{0})\Gamma(b+\rho_{0})}\stackrel{?}{\geq}\frac{\Gamma(\rho_{0}+\rho_{1})\Gamma(a+b+\rho_{0}+\rho_{1})}{\Gamma(a+\rho_{0}+\rho_{1})\Gamma(b+\rho_{0}+\rho_{1})}\\
\iff{}& \frac{\prod_{i=0}^{a+b-1}(i+\rho_{0})}{\left(\prod_{i=0}^{b-1}(i+\rho_{0})\right)\left(\prod_{i=0}^{a-1}(i+\rho_{0})\right)}\stackrel{?}{\geq}\frac{\prod_{i=0}^{a+b-1}(i+\rho_{0}+\rho_{1})}{\left(\prod_{i=0}^{b-1}(i+\rho_{0}+\rho_{1})\right)\left(\prod_{i=0}^{a-1}(i+\rho_{0}+\rho_{1})\right)}\\
\iff{}& \prod_{i=0}^{b-1}\left(\frac{a+i+\rho_{0}}{i+\rho_{0}}\right)\stackrel{?}{\geq}\prod_{i=0}^{b-1}\left(\frac{a+i+\rho_{0}+\rho_{1}}{i+\rho_{0}+\rho_{1}}\right)\\
\iff{}& \prod_{i=0}^{b-1}\left(1+\frac{a}{i+\rho_{0}}\right)\stackrel{?}{\geq}\prod_{i=0}^{b-1}\left(1+\frac{a}{i+\rho_{0}+\rho_{1}}\right)
\end{align*}

Since $\rho_{0},\rho_{1}>0$ and $a>0$ , each term in the product on the LHS is positive and greater than or equal to the corresponding term on the RHS, so we have that LHS $\geq$ RHS. This proves that $\ell_{\text{leaf}}(a+b,0)\geq\ell_{\text{leaf}}(a,0)\ell_{\text{leaf}}(b,0)$ for integers $a,b\geq 0$ . The proof for $\ell_{\text{leaf}}(0,a+b)\geq\ell_{\text{leaf}}(0,a)\ell_{\text{leaf}}(0,b)$ for integers $a,b\geq 0$ follows from the symmetry of $\ell_{\text{leaf}}$ in its arguments. ∎
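Both Lemma 13 and Lemma 14 are easy to sanity-check numerically. The sketch below is only a spot check over a finite grid of counts, not a substitute for the proofs; it takes $\ell_{\text{leaf}}(a,b)=B(a+\rho_{0},b+\rho_{1})/B(\rho_{0},\rho_{1})$ as used in the derivations above, and the function names are hypothetical.

```python
import math

def log_beta(a, b):
    """log B(a, b) via the log-gamma function."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def log_leaf(a, b, rho0, rho1):
    """log of l_leaf(a, b) = B(a + rho0, b + rho1) / B(rho0, rho1)."""
    return log_beta(a + rho0, b + rho1) - log_beta(rho0, rho1)

def check_lemmas(max_count, rho0, rho1, tol=1e-9):
    """True if both inequalities hold for all 0 <= a, b <= max_count."""
    for a in range(max_count + 1):
        for b in range(max_count + 1):
            ll = lambda x, y: log_leaf(x, y, rho0, rho1)
            lemma13 = ll(a, 0) + ll(0, b) >= ll(a, b) - tol
            lemma14_first = ll(a + b, 0) >= ll(a, 0) + ll(b, 0) - tol
            lemma14_second = ll(0, a + b) >= ll(0, a) + ll(0, b) - tol
            if not (lemma13 and lemma14_first and lemma14_second):
                return False
    return True
```

The comparisons are done in log space with a small tolerance `tol` to absorb floating-point rounding in the cases where the two sides are equal (e.g., when $a=0$ or $b=0$).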

Proof of Theorem 8.

First, we show that $h$ is consistent across any AND node $a_{\mathcal{I},d,f}$ with children $o_{\mathcal{I}|_{f=0},d+1},o_{\mathcal{I}|_{f=1},d+1}$ . This follows directly from the definition of $h$ :

\begin{align}
h(a_{\mathcal{I},d,f}) ={}& \tag{31}\\
& \texttt{cost}(a_{\mathcal{I},d,f},o_{\mathcal{I}|_{f=0},d+1}) \tag{32}\\
& +h(o_{\mathcal{I}|_{f=0},d+1}) \tag{33}\\
& +\texttt{cost}(a_{\mathcal{I},d,f},o_{\mathcal{I}|_{f=1},d+1}) \tag{34}\\
& +h(o_{\mathcal{I}|_{f=1},d+1}) \tag{35}
\end{align}

Next, we show that $h$ is consistent for any OR node $o_{\mathcal{I},d}$ with children $t_{\mathcal{I},d},a_{\mathcal{I},d,1},\dots,a_{\mathcal{I},d,F}$ .
Case 1:

\begin{equation}
\ell_{\text{leaf}}(c^{1}(\mathcal{I}),c^{0}(\mathcal{I}))\geq p_{\text{split}}(d)\,\ell_{\text{leaf}}(c^{1}(\mathcal{I}),0)\,\ell_{\text{leaf}}(0,c^{0}(\mathcal{I})) \tag{36}
\end{equation}

In this case, we see that the heuristic is consistent for the terminal node child $t_{\mathcal{I},d}$ of $o_{\mathcal{I},d}$:

\begin{align}
h(o_{\mathcal{I},d}) &= -\log\ell_{\text{leaf}}(c^{1}(\mathcal{I}),c^{0}(\mathcal{I})) \tag{37}\\
&= \texttt{cost}(o_{\mathcal{I},d},t_{\mathcal{I},d}) \tag{38}
\end{align}

We also have the following:

\begin{align}
h(o_{\mathcal{I},d}) \leq{}& \tag{39}\\
& -\log p_{\text{split}}(d) \tag{40}\\
& -\log\ell_{\text{leaf}}(c^{1}(\mathcal{I}),0) \tag{41}\\
& -\log\ell_{\text{leaf}}(0,c^{0}(\mathcal{I})) \tag{42}
\end{align}

It remains to show that the heuristic is consistent for all AND node children of $o_{\mathcal{I},d}$ : $a_{\mathcal{I},d,1},\dots,a_{\mathcal{I},d,F}$ .

Case 2:

\begin{equation}
\ell_{\text{leaf}}(c^{1}(\mathcal{I}),c^{0}(\mathcal{I}))<p_{\text{split}}(d)\,\ell_{\text{leaf}}(c^{1}(\mathcal{I}),0)\,\ell_{\text{leaf}}(0,c^{0}(\mathcal{I})) \tag{43}
\end{equation}

In this case, we see that the heuristic is again consistent for the terminal node child $t_{\mathcal{I},d}$ of $o_{\mathcal{I},d}$:

\begin{align}
h(o_{\mathcal{I},d}) ={}& \tag{44}\\
& -\log p_{\text{split}}(d) \tag{45}\\
& -\log\ell_{\text{leaf}}(c^{1}(\mathcal{I}),0) \tag{46}\\
& -\log\ell_{\text{leaf}}(0,c^{0}(\mathcal{I})) \tag{47}\\
\leq{}& -\log\ell_{\text{leaf}}(c^{1}(\mathcal{I}),c^{0}(\mathcal{I})) \tag{48}\\
={}& \texttt{cost}(o_{\mathcal{I},d},t_{\mathcal{I},d}) \tag{49}
\end{align}

For the AND node children, we begin with the following:

\begin{align}
h(o_{\mathcal{I},d}) \leq{}& \tag{50}\\
& -\log p_{\text{split}}(d) \tag{51}\\
& -\log\ell_{\text{leaf}}(c^{1}(\mathcal{I}),0) \tag{52}\\
& -\log\ell_{\text{leaf}}(0,c^{0}(\mathcal{I})) \tag{53}
\end{align}

As in Case 1, it remains to show that the heuristic is consistent for all AND node children of $o_{\mathcal{I},d}$ : $a_{\mathcal{I},d,1},\dots,a_{\mathcal{I},d,F}$ .

We will now show, for both Cases 1 and 2, that from this inequality, it follows that the heuristic is consistent across all AND node children of $o_{\mathcal{I},d}$ : $a_{\mathcal{I},d,1},\dots,a_{\mathcal{I},d,F}$ .

Applying Lemmas 13 and 14 and the symmetry of $\ell_{\text{leaf}}$ , we have:

\begin{align}
h(o_{\mathcal{I},d}) \leq{}& \tag{54}\\
& -\log p_{\text{split}}(d) \tag{55}\\
& -\log\ell_{\text{leaf}}(c^{1}(\mathcal{I}),0) \tag{56}\\
& -\log\ell_{\text{leaf}}(0,c^{0}(\mathcal{I})) \tag{57}\\
\leq{}& \tag{58}\\
& -\log p_{\text{split}}(d) \tag{59}\\
& -\log\left[\ell_{\text{leaf}}(c^{1}(\mathcal{I}|_{f=0}),0)\,\ell_{\text{leaf}}(c^{1}(\mathcal{I}|_{f=1}),0)\right] \tag{60}\\
& -\log\left[\ell_{\text{leaf}}(0,c^{0}(\mathcal{I}|_{f=0}))\,\ell_{\text{leaf}}(0,c^{0}(\mathcal{I}|_{f=1}))\right] \tag{61}\\
={}& \tag{62}\\
& -\log p_{\text{split}}(d) \tag{63}\\
& -\log\left[\ell_{\text{leaf}}(c^{1}(\mathcal{I}|_{f=0}),0)\,\ell_{\text{leaf}}(0,c^{0}(\mathcal{I}|_{f=0}))\right] \tag{64}\\
& -\log\left[\ell_{\text{leaf}}(c^{1}(\mathcal{I}|_{f=1}),0)\,\ell_{\text{leaf}}(0,c^{0}(\mathcal{I}|_{f=1}))\right] \tag{65}\\
\leq{}& \tag{66}\\
& -\log p_{\text{split}}(d) \tag{67}\\
& -\log\ell_{\text{leaf}}(c^{1}(\mathcal{I}|_{f=0}),c^{0}(\mathcal{I}|_{f=0})) \tag{68}\\
& -\log\ell_{\text{leaf}}(c^{1}(\mathcal{I}|_{f=1}),c^{0}(\mathcal{I}|_{f=1})) \tag{69}
\end{align}

Applying Lemma 14, we have:

\begin{align}
h(o_{\mathcal{I},d}) \leq{}& \tag{70}\\
& -\log p_{\text{split}}(d) \tag{71}\\
& -\log\ell_{\text{leaf}}(c^{1}(\mathcal{I}),0) \tag{72}\\
& -\log\ell_{\text{leaf}}(0,c^{0}(\mathcal{I})) \tag{73}\\
\leq{}& \tag{74}\\
& -\log p_{\text{split}}(d) \tag{75}\\
& -\log\left[\ell_{\text{leaf}}(c^{1}(\mathcal{I}|_{f=0}),0)\,\ell_{\text{leaf}}(c^{1}(\mathcal{I}|_{f=1}),0)\right] \tag{76}\\
& -\log\left[\ell_{\text{leaf}}(0,c^{0}(\mathcal{I}|_{f=0}))\,\ell_{\text{leaf}}(0,c^{0}(\mathcal{I}|_{f=1}))\right] \tag{77}\\
\leq{}& \tag{78}\\
& -\log p_{\text{split}}(d) \tag{79}\\
& -2\log p_{\text{split}}(d+1) \tag{80}\\
& -\log\ell_{\text{leaf}}(c^{1}(\mathcal{I}|_{f=0}),0) \tag{81}\\
& -\log\ell_{\text{leaf}}(0,c^{0}(\mathcal{I}|_{f=0})) \tag{82}\\
& -\log\ell_{\text{leaf}}(c^{1}(\mathcal{I}|_{f=1}),0) \tag{83}\\
& -\log\ell_{\text{leaf}}(0,c^{0}(\mathcal{I}|_{f=1})) \tag{84}
\end{align}

From the above two inequalities, we have that:

\begin{equation}
h(o_{\mathcal{I},d})\leq\texttt{cost}(o_{\mathcal{I},d},a_{\mathcal{I},d,f})+h(a_{\mathcal{I},d,f}) \tag{85}
\end{equation}

as required.

∎
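The consistency inequality across AND nodes can be spot-checked numerically. The sketch below makes several assumptions that are not fixed by the proof itself: the heuristic on OR nodes is read off from Cases 1 and 2 as the smaller of the "stop" value and the "perfect split" value; the edge cost from an OR node to an AND node is taken to be $-\log p_{\text{split}}(d)$ with zero cost from an AND node to its OR children; and $p_{\text{split}}(d)=\alpha(1+d)^{-\beta}$ with illustrative values of $\alpha$ and $\beta$. All function names are hypothetical.

```python
import math

def log_beta(a, b):
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def log_leaf(c1, c0, rho=(1.0, 1.0)):
    """log l_leaf(c1, c0) under a Beta prior with parameters rho."""
    return log_beta(c1 + rho[0], c0 + rho[1]) - log_beta(rho[0], rho[1])

def p_split(d, alpha=0.95, beta=0.5):
    """Assumed split prior; alpha and beta are illustrative values only."""
    return alpha * (1 + d) ** (-beta)

def h_or(c1, c0, d):
    """Heuristic on an OR node: min of the stop value and the perfect-split value."""
    stop = -log_leaf(c1, c0)
    perfect = -math.log(p_split(d)) - log_leaf(c1, 0) - log_leaf(0, c0)
    return min(stop, perfect)

def consistent_for_split(c1_left, c0_left, c1_right, c0_right, d, tol=1e-9):
    """Check h(o) <= cost(o, a_f) + h(a_f) for a split with the given child counts."""
    parent = h_or(c1_left + c1_right, c0_left + c0_right, d)
    via_and = (-math.log(p_split(d))
               + h_or(c1_left, c0_left, d + 1)
               + h_or(c1_right, c0_right, d + 1))
    return parent <= via_and + tol
```

Calling `consistent_for_split` over a grid of child label counts and depths exercises the mixed situations in which one child falls under Case 1 and the other under Case 2.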

Corollary 15 (Admissibility of Perfect Split Heuristic).

The Perfect Split Heuristic $h$ defined in Definition 7 is admissible, i.e., given the true value of an OR node $f(o)$ , we have that $h(o)\leq f(o)$ , and given the true value of an AND node $f(a)$ , we have that $h(a)\leq f(a)$ .

Lemma 16.

Across all iterations of MAPTree, $UB[o]$ represents the minimal cost of any partial solution rooted at OR node $o$ in $\mathcal{G}^{\prime}$ .

Proof of Lemma 16.

We prove this via induction on the iteration. After the first iteration, there is only one terminal node in $\mathcal{G}^{\prime}$, and only one valid solution exists in $\mathcal{G}^{\prime}$: $\mathcal{S}_{0}=\{o_{[N],0},t_{[N],0}\}$. Thus, no nodes other than $r=o_{[N],0}$ have valid partial solutions. At this point, $UB[r]=\texttt{cost}(o_{[N],0},t_{[N],0})=\texttt{cost}(\mathcal{S}_{0})$ and $UB$ is undefined on all other nodes, as required.

In each future iteration, there is at most one terminal node $t^{*}$ added to $\mathcal{G}^{\prime}$. We will show that for any OR node $o_{\mathcal{I},d}$, $UB[o_{\mathcal{I},d}]$ represents the minimal cost of any partial solution rooted at $o_{\mathcal{I},d}$. We prove this by induction on the size of $\mathcal{I}$ for $o_{\mathcal{I},d}\in\mathcal{G}^{\prime}$.

When $|\mathcal{I}|=1$ , any split will lead to an empty subtree, meaning if $t_{\mathcal{I},d}\in\mathcal{G}^{\prime}$ then $\{o_{\mathcal{I},d},t_{\mathcal{I},d}\}$ is the only partial solution rooted at $o_{\mathcal{I},d}$ and otherwise no such partial solution exists. If $t_{\mathcal{I},d}\not\in\mathcal{G}^{\prime}$ , this implies that $o_{\mathcal{I},d}\not\in\mathcal{E}$ , meaning $UB[o_{\mathcal{I},d}]$ is undefined. If $t_{\mathcal{I},d}\in\mathcal{G}^{\prime}$ and $|\mathcal{I}|=1$ , then $UB[o_{\mathcal{I},d}]$ is $\texttt{cost}(o_{\mathcal{I},d},t_{\mathcal{I},d})$ , as required.

When $|\mathcal{I}|>1$ , we have two cases:

Case 1: There exists a minimal cost partial solution rooted at $o_{\mathcal{I},d}$ which does not contain $t^{*}$ .
In this case, the cost of this partial solution is still the minimal cost across any partial solution rooted at $o_{\mathcal{I},d}$ . $UB[o_{\mathcal{I},d}]$ remains unchanged in this case. (Otherwise a child’s $UB$ must have been updated to a value such that there is now a partial solution rooted at $o_{\mathcal{I},d}$ which contains that child and its new minimal cost partial solution. However, since the only terminal node added to $\mathcal{G}^{\prime}$ this iteration was $t^{*}$ , this implies that the child’s new minimal cost solution must contain $t^{*}$ , which is a contradiction.) Since $UB[o_{\mathcal{I},d}]$ was not changed, our inductive hypothesis over iterations states that $UB[o_{\mathcal{I},d}]$ still represents the minimal cost of any partial solution rooted at OR node $o$ in $\mathcal{G}^{\prime}$ .

Case 2: All minimal cost partial solutions rooted at $o_{\mathcal{I},d}$ contain $t^{*}$ .
Consider a child $c^{*}$ that is part of some such minimal cost partial solution. In this case, our inductive hypothesis over $|\mathcal{I}|$ gives us that $UB$ is correctly updated in this iteration. It follows that $o_{\mathcal{I},d}$ will be added to the queue in updateUpperBounds after this child because it must have lower depth. As a result, $UB[o_{\mathcal{I},d}]$ will be set to the minimal cost of any partial solution rooted at $o_{\mathcal{I},d}$ , as required.

We conclude that, in either case, $UB[o]$ represents the minimal cost of any partial solution rooted at OR node $o$ in $\mathcal{G}^{\prime}$ . ∎

Lemma 17.

getSolution( $o_{\mathcal{I},d}$ ) outputs a minimal cost partial solution of $\mathcal{G}^{\prime}$ rooted at OR node $o_{\mathcal{I},d}$ .

Proof of Lemma 17.

We will show that getSolution($o_{\mathcal{I},d}$) outputs a minimal cost partial solution of $\mathcal{G}^{\prime}$ rooted at OR node $o_{\mathcal{I},d}$ via induction on $|\mathcal{I}|$. When $|\mathcal{I}|=1$, any split will lead to an empty subtree, so $\texttt{getSolution}(o_{\mathcal{I},d})$ must return $\{o_{\mathcal{I},d},t_{\mathcal{I},d}\}$. When $|\mathcal{I}|>1$, getSolution($o_{\mathcal{I},d}$) will either stop, if stopping attains the minimal cost, or recurse on a split that yields the minimal $UB$ value. Lemma 16 shows that the $UB$ values of $o_{\mathcal{I},d}$ and its children are equal to the minimal cost across all partial solutions rooted at each of these respective nodes. As a result, if getSolution stops, then $\{o_{\mathcal{I},d},t_{\mathcal{I},d}\}$ is a minimal cost partial solution. Otherwise, if getSolution splits on feature $f$, $\{o_{\mathcal{I},d},a_{\mathcal{I},d,f}\}\,\cup\texttt{getSolution}(o_{\mathcal{I}|_{f=0},d+1})\,\cup\texttt{getSolution}(o_{\mathcal{I}|_{f=1},d+1})$ is also a minimal cost partial solution by the inductive hypothesis. ∎
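The recursion described in this proof can be sketched as follows. This is a schematic illustration, not the MAPTree implementation: the dictionary-based encoding of the explored subgraph (`UB`, `stop_cost`, `splits`) and the node labels are hypothetical, and the sketch assumes it is only called on OR nodes that have at least one partial solution.

```python
import math

def get_solution(o, UB, stop_cost, splits):
    """Return a minimal cost partial solution rooted at OR node o.

    UB:        dict mapping OR node -> minimal partial-solution cost
    stop_cost: dict mapping OR node -> cost(o, t) if its terminal node is explored
    splits:    dict mapping OR node -> list of (feature, cost(o, a_f), left, right)
    """
    best_cost = stop_cost.get(o, math.inf)
    best_split = None
    for feature, split_cost, left, right in splits.get(o, []):
        # A split is usable only if both children already have partial solutions.
        if left in UB and right in UB:
            total = split_cost + UB[left] + UB[right]
            if total < best_cost:
                best_cost, best_split = total, (feature, left, right)
    if best_split is None:
        return {("o", o), ("t", o)}           # stopping is optimal
    feature, left, right = best_split
    return ({("o", o), ("a", o, feature)}
            | get_solution(left, UB, stop_cost, splits)
            | get_solution(right, UB, stop_cost, splits))

# Toy explored subgraph: stopping at the root costs 5, splitting costs 1 + 1 + 2.
stop_cost = {"root": 5.0, "left": 1.0, "right": 2.0}
splits = {"root": [(0, 1.0, "left", "right")]}
UB = {"left": 1.0, "right": 2.0, "root": 4.0}
solution = get_solution("root", UB, stop_cost, splits)
```

Ties between stopping and splitting are broken in favor of stopping here, which is a choice of this sketch.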

Proof of Theorem 12.

We will show that upon early termination, MAPTree always returns a minimal cost solution within the explicit subgraph $\mathcal{G}^{\prime}\subset\mathcal{G}$ explored by MAPTree. From Lemma 17, we have that a minimal cost solution of $\mathcal{G}^{\prime}$ is output by MAPTree, even upon early termination. ∎

Lemma 18.

The lower bounds $LB$ represent correct lower bounds on the true value of a node in every iteration. For any OR node $o$ with true value $f(o)$ , we have that $LB[o]\leq f(o)$ and for any AND node $a$ with true value $f(a)$ , we have that $LB[a]\leq f(a)$ .

Proof of Lemma 18.

We will show that for any OR node $o_{\mathcal{I},d}$ , we have that $LB[o_{\mathcal{I},d}]\leq f(o_{\mathcal{I},d})$ . This follows from Corollary 15. Throughout MAPTree, $LB[o_{\mathcal{I},d}]$ is set to either:

1. $h(o_{\mathcal{I},d})$

2. $\min_{c\in\{t_{\mathcal{I},d},a_{\mathcal{I},d,1},\dots,a_{\mathcal{I},d,F}\}}\texttt{cost}(o_{\mathcal{I},d},c)+LB[c]$

In the first case, Corollary 15 gives us that $h(o_{\mathcal{I},d})\leq f(o_{\mathcal{I},d})$ . In the second case, we induct on iteration. First, though MAPTree does not query $LB$ for nodes on which a value has not yet been assigned, we will assume for the purpose of this proof that $LB$ defaults to $0$ . Thus, before the first iteration, $LB$ is $0$ across all nodes. Our cost function cost is nonnegative, so $LB[o_{\mathcal{I},d}]=0\leq f(o_{\mathcal{I},d})$ must hold. For future iterations then, we have the following for any node $o_{\mathcal{I},d}$ on which $LB$ is defined:

$$\begin{aligned}
LB[o_{\mathcal{I},d}] &:=\min_{c\in\{t_{\mathcal{I},d},a_{\mathcal{I},d,1},\dots,a_{\mathcal{I},d,F}\}}\texttt{cost}(o_{\mathcal{I},d},c)+LB[c] &&\text{(86)}\\
&\leq\min_{c\in\{t_{\mathcal{I},d},a_{\mathcal{I},d,1},\dots,a_{\mathcal{I},d,F}\}}\texttt{cost}(o_{\mathcal{I},d},c)+f(c) &&\text{(87)}\\
&\leq f(o_{\mathcal{I},d}) &&\text{(88)}
\end{aligned}$$

Thus, $LB$ is a true lower bound, as required. ∎

Lemma 19.

For any OR node $o$ , if $LB[o]<UB[o]$ , then $o$ must have some OR node descendant $o^{\prime}$ such that $o^{\prime}\not\in\mathcal{E}$ .

Proof of Lemma 19.

We prove the contrapositive. Consider any OR node $o_{\mathcal{I},d}$ . We show that if every OR node descendant of $o_{\mathcal{I},d}$ is in $\mathcal{E}$ , then $LB[o_{\mathcal{I},d}]=UB[o_{\mathcal{I},d}]$ . We show this via induction on $|\mathcal{I}|$ . When $|\mathcal{I}|=1$ , splitting further incurs infinite cost, so $LB[o_{\mathcal{I},d}]=UB[o_{\mathcal{I},d}]=\texttt{cost}(o_{\mathcal{I},d},t_{\mathcal{I},d})$ . When $|\mathcal{I}|>1$ , if every OR node descendant of $o_{\mathcal{I},d}$ is in $\mathcal{E}$ , then $o_{\mathcal{I},d}\in\mathcal{E}$ must also hold. Since the OR node descendants of the children of $o_{\mathcal{I},d}$ must be in $\mathcal{E}$ as well, they must all have matching $UB$ and $LB$ by inductive hypothesis, meaning $LB[o_{\mathcal{I},d}]=UB[o_{\mathcal{I},d}]$ . ∎

Lemma 20.

If $LB[r]<UB[r]$ , then findNodeToExpand returns an unexpanded OR node $o\not\in\mathcal{E}$ .

Proof of Lemma 20.

This follows from Lemma 19 together with the selection rule of findNodeToExpand, which selects an AND node with the lowest lower bound and then the child of this AND node with the largest gap between $LB$ and $UB$ , so a child with a nonzero gap is chosen whenever one exists. ∎

Proof of Theorem 9.

Lemma 19 gives us that $LB[r]=UB[r]$ must hold upon exhaustive exploration of the search space. From Lemma 20, we have that MAPTree will always expand a new node every iteration that MAPTree has not completed. Since $\mathcal{G}$ is finite, as discussed after Definition 4, it follows that MAPTree must eventually complete. ∎

Proof of Theorem 10.

This follows directly from Theorem 9, Lemma 16, and Lemma 18. ∎

Appendix B Experiment Details and Additional Experiments

B.1 Experiment Details

In Section 6, we compared the performance of MAPTree against various state-of-the-art baselines. In this subsection, we describe those baselines and the experiments in more detail.

Speed Comparisons against MCMC and SMC

In this set of experiments in Section 6, we compared against the Sequential Monte Carlo (SMC) and Markov-Chain Monte Carlo (MCMC) methods (Lakshminarayanan, Roy, and Teh 2013), which sample from the BCART posterior of Chipman, George, and McCulloch (1998). We used the posterior distribution hyperparameters for each algorithm specified in Lakshminarayanan, Roy, and Teh (2013). To gather results for each baseline at varying runtimes, we set the number of islands in SMC to 10, 30, 100, 300, and 1000 and the number of iterations for MCMC to 10, 30, 100, 300, and 1000. For each of these settings, we performed 10 runs with different random seeds and recorded the average time across runs, the mean of the highest log posterior discovered in each run, and a 95% bootstrapped confidence interval of that mean. MAPTree was run with the number of expansions limited to 10, 30, 100, 300, 1000, 3000, 10000, 30000, and 100000. We also performed one additional run of MAPTree with a 10 minute time limit. Since MAPTree is a deterministic algorithm, we did a single run for each of these settings, recording the time and log posterior of the returned tree. Runs that ran out of memory within the time limit on our computing cluster (16 GB of RAM) were discarded.
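The summary statistics described above for the sampling baselines can be sketched as follows. This is an illustrative sketch under stated assumptions: it uses a percentile bootstrap with 10,000 resamples (the text does not specify the bootstrap variant), the function name `bootstrap_ci_of_mean` is hypothetical, and the log posterior values are made-up placeholders, not results from the experiments.

```python
import random
import statistics


def bootstrap_ci_of_mean(values, n_resamples=10_000, level=0.95, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lo_idx = int((1 - level) / 2 * n_resamples)
    hi_idx = int((1 + level) / 2 * n_resamples) - 1
    return means[lo_idx], means[hi_idx]


# Placeholder values: the highest log posterior found in each of 10 seeded runs.
best_log_posteriors = [
    -412.3, -409.8, -415.1, -411.0, -410.4,
    -413.7, -408.9, -414.2, -411.6, -412.9,
]
mean_best = statistics.fmean(best_log_posteriors)
ci_low, ci_high = bootstrap_ci_of_mean(best_log_posteriors)
```

Because MAPTree is deterministic, no such interval is needed for it; a single (time, log posterior) pair per expansion limit suffices.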

Note that, given that MAPTree, the SMC baseline, and MCMC baseline all explore the same posterior, measuring their relative generalization performance would not be meaningful; it is more meaningful to measure how quickly they can explore the posterior $P(T|\mathcal{X},\mathcal{Y})$ to discover the maximum a posteriori tree. Given infinite runtime, all three algorithms (MAPTree, SMC, and MCMC) should recover the maximum a posteriori tree from the BCART posterior. However, recent work has proven that algorithms such as SMC and MCMC experience long mixing times on the BCART posterior (Kim and Rockova 2023). Our experiments are consistent with these observations; we find that MAPTree is able to find higher likelihood trees faster than the SMC and MCMC baselines in most datasets. Furthermore, in 5 of the 16 datasets, MAPTree is able to recover a maximum a posteriori tree and provide a certificate of optimality.

Fitting a Synthetic Dataset

In the remaining two sets of experiments, we compare MAPTree’s generalization performance and model size to baseline state-of-the-art ODT and OSDT algorithms. ODT algorithms search for trees which minimize misclassification error given a maximum depth. OSDT algorithms search for trees which minimize the same objective but with an added per-leaf sparsity penalty in lieu of a hard depth constraint. We use DL8.5 (Aglin, Nijssen, and Schaus 2020) as our baseline ODT algorithm and GOSDT (Lin et al. 2020) as our baseline OSDT algorithm. Note that these algorithms optimize different objectives than MAPTree and do not explicitly explore the posterior $P(T|\mathcal{X},\mathcal{Y})$ , so we are primarily interested in the different methods’ generalization performance. We also use CART with constrained depth (Breiman et al. 1984) as the baseline representative of greedy, top-down approaches. All of these baselines are sensitive to their choices of hyperparameters, in particular the maximum depth for CART and DL8.5, and the sparsity penalty for GOSDT. We experimented with these hyperparameters to find the best-performing ones for each baseline algorithm, and presented the results for representative settings in Section 6. In particular, we chose maximum depth $4$ for CART, maximum depths $4$ and $5$ for DL8.5, and sparsity penalties $\frac{10}{32}$ and $\frac{1}{32}$ for GOSDT. These two sparsity penalties were taken from Lin et al. (2020): the former was used in the evaluation of GOSDT’s speed and the latter in the evaluation of its accuracy. Our experiments set a time limit of 1 minute across all algorithms; the best tree discovered by each algorithm within this time limit was recorded.
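For reference, the two baseline objectives described above can be written as follows. The notation ( $N$ training samples, depth limit $D$ , per-leaf penalty $\lambda$ , leaf count $|\mathrm{leaves}(T)|$ ) is introduced here for illustration only, and the normalization by $N$ is an assumption; the baseline papers give the exact formulations.

$$T_{\mathrm{ODT}}\in\arg\min_{T:\,\mathrm{depth}(T)\leq D}\ \frac{1}{N}\sum_{i=1}^{N}\mathbb{1}[T(x_{i})\neq y_{i}],\qquad T_{\mathrm{OSDT}}\in\arg\min_{T}\ \frac{1}{N}\sum_{i=1}^{N}\mathbb{1}[T(x_{i})\neq y_{i}]+\lambda\,|\mathrm{leaves}(T)|$$

Neither objective involves the posterior $P(T|\mathcal{X},\mathcal{Y})$ , which is why these baselines are compared on generalization performance and tree size rather than on log posterior.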

The synthetic dataset was created via the process described in Section 6.

Accuracy, Likelihood and Size Comparisons on Real World Benchmarks

In this set of experiments, we compared MAPTree to the baseline algorithms described earlier in Appendix B.1 on the CP4IM datasets, with the hyperparameters for all algorithms set to the same values as above. Again, our experiments set a time limit of 1 minute across all algorithms; the best tree discovered by each algorithm within this time limit was recorded.

The metrics we measure are the per-sample test log likelihood and test accuracy (relative to the performance of CART), and the total number of nodes in the trained tree. We performed stratified 10-fold cross validation on each dataset and recorded the average value across folds for each metric. The average per-sample test log likelihood and test accuracy of CART with maximum depth 4 were subtracted from those of all algorithms on each dataset to get the relative per-sample test log likelihood and test accuracy. The plots in Figure 5 are box-and-whisker plots of the metric values across all 16 datasets, where each box represents the 25th to 75th percentile, whiskers extend out to at most $1.5\times$ the size of the box body, and the remaining points are marked as outliers.
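The relative metrics and the box-and-whisker convention described above can be sketched as follows. This is a sketch under assumptions: the helper names are hypothetical, the quartiles use the "inclusive" method of Python's `statistics.quantiles` (a plotting library may interpolate differently), and all numbers are placeholders rather than results from the experiments.

```python
import statistics


def relative_to_baseline(metric, baseline):
    """Subtract the baseline (e.g. CART at depth 4) value on each dataset."""
    return {name: metric[name] - baseline[name] for name in metric}


def box_summary(values):
    """Quartiles, whisker ends, and outliers under the 1.5 x IQR convention."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lo_limit, hi_limit = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in values if lo_limit <= v <= hi_limit]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        # Whiskers end at the most extreme points still within the limits.
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": sorted(v for v in values if v < lo_limit or v > hi_limit),
    }


# Placeholder per-dataset accuracies (averaged over folds).
cart_accuracy = {"dataset_a": 0.80, "dataset_b": 0.70}
model_accuracy = {"dataset_a": 0.85, "dataset_b": 0.69}
relative_accuracy = relative_to_baseline(model_accuracy, cart_accuracy)

# Placeholder metric values with one clear outlier.
summary = box_summary([1.0, 2.0, 3.0, 4.0, 5.0, 100.0])
```

In the placeholder data, the value 100.0 lies beyond the upper limit and is reported as an outlier, while the whiskers end at 1.0 and 5.0.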

B.2 Additional Experiments

In this subsection, we present additional experimental results that were omitted from the main paper due to space constraints.

Figure 6 shows that MAPTree generates trees which outperform both the greedy, top-down approaches and ODT methods in test accuracy for various training dataset sizes and values of label corruption proportion $\epsilon$ .

Speed Comparisons against MCMC and SMC (Full)

In this subsection, we include the results of the speed comparisons of MAPTree with SMC and MCMC on the remaining 12 datasets of the CP4IM benchmark that were not presented in Section 6 due to space constraints. Figure 7 demonstrates a similar trend on the additional datasets as was demonstrated by Figure 4, namely that MAPTree generally outperforms SMC and MCMC and is able to find trees with higher log posterior faster than the baseline algorithms.

Fitting a Synthetic Dataset (Full)

In this subsection, we include the results of the synthetic data experiment against benchmarks with a more exhaustive list of hyperparameters, which we omitted in Section 6 due to space constraints. Figures 8, 9, and 10 demonstrate similar trends to those in Figure 6, namely that MAPTree generally outperforms the baseline algorithms with less training data and is more robust to label noise than the baselines.

Accuracy, Likelihood and Size Comparison on Real World Benchmarks (Full)

In this subsection, we include the results of accuracy, likelihood, and size comparisons against benchmarks with a more exhaustive list of hyperparameter settings, which we omitted in Section 6 due to space constraints. Figure 11 demonstrates a similar trend as in Figure 5, namely that MAPTree generally either a) outperforms the baseline algorithms in generalization performance, or b) performs comparably but with smaller trees. Further, we observe that CART and DL8.5 are sensitive to their hyperparameter settings: at higher maximum depths, both algorithms output much larger trees that do not perform any better than their shallower counterparts.

Hyperparameters of MAPTree

In this subsection, we demonstrate that MAPTree is not sensitive to the choice of hyperparameters $\alpha$ and $\beta$ . We run MAPTree on all 16 benchmark datasets from CP4IM (Guns, Nijssen, and De Raedt 2011) with seven different hyperparameter settings of the prior distribution used in MAPTree:

0. $\alpha=0.999,\beta=0.1$

1. $\alpha=0.99,\beta=0.2$

2. $\alpha=0.95,\beta=0.5$

3. $\alpha=0.9,\beta=1.0$

4. $\alpha=0.8,\beta=2.0$

5. $\alpha=0.5,\beta=4.0$

6. $\alpha=0.2,\beta=8.0$

These prior specifications were chosen to cover a range of the hyperparameters $\alpha$ and $\beta$ that induce MAPTree to have different priors over the probability of splitting. We expect the size of trees generated to be higher for earlier priors and lower for later priors. Relative to the first prior specification, the last prior specification assumes over $1000\times$ lower probability of splitting at depth $1$ a priori.
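The $1000\times$ figure can be checked with a short calculation. The sketch below assumes the standard BCART split prior of Chipman, George, and McCulloch (1998), in which a node at depth $d$ splits with probability $\alpha(1+d)^{-\beta}$ ; the function name `split_probability` is hypothetical and the depth convention is taken as stated in the text.

```python
def split_probability(alpha: float, beta: float, depth: int) -> float:
    """Prior probability of splitting at `depth`, assuming alpha * (1 + d)^(-beta)."""
    return alpha * (1.0 + depth) ** (-beta)


# The seven (alpha, beta) prior specifications listed above, in order.
PRIORS = [
    (0.999, 0.1),
    (0.99, 0.2),
    (0.95, 0.5),
    (0.9, 1.0),
    (0.8, 2.0),
    (0.5, 4.0),
    (0.2, 8.0),
]

# Split probability at depth 1 under each prior specification.
depth_one = [split_probability(a, b, depth=1) for a, b in PRIORS]

# How much less likely a depth-1 split is under the last prior than the first.
ratio = depth_one[0] / depth_one[-1]
```

Under this assumed form, the depth-1 split probability decreases monotonically from the first specification to the last, and `ratio` exceeds 1000, consistent with the statement above.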

We measure the test accuracy relative to CART, the per-sample test log likelihood relative to CART, and the model size of the trees generated by MAPTree with the different hyperparameter settings. Figure 12 demonstrates our results. MAPTree does not show significant sensitivity to its hyperparameters across any metric.

Appendix C Implementation Details

C.1 Reversible Sparse Bitset

In order to efficiently explore the search space $\mathcal{G}_{\mathcal{X},\mathcal{Y}}$ , described in Definition 4, MAPTree must be able to compactly represent subproblems and move between them efficiently. Subproblems in $\mathcal{G}_{\mathcal{X},\mathcal{Y}}$ correspond with a subset $\mathcal{I}$ of samples from the dataset $\mathcal{X},\mathcal{Y}$ as well as a given depth. We represent these subproblems using a reversible sparse bitset, a data structure also used in DL8.5 (Aglin, Nijssen, and Schaus 2020). Reversible sparse bitsets represent $\mathcal{I}$ as a list of indexed bitstring blocks that correspond with nonempty subarrays of the current bitset. These blocks also record a history of their previous values, allowing us to efficiently “reverse” a bitset to its previous value before branching into another region of the search space. For more details on reversible sparse bitsets, we direct the reader to Verhaeghe, Lecoutre, and Schaus (2018).
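To make the idea concrete, the following is a deliberately simplified sketch of a reversible sparse bitset. It is not the data structure used in MAPTree or DL8.5: the class and method names are hypothetical, and the checkpointing is naive (it copies the list of active blocks on every save), whereas Verhaeghe, Lecoutre, and Schaus (2018) describe a more efficient scheme.

```python
class ReversibleSparseBitset:
    """Set of sample indices stored as 64-bit blocks, with undo support."""

    def __init__(self, n: int) -> None:
        n_words = (n + 63) // 64
        self.words = [(1 << 64) - 1] * n_words
        if n % 64:
            self.words[-1] = (1 << (n % 64)) - 1  # clear the padding bits
        self.active = list(range(n_words))  # indices of nonzero blocks
        self.trail = []        # (block index, previous value) history
        self.checkpoints = []  # (trail length, active blocks) at each save

    def save(self) -> None:
        """Mark the current state so that restore() can return to it."""
        self.checkpoints.append((len(self.trail), list(self.active)))

    def intersect(self, mask) -> None:
        """Keep only the samples whose bit is set in `mask` (one int per block)."""
        still_active = []
        for i in self.active:  # empty blocks are skipped entirely
            new = self.words[i] & mask[i]
            if new != self.words[i]:
                self.trail.append((i, self.words[i]))
                self.words[i] = new
            if new:
                still_active.append(i)
        self.active = still_active

    def restore(self) -> None:
        """Undo every change made since the matching save()."""
        trail_len, active = self.checkpoints.pop()
        while len(self.trail) > trail_len:
            i, old = self.trail.pop()
            self.words[i] = old
        self.active = active

    def count(self) -> int:
        return sum(bin(self.words[i]).count("1") for i in self.active)


bitset = ReversibleSparseBitset(130)
bitset.save()
bitset.intersect([0, 0b1010, 0b11])  # e.g. descend into one branch of a split
count_in_branch = bitset.count()
bitset.restore()                      # back up before trying another split
count_after_restore = bitset.count()
```

The save/intersect/restore pattern mirrors how a search can descend into a subproblem and then return to the parent without recomputing the parent's subset.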

C.2 Caching Subproblems

It is also necessary to identify equivalent subproblems in our search. To do this, we cache each explored subproblem $o_{\mathcal{I},d}$ . Previous ODT algorithms cache $(\mathcal{I},d)$ explicitly or the path of splits used to reach the subproblem: loosely, $\text{path}(o_{\mathcal{I},d})$ (Demirović et al. 2022; Aglin, Nijssen, and Schaus 2020). The former approach takes up $O(N)$ memory per subproblem whereas the latter uses less memory but does not identify all equivalences between subproblems (i.e. two paths may result in the same subset of datapoints), meaning it is slower. In MAPTree, we provide a probably correct cache which takes $O(1)$ memory per subproblem and always identifies equivalent subproblems. The cache stores the depth $d$ and constructs a 128-bit hash of $\mathcal{I}$ for each subproblem $o_{\mathcal{I},d}$ . We describe the 128-bit hashing function in Algorithm 6.

Algorithm 6 subsetHash

Input: Subset of sample indices $\mathcal{I}$
Output: 128-bit hash value $h$

1: Let bitset be a bitset of size $N$
2: for all $i\in[N]$ do
3:     $\texttt{bitset}[i]:=i\in\mathcal{I}$
4: end for
5: Let $h_{1},h_{2}$ be 64-bit integers.
6: for all $b\in[N/64]$ do
7:     Let block be $\texttt{bitset}[64b:64(b+1)]$
8:     $h_{1}:=h_{1}+\texttt{block}\times(377424577268497867)^{b}$
9:     $h_{2}:=h_{2}+\texttt{block}\times(285989758769553131)^{b}$
10: end for
11: return $(h_{1},h_{2})$
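A possible Python rendering of Algorithm 6 is sketched below. Several details are assumptions not fixed by the pseudocode: $h_{1}$ and $h_{2}$ start at zero, arithmetic wraps modulo $2^{64}$ (as unsigned 64-bit integers would), the number of blocks is $\lceil N/64\rceil$ , and bit $i \bmod 64$ of block $\lfloor i/64\rfloor$ represents sample $i$ . The function name `subset_hash` is hypothetical.

```python
MASK64 = (1 << 64) - 1
P1 = 377424577268497867  # multiplier for the first 64-bit half
P2 = 285989758769553131  # multiplier for the second 64-bit half


def subset_hash(indices, n):
    """Return a 128-bit hash (h1, h2) of a subset of range(n)."""
    n_blocks = (n + 63) // 64
    blocks = [0] * n_blocks
    for i in indices:
        if not 0 <= i < n:
            raise ValueError(f"index {i} is outside [0, {n})")
        blocks[i // 64] |= 1 << (i % 64)

    h1 = h2 = 0      # assumed initial values
    pow1 = pow2 = 1  # P1**b and P2**b, reduced modulo 2**64
    for block in blocks:
        h1 = (h1 + block * pow1) & MASK64
        h2 = (h2 + block * pow2) & MASK64
        pow1 = (pow1 * P1) & MASK64
        pow2 = (pow2 * P2) & MASK64
    return h1, h2


# Hypothetical cache keyed on (hash of the subset, depth).
cache = {}
key = (subset_hash([1, 3, 70], n=128), 2)
cache[key] = {"LB": 0.0}  # placeholder payload
```

Equal subsets always produce equal keys regardless of the path of splits that reached them; distinct subsets can collide only if both 64-bit halves agree.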
