Learnability of Parameter-Bounded Bayes Nets

Arnab Bhattacharyya
National University of Singapore

Davin Choo
National University of Singapore

Sutanu Gayen
Indian Institute of Technology Kanpur

Dimitrios Myrisiotis
CNRS@CREATE LTD

Abstract

Bayes nets are extensively used in practice to efficiently represent joint probability distributions over a set of random variables and capture dependency relations. In a seminal paper, Chickering et al. (JMLR 2004) showed that given a distribution $\mathbbm{P}$ , that is defined as the marginal distribution of a Bayes net, it is $\mathsf{NP}$ -hard to decide whether there is a parameter-bounded Bayes net that represents $\mathbbm{P}$ . They called this problem LEARN. In this work, we extend the $\mathsf{NP}$ -hardness result of LEARN and prove the $\mathsf{NP}$ -hardness of a promise search variant of LEARN, whereby the Bayes net in question is guaranteed to exist and one is asked to find such a Bayes net. We complement our hardness result with a positive result about the sample complexity that is sufficient to recover a parameter-bounded Bayes net that is close (in TV distance) to a given distribution $\mathbbm{P}$ , that is represented by some parameter-bounded Bayes net, generalizing a degree-bounded sample complexity result of Brustle et al. (EC 2020).

1 Introduction

Bayesian networks [Pea88], or simply Bayes nets, are directed acyclic graphs (DAGs), accompanied by a collection of conditional probability distributions (that is, one for each vertex), that are used to represent joint probability distributions over dependent random variables in an elegant and succinct manner. As an example, consider a distribution $\mathbbm{P}$ over five Boolean variables $\bm{X}=\{X_{1},X_{2},X_{3},X_{4},X_{5}\}$ . Regardless of the dependencies between the variables, $\mathbbm{P}$ can always be represented by a lookup table that has $2^{5}-1=31$ entries, with one entry for each possible Boolean outcome except one. However, known dependencies between variables may induce a sparser representation. If $X_{2}$ depends on $X_{1}$ , $X_{3}$ depends on $\{X_{2},X_{5}\}$ , and $X_{4}$ depends on $X_{3}$ , then the joint distribution $\mathbbm{P}(x_{1},\ldots,x_{5})$ decomposes as

\mathbbm{P}(x_{1})\cdot\mathbbm{P}(x_{2}\mid x_{1})\cdot\mathbbm{P}(x_{3}\mid x_{2},x_{5})\cdot\mathbbm{P}(x_{4}\mid x_{3})\cdot\mathbbm{P}(x_{5}).

In fact, one can represent $\mathbbm{P}$ with a relatively sparse Bayes net $\mathcal{G}$ (see Figure 1) with conditional probability tables (CPTs) associated with each vertex. Observe that $10<31$ numbers suffice to describe the CPTs: one for each of the Bernoulli distributions of $X_{1}$ and $X_{5}$, two for each of the conditional probability distributions of $X_{2}$ and $X_{4}$, and four for that of $X_{3}$. In the rest of this work, we refer to the numbers used in defining the CPTs above as parameters.

Figure 1: Left: A Bayes net $\mathcal{G}$ such that the distribution $\mathbbm{P}$ of our example is represented by $\mathcal{G}$. Right: A Bayes net $\mathcal{H}$ such that the distribution that arises from the distribution $\mathbbm{P}$ after marginalizing out $X_{3}$ is represented by $\mathcal{H}$.

It is a standard result [Pea88, CHM04] that there exists a Bayes net $\mathcal{G}$ of $p$ parameters that represents a probability distribution $\mathbbm{P}$ if and only if $\mathbbm{P}$ is Markov with respect to some Bayes net $\mathcal{H}$ of $p$ parameters that has the same underlying DAG as $\mathcal{G}$ . (It could be that $\mathcal{G}=\mathcal{H}$ , but this is not necessary.) Here, the property of a distribution $\mathbbm{P}$ being Markov with respect to a Bayes net $\mathcal{G}$ means that a certain graphical separation condition in the underlying DAG of $\mathcal{G}$ , known as d-separation, implies conditional independence in $\mathbbm{P}$ (see Section 2.2 for formal definitions). We shall make use of this equivalence in the sequel.

A series of works studied the problem of learning the underlying DAG of a Bayes net from data by searching for a DAG that maximizes a certain scoring criterion; see, e.g., [CH92, SDLC93, HGC95]. This task was later shown to be $\mathsf{NP}$-hard by [Chi96], which raised the following natural and fundamental question:

Given a succinct description of a distribution $\mathbbm{P}$ (that is not in terms of a Bayes net), how easy is it to find a Bayes net $\mathcal{G}$ such that $\mathbbm{P}$ is Markov with respect to $\mathcal{G}$ ?

Unfortunately, [CHM04] showed that deciding whether a given distribution $\mathbbm{P}$ is Markov with respect to some Bayes net of at most $p\in\mathbb{N}$ parameters or not is $\mathsf{NP}$ -hard.

Remark 1.1.

In [CHM04], the distribution $\mathbbm{P}$ is described as a certain marginal of a Bayes net of small in-degree, i.e., as a succinct Bayes net description over variables $\bm{X}$, along with a subset of variables $\bm{S}\subseteq\bm{X}$ to marginalize out. For example, any distribution $\mathbbm{P}^{\prime}$ that is Markov with respect to the left Bayes net in Figure 1 is Markov with respect to the right Bayes net in Figure 1 after marginalizing out $X_{3}$. Note that all possible distributions over $\bm{X}$ are Markov with respect to some Bayes net over a clique, but such a Bayes net requires $2^{|\bm{X}|}-1$ parameters.

Regarding upper bounds, there are well-known algorithms for learning the underlying DAG of a Bayes net from distributional samples such as the PC [SGS00] and GES [Chi02] algorithms. Recently, [BCD20] also gave finite sample guarantees of learning Bayes nets (that have $n$ nodes, each taking values over an alphabet $\Sigma$ ) from samples. When given the promise that the underlying DAG has bounded in-degree of $d$ , [BCD20, Theorem 10] asserts that using

\mathcal{O}\left(\frac{\log\frac{1}{\delta}}{\varepsilon^{2}}\left(n\left|\Sigma\right|^{d+1}\log\left(\frac{n\left|\Sigma\right|}{\varepsilon}\right)+n\cdot d\cdot\log n\right)\right)

(1)

samples from the underlying distribution $\mathbbm{P}$ , one can learn $\mathbbm{P}$ up to total variation (TV) distance $\varepsilon$ with probability at least $1-\delta$ .

One standard way to measure the complexity of a Bayes net is by imposing an upper bound on any node’s in-degree. In this work, we measure the complexity in terms of the number of parameters $p$, which is a more fine-grained measure than that of maximum in-degree $d$. For instance, a Bayes net on $n$ Boolean variables with maximum in-degree $d$ could have as few as $O(n+2^{d})$ parameters (e.g., a star graph where $d$ leaves point towards the center of the star) and as many as $\Omega(n\cdot 2^{d})$ parameters (e.g., a complete graph).
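As a quick worked instance of the lower end of this range, consider the star example under the additional assumption (ours, for illustration only) that the $n-d-1$ nodes outside the star have no parents. Each parentless Boolean node contributes one parameter and the center contributes $2^{d}$, while in general a Boolean node with at most $d$ Boolean parents contributes at most $2^{d}$:

p_{\mathrm{star}}=\underbrace{(n-1)\cdot 1}_{\text{parentless nodes}}+\underbrace{2^{d}}_{\text{center}}=n-1+2^{d},\qquad p\leq n\cdot 2^{d}\ \text{ for any DAG with maximum in-degree }d.

For $n=20$ and $d=10$, this gives $p_{\mathrm{star}}=1043$, against the general ceiling of $20\cdot 2^{10}=20480$.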

1.1 Our Contributions

Our two main contributions are that we extend the hardness result of [CHM04] and generalize the finite sample complexity result of [BCD20].

Contribution 1.

We extend the hardness result of [CHM04] to the setting where the Bayes net in question is promised to have a small number of parameters. In computational complexity theory, this is known as a promise problem, which generalizes a decision problem in that the input is promised to belong to a certain subset of possible inputs. Our new hardness result confirms the common intuition that it is hard to search for a Bayes net $\mathcal{G}$ such that a given probability distribution is Markov with respect to $\mathcal{G}$, even if it is known that the distribution in question is Markov with respect to some Bayes net that has a small number of parameters.

Definition 1.2 (The REALIZABLE-LEARN problem).

Given as input variables $\bm{X}=(X_{1},\ldots,X_{n})$ , a probability distribution $\mathbbm{P}$ on $\bm{X}$ , a parameter bound $p\in\mathbb{N}$ , and the promise that there exists a Bayes net $\mathcal{G}$ with at most $p$ parameters such that $\mathbbm{P}$ is Markov with respect to $\mathcal{G}$ , output a Bayes net $\mathcal{G}^{\prime}$ with at most $p$ parameters such that $\mathbbm{P}$ is Markov with respect to $\mathcal{G}^{\prime}$ .

Theorem 1.3.

REALIZABLE-LEARN is $\mathsf{NP}$ -hard.

Technically speaking, REALIZABLE-LEARN is a promise search problem. While $\mathsf{NP}$ -hardness results usually revolve around decision problems, $\mathsf{NP}$ -hardness naturally extends to the more general case of search problems when Turing reductions are considered. (Turing reductions comprise a very broad class of reductions, whereby an efficient algorithm for a problem yields an efficient algorithm for another.)

Contribution 2.

We generalize the finite sample result of [BCD20] from the degree-bounded setting to the parameter-bounded setting.

Theorem 1.4 (Approximating parameter-bounded Bayes nets using samples).

Fix any accuracy parameter $\varepsilon>0$ and confidence parameter $\delta>0$ . Given sample access to a distribution $\mathbbm{P}$ over $n$ variables, each defined on the alphabet $\Sigma$ , and the promise that $\mathbbm{P}$ is Markov with respect to a Bayes net with at most $p$ parameters,

\mathcal{O}\!\left(\frac{\log\frac{1}{\delta}}{\varepsilon^{2}}\left(p\log\left(\frac{n\left|\Sigma\right|}{\varepsilon}\right)+n\frac{\log\left(\frac{p}{n\left(|\Sigma|-1\right)}\right)}{\log\left|\Sigma\right|}\log n\right)\right)

samples from $\mathbbm{P}$ suffice to learn the underlying DAG of a Bayes net $\mathcal{G}$ with at most $p$ parameters and define a distribution $\mathbbm{Q}$ that is Markov with respect to $\mathcal{G}$ such that $d_{\mathrm{TV}}(\mathbbm{P},\mathbbm{Q})\leq\varepsilon$ with success probability at least $1-\delta$ .

Notice that when all in-degrees are at most $d$, we have $p\leq(\left|\Sigma\right|-1)\cdot n\cdot\left|\Sigma\right|^{d}$, so our result generalizes the bound of [BCD20] given in Equation 1.¹

¹ One can further generalize Theorem 1.4 to the case where each node has a different alphabet size, e.g., $X_{i}$ has alphabet $\Sigma_{i}$, but this is a straightforward extension.
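To spell out why the bound of Theorem 1.4 is never worse than Equation 1 in the degree-bounded case, one can substitute $p\leq(|\Sigma|-1)\cdot n\cdot|\Sigma|^{d}$ into each of the two inner terms. The following is a short check of that substitution and uses only quantities already defined above:

\begin{aligned}
p\log\left(\frac{n|\Sigma|}{\varepsilon}\right)&\leq(|\Sigma|-1)\,n\,|\Sigma|^{d}\log\left(\frac{n|\Sigma|}{\varepsilon}\right)\leq n\,|\Sigma|^{d+1}\log\left(\frac{n|\Sigma|}{\varepsilon}\right),\\
n\,\frac{\log\left(\frac{p}{n(|\Sigma|-1)}\right)}{\log|\Sigma|}\,\log n&\leq n\,\frac{\log\left(|\Sigma|^{d}\right)}{\log|\Sigma|}\,\log n=n\cdot d\cdot\log n.
\end{aligned}

Summing the two lines and multiplying by $\frac{\log\frac{1}{\delta}}{\varepsilon^{2}}$ gives exactly the expression in Equation 1.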

Finally, we note that while the algorithm of Theorem 1.4 runs in time polynomial in $1/\delta$ and $1/\varepsilon^{2}$, it has an exponential dependency on the number of samples from $\mathbbm{P}$, similar to [BCD20].

1.2 Paper Outline

After preliminaries in Section 2, we give a high-level overview of the techniques behind our results in Section 3. We then formally prove Theorem 1.3 and Theorem 1.4 in Section 4 and Section 5, respectively. Finally, we conclude with an open problem in Section 6.

2 Preliminaries

2.1 Notation

We denote the set of natural numbers by $\mathbb{N}$, and all logarithms refer to the natural $\log$. Distributions are written as $\mathbbm{P}$, $\mathbbm{Q}$ and graphs in calligraphic letters, e.g., $\mathcal{G},\mathcal{H},\mathcal{K}$. For variables/nodes, we use capital letters, small letters for the values taken by them, and boldface versions for a collection of variables, e.g., $X=x$ and $\bm{X}=\bm{x}$. As shorthands, we write $[n]$ for $\{1,\ldots,n\}$ and $\mathbbm{P}(\bm{x})$ for $\mathbbm{P}(\bm{X}=\bm{x})$. We will often represent the same set of variables of distributions as nodes in a graph.

Problems and algorithms are named in the typewriter font in full caps and capitalized, respectively, e.g., PROBLEM and Algorithm.

We will also often use $\Sigma$ to denote the alphabet set of a variable and write $\Delta_{|\Sigma|}$ to denote the corresponding (conditional) probability simplex.

2.2 Graph-Theoretical Notions

Let $\mathcal{G}=(\bm{X},\bm{E})$ be a fully directed graph on $|\bm{X}|=n$ vertices and $|\bm{E}|$ edges where adjacencies are denoted with dashes, e.g., $X-Y$ , and arc directions are denoted with arrows, e.g., $X\to Y$ . For any node $X\in\bm{X}$ , we write $\mathbf{Pa}_{\mathcal{G}}(X)\subseteq\bm{X}$ to denote its parents and $\mathbf{pa}_{\mathcal{G}}(X)$ to denote the values they take.

A degree sequence of a graph $\mathcal{G}$ on vertex set $\bm{X}=\{X_{1},\ldots,X_{n}\}$ is a list of degrees $\bm{d}=(d_{1},\ldots,d_{n})$ of all vertices in the graph, i.e., vertex $X_{i}$ has degree $d_{i}$ . A graph $\mathcal{H}=(\bm{X},\bm{E})$ is said to realize degree sequence $\bm{d}$ if the degrees of $\bm{X}$ in $\mathcal{H}$ agree with $\bm{d}$ . Realizability is defined in a similar fashion for in-degree sequences $\bm{d}^{-}=(d_{1}^{-},\ldots,d_{n}^{-})$ .

The graph $\mathcal{G}$ is called a directed acyclic graph (DAG) if it does not contain any directed cycles and is said to be complete if for every two of its nodes $U,V\in\bm{X}$ either there is an edge $V\to U$ or an edge $U\to V$ , i.e., the underlying undirected graph is a clique. A vertex $V_{i}$ on any simple path $V_{1}-\ldots-V_{k}$ is called a collider if the arcs are such that $V_{i-1}\to V_{i}\leftarrow V_{i+1}$ .

A Bayesian network (or Bayes net) $\mathcal{G}$ for a set of $n$ variables $X_{1},\ldots,X_{n}$ is described by a DAG $(\bm{X},\bm{E})$ and $n$ corresponding conditional probability tables (CPTs), e.g., the CPT for $X_{i}\in\bm{X}$ describes $\mathbbm{P}(x_{i}\mid\mathbf{pa}_{\mathcal{G}}(X_{i}))$ for all possible values of $x_{i}$ and $\mathbf{pa}_{\mathcal{G}}(X_{i})$ . The joint distribution for $\mathbbm{P}$ factorizes as

\mathbbm{P}(\bm{x})=\prod_{i=1}^{n}\mathbbm{P}(x_{i}\mid\mathbf{pa}_{\mathcal{G}}(X_{i})),

and we say that $\mathcal{G}$ represents $\mathbbm{P}$ .

All independence constraints that hold in the joint distribution of a Bayes net that has underlying DAG $\mathcal{G}$ are exactly captured by the d-separation criterion [Pea88, Section 3.3.1]. Two nodes $X,Y\in\bm{X}$ are said to be d-separated in a DAG $\mathcal{G}=(\bm{X},\bm{E})$ given a set $\bm{Z}\subseteq\bm{X}\setminus\{X,Y\}$ if and only if there is no $\bm{Z}$-active path in $\mathcal{G}$ between $X$ and $Y$; a $\bm{Z}$-active path is a simple path $Q$ such that any vertex from $\bm{Z}$ on $Q$ occurs as a collider and any vertex from $\bm{X}\setminus\bm{Z}$ appears as a non-collider. Two nodes are d-connected if they are not d-separated. It is known that $X$ is d-separated from its non-descendants given its parents [Pea88, Section 3.3.1, Corollary 4].
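As a rough illustration of how d-separation can be tested mechanically, the Python sketch below uses the textbook moralization procedure instead of enumerating paths: restrict the DAG to the ancestors of $\{X,Y\}\cup\bm{Z}$, connect co-parents, drop edge directions, delete $\bm{Z}$, and test reachability. One assumption should be stated loudly: the sketch implements the standard d-separation criterion, under which a collider is also unblocked when one of its descendants lies in $\bm{Z}$. The dictionary encoding `parents` and the name `d_separated` are illustrative choices, not notation from this paper.

```python
from itertools import combinations


def d_separated(parents, x, y, z):
    """Return True if x and y are d-separated given z in the DAG `parents`.

    `parents` maps each node to an iterable of its parents (hypothetical
    encoding). Uses the standard criterion via the moralized ancestral graph.
    """
    z = set(z)

    # Step 1: collect x, y, z and all of their ancestors.
    relevant, stack = set(), [x, y, *z]
    while stack:
        v = stack.pop()
        if v not in relevant:
            relevant.add(v)
            stack.extend(parents.get(v, ()))

    # Step 2: moralize, i.e. link each node to its parents and link
    # every pair of co-parents, ignoring directions.
    adj = {v: set() for v in relevant}
    for v in relevant:
        ps = list(parents.get(v, ()))
        for p in ps:
            adj[v].add(p)
            adj[p].add(v)
        for a, b in combinations(ps, 2):
            adj[a].add(b)
            adj[b].add(a)

    # Step 3: search from x while never entering a node of z.
    seen, stack = {x}, [x]
    while stack:
        v = stack.pop()
        if v == y:
            return False  # x reaches y, so they are d-connected
        for w in adj[v]:
            if w not in seen and w not in z:
                seen.add(w)
                stack.append(w)
    return True
```

On the left DAG of Figure 1, encoded as `{"X1": [], "X2": ["X1"], "X3": ["X2", "X5"], "X4": ["X3"], "X5": []}`, the call `d_separated(g, "X1", "X4", {"X3"})` holds, whereas conditioning on the collider `X3` makes `X2` and `X5` d-connected.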

A probability distribution $\mathbbm{P}$ is said to be Markov with respect to a DAG $\mathcal{G}$ if d-separation in $\mathcal{G}$ implies conditional independence in $\mathbbm{P}$ . Note that any distribution is Markov with respect to the complete DAG, since there are no d-separations implied by this kind of DAG. (Moreover, any Bayes net over a complete DAG requires $2^{|\bm{X}|}-1$ parameters to describe.) A probability distribution $\mathbbm{P}$ is said to be Markov with respect to a Bayes net $\mathcal{G}$ if $\mathbbm{P}$ is Markov with respect to the underlying DAG of $\mathcal{G}$ .

2.3 Some Problems of Interest

[CHM04] introduced a decision problem about learning Bayes nets from data, called LEARN, and proved that LEARN is $\mathsf{NP}$ -hard by showing a reduction from the $\mathsf{NP}$ -hard decision problem degree-bounded feedback arc set (DBFAS). See Definition 2.1 and Definition 2.2.

Definition 2.1 (The DBFAS decision problem).

Given a directed graph $\mathcal{G}=(\bm{X},\bm{E})$ with maximum vertex degree of $3$, and a positive integer $k\leq|\bm{E}|$, determine whether there is a subset of edges $\bm{E}^{\prime}\subseteq\bm{E}$ of size $|\bm{E}^{\prime}|\leq k$ such that $\bm{E}^{\prime}$ contains at least one directed edge from every directed cycle in $\mathcal{G}$.
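To make the definition concrete, here is a small brute-force Python sketch of the feedback arc set question. It relies on the observation that $\bm{E}^{\prime}$ hits every directed cycle exactly when removing $\bm{E}^{\prime}$ leaves an acyclic graph. The function names are hypothetical, the degree bound of $3$ is not enforced, and the search is exponential in $k$, so this is only meant for toy inputs.

```python
from graphlib import CycleError, TopologicalSorter
from itertools import combinations


def is_acyclic(nodes, edges):
    """True if the directed graph (nodes, edges) has no directed cycle."""
    preds = {v: set() for v in nodes}
    for u, v in edges:
        preds[v].add(u)
    try:
        # static_order() raises CycleError once a cycle is detected.
        tuple(TopologicalSorter(preds).static_order())
        return True
    except CycleError:
        return False


def has_feedback_arc_set(nodes, edges, k):
    """Brute force: is there a set of at most k edges hitting every cycle?"""
    edges = list(edges)
    for size in range(min(k, len(edges)) + 1):
        for removed in combinations(edges, size):
            kept = [e for e in edges if e not in removed]
            if is_acyclic(nodes, kept):
                return True
    return False
```

For a directed triangle on three nodes, one removed edge suffices, so the answer is positive for $k=1$ and negative for $k=0$.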

Definition 2.2 (The LEARN decision problem).

Given variables $\bm{X}=(X_{1},\ldots,X_{n})$ , a probability distribution $\mathbbm{P}$ over $\bm{X}$ , and a parameter bound $p\in\mathbb{N}$ , determine whether there exists a Bayes net $\mathcal{G}$ with at most $p$ parameters such that $\mathbbm{P}$ is Markov with respect to $\mathcal{G}$ .

In our work, we focus on the particular LEARN instances $(\bm{X},\mathbbm{P},p)$ used in [CHM04], namely LEARN-DBFAS.

Definition 2.3 (The LEARN-DBFAS decision problem).

Let $R$ denote the reduction of [CHM04] from DBFAS to LEARN. We define as LEARN-DBFAS the set of instances of LEARN that are in the range of $R$ .

That is, for any instance $I_{L}$ of LEARN-DBFAS, there is some instance $I_{D}$ of DBFAS such that $R\!\left(I_{D}\right)=I_{L}$ .

An independence oracle for a distribution $\mathbbm{P}$ is an oracle that can determine, in constant time, whether or not $U\perp\!\!\!\perp V\mid\bm{Z}$ for any $U,V\in\bm{X}$ and $\bm{Z}\subseteq\bm{X}\setminus\{U,V\}$ . We will use the following result by [CHM04].

Theorem 2.4 ([CHM04]).

LEARN-DBFAS is $\mathsf{NP}$ -hard even when one is given access to an independence oracle.

2.4 Selecting a Close Distribution With Finite Samples

The classic method to select an approximate distribution amongst a set of candidate distributions is via the Scheffé tournament of [DL01], which provides a logarithmic dependency on the number of candidates.

In our work, we will use the Scheffé-based algorithm of [DK14],² which, given sample access to an input distribution and explicit access to some candidate distributions, outputs with high probability a candidate distribution that is sufficiently close, in total variation (TV) distance ($d_{\mathrm{TV}}$), to the input distribution.

² Their result is actually more general than what we state here. For instance, they only require sample access to the distributions in $\boldsymbol{\mathbb{Q}}=\{\mathbbm{Q}_{1},\ldots,\mathbbm{Q}_{m}\}$, while our setting is simpler as we have explicit descriptions of each of these distributions.

Theorem 2.5 ([DK14]).

Fix any accuracy parameter $\varepsilon>0$ and confidence parameter $\delta>0$. Suppose there is a distribution $\mathbbm{P}$ over variables $\bm{X}$ and a collection of explicit distributions $\boldsymbol{\mathbb{Q}}=\{\mathbbm{Q}_{1},\ldots,\mathbbm{Q}_{m}\}$, where each distribution $\mathbbm{Q}_{i}$ is defined over the same set $\bm{X}$ and there exists some $\mathbbm{Q}^{*}\in\boldsymbol{\mathbb{Q}}$ such that $d_{\mathrm{TV}}(\mathbbm{P},\mathbbm{Q}^{*})\leq\varepsilon$. Then, there is an algorithm that uses $\mathcal{O}\left(\frac{\log 1/\delta}{\varepsilon^{2}}\log m\right)$ samples from $\mathbbm{P}$ and returns some $\mathbbm{Q}\in\boldsymbol{\mathbb{Q}}$ such that $d_{\mathrm{TV}}(\mathbbm{P},\mathbbm{Q})\leq 10\varepsilon$ with success probability at least $1-\delta$ and running time $\mathrm{poly}(m,1/\delta,1/\varepsilon^{2})$.
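To give a feel for how Scheffé-style selection works, the following Python sketch runs a plain round-robin tournament over explicit discrete candidates. It is in the spirit of the classic tournament of [DL01] and is not the algorithm of [DK14], so it should not be read as achieving the guarantees of Theorem 2.5. The representation of distributions as dictionaries and the name `scheffe_tournament` are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def scheffe_tournament(candidates, samples):
    """Return the index of the candidate winning the most pairwise contests.

    `candidates` is a list of dicts mapping outcomes to probabilities;
    `samples` is a non-empty list of outcomes drawn from the unknown P.
    """
    if not samples:
        raise ValueError("at least one sample is required")
    n = len(samples)
    freq = Counter(samples)
    wins = [0] * len(candidates)

    for i, j in combinations(range(len(candidates)), 2):
        qi, qj = candidates[i], candidates[j]
        # Scheffe set: outcomes where Q_i puts strictly more mass than Q_j.
        a = {x for x in set(qi) | set(qj) if qi.get(x, 0.0) > qj.get(x, 0.0)}
        p_hat = sum(freq[x] for x in a) / n          # empirical P(A)
        qi_a = sum(qi.get(x, 0.0) for x in a)        # Q_i(A)
        qj_a = sum(qj.get(x, 0.0) for x in a)        # Q_j(A)
        # The candidate whose mass on A is closer to the empirical mass wins.
        if abs(qi_a - p_hat) <= abs(qj_a - p_hat):
            wins[i] += 1
        else:
            wins[j] += 1

    return max(range(len(candidates)), key=wins.__getitem__)
```

Each contest only needs the empirical mass of a single set, which is why the number of samples in such tournaments scales with $\log m$ and not with the domain size.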

To curate a set of candidates $\boldsymbol{\mathbb{Q}}$ , we rely on the following lemma of [BCD20] which states that any distribution $\mathbbm{Q}$ which approximately agrees with $\mathbbm{P}$ on the local conditional distribution at each node will be close in TV distance to $\mathbbm{P}$ on the entire domain.

Lemma 2.6 ([BCD20]).

Suppose $\mathbbm{P}$ and $\mathbbm{Q}$ are Bayes nets on the same DAG $\mathcal{G}=(\bm{X},\bm{E})$ with $n$ nodes. If

d_{\mathrm{TV}}\Big(\mathbbm{P}(X\mid\mathbf{Pa}_{\mathcal{G}}(X)=\sigma),\mathbbm{Q}(X\mid\mathbf{Pa}_{\mathcal{G}}(X)=\sigma)\Big)\leq\frac{\varepsilon}{n}

for all nodes $X\in\bm{X}$ and possible parent values $\sigma\in\Sigma^{|\mathbf{Pa}_{\mathcal{G}}(X)|}$ , then $d_{\mathrm{TV}}(\mathbbm{P},\mathbbm{Q})\leq\varepsilon$ .

Although there are infinitely many possible distributions, since we are satisfied with an approximately close distribution, one can discretize the space via an $\varepsilon$ -net.

Definition 2.7 ( $\varepsilon$ -nets; [Ver18]).

Fix a metric space $(\bm{T},d)$ . For any subset $\bm{K}\subseteq\bm{T}$ and $\varepsilon>0$ , a subset $\bm{N}\subseteq\bm{K}$ is called an $\varepsilon$ -net of $\bm{K}$ if every point in $\bm{K}$ is within distance $\varepsilon$ to some point in $\bm{N}$ . That is, $\forall x\in\bm{K},\exists x_{0}\in\bm{N}$ such that $d(x,x_{0})\leq\varepsilon$ . We say that $\bm{N}$ $\varepsilon$ -covers $\bm{K}$ .

As we shall see in Section 5, the candidate set $\boldsymbol{\mathbb{Q}}$ will be created by computing an $\frac{\varepsilon}{n}$ -net with respect to the TV distance and then applying Lemma 2.6 suitably.
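One simple way to obtain such a net for a single probability simplex is a uniform grid; the Python sketch below is an illustration under that assumption and is not claimed to be the construction used in Section 5. For $k$ outcomes it keeps all distributions whose entries are multiples of $1/m$. A standard rounding argument shows that every distribution is within TV distance $k/(2m)$ of such a grid point, so $m=\lceil k/(2\varepsilon)\rceil$ suffices. The function names are hypothetical.

```python
import math


def simplex_grid(k, m):
    """All distributions on k outcomes whose entries are multiples of 1/m."""
    def compositions(total, parts):
        # Enumerate tuples of `parts` non-negative integers summing to `total`.
        if parts == 1:
            yield (total,)
            return
        for first in range(total + 1):
            for rest in compositions(total - first, parts - 1):
                yield (first,) + rest

    return [tuple(c / m for c in comp) for comp in compositions(m, k)]


def tv_net(k, eps):
    """Grid-based eps-net (in TV distance) of the simplex on k outcomes."""
    m = math.ceil(k / (2 * eps))
    return simplex_grid(k, m)


def tv_distance(p, q):
    """Total variation distance between two distributions given as tuples."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))
```

For example, `tv_net(2, 0.25)` contains the five points $(0,1),(0.25,0.75),\ldots,(1,0)$, and the distribution $(0.3,0.7)$ is within TV distance $0.05$ of the grid point $(0.25,0.75)$.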

2.5 Other Related Work

We have already referred to some papers that are relevant to our work. We resume this discussion here. [Das99] considers the task of learning the maximum-likelihood polytree from data. The main result of this paper is that the optimal branching (or Chow-Liu tree) is a good approximation to the best polytree. This result is then complemented by the observation that this learning problem is $\mathsf{NP}$ -hard, even to approximately solve within some constant factor.

[TK05] propose a simple heuristic method for addressing the task of learning Bayes nets. Their approach is based on the fact that the best network (of bounded in-degree) consistent with a given node ordering can be found efficiently.

[EG08] present a method for learning Bayes nets of bounded treewidth that employs global structure modifications and that is polynomial both in the size of the graph and the treewidth bound. At the heart of their method is a dynamic triangulation, that they update in a way which facilitates the addition of chain structures that increase the bound on the model’s treewidth by at most one.

[FNP13] introduce an algorithm that achieves learning by restricting the search space. Their iterative algorithm restricts the parents of each variable to belong to a small subset of candidates. They then search for a network that satisfies these constraints and the learned network is then used for selecting better candidates for the next iteration.

[GK21] investigate the parameterized complexity of Bayesian Network Structure Learning (BNSL). They show that parameterizing BNSL by the size of a feedback edge set yields fixed-parameter tractability.

[KSM22] combine constraint-based methods with greedy or Markov chain Monte Carlo (MCMC) schemes in a method which reduces the complexity of MCMC approaches to that of a constraint-based method.

3 Technical Overview

Here, we give a brief high-level overview of the techniques used in our results of Theorem 1.3 and Theorem 1.4.

3.1 $\mathsf{NP}$ -Hardness of the Realizable Case

By Theorem 2.4, it would suffice to prove that the existence of a polynomial time algorithm for REALIZABLE-LEARN implies that LEARN-DBFAS instances can be solved in polynomial time if one has access to an independence oracle. The desired result will then follow from the facts that we can efficiently $(a)$ compute the number of parameters of a Bayes net and $(b)$ decide whether a given distribution is Markov with respect to a given Bayes net (when given access to an independence oracle).

Suppose we have a polynomial time algorithm Learner for REALIZABLE-LEARN. Note that it is conventional to assume that such an algorithm always halts within some polynomial-time bound, and outputs some Bayes net, even when the respective promise is violated. We define and analyze the following reduction:

Given an arbitrary instance $(\bm{X},\mathbbm{P},p)$ of LEARN-DBFAS, run Learner to obtain a Bayes net $\mathcal{G}$. Then check whether $\mathcal{G}$ has at most $p$ parameters and (while using an independence oracle) check whether or not $\mathbbm{P}$ is Markov with respect to $\mathcal{G}$. If both of these checks are positive, then output YES. Otherwise, output NO. See Section 4 for the formal proof of Theorem 1.3.

3.2 Approximately Learning Parameter-Bounded Bayes Networks

The main idea is to construct an $\varepsilon$ -net over all possible DAGs that satisfy the parameter upper bound $p$ , and then apply a well-known bound from the density estimation literature.

For this purpose, we need to count all possible Bayes nets that satisfy the parameter upper bound $p$ . By a counting argument, we see that there are not many possible DAGs that give rise to some Bayes net of at most $p$ parameters. Then, by a counting argument again, we see that there are only a few conditional distributions that are Markov with respect to a Bayes net $\mathcal{G}$ over a DAG that realizes a given in-degree sequence. Thus we are able to bound the number of distributions that cover all possible conditional distributions which are Markov with respect to $\mathcal{G}$ . See Section 5 for the formal proof of Theorem 1.4.

4 REALIZABLE-LEARN is $\mathsf{NP}$ -hard

To show that REALIZABLE-LEARN is hard, we reduce LEARN-DBFAS to REALIZABLE-LEARN by making polynomially-many calls to an independence oracle. Given any polynomial time algorithm Learner that solves REALIZABLE-LEARN, we will forward the LEARN-DBFAS instance to Learner and examine the produced Bayes net $\mathcal{G}$ . We will describe a polynomial time procedure Reduction that uses an independence oracle to determine whether we should correctly output YES or NO for the given LEARN-DBFAS instance. See Figure 2 for a pictorial illustration of our reduction strategy.

Figure 2: [Gav77] showed that DBFAS is $\mathsf{NP}$-hard and [CHM04] showed that LEARN-DBFAS is $\mathsf{NP}$-hard, even when given access to an independence oracle for $\mathbbm{P}$. REALIZABLE-LEARN is a variant of LEARN-DBFAS with the additional promise that there exists a Bayes net $\mathcal{G}$ with at most $p$ parameters such that $\mathbbm{P}$ is Markov with respect to $\mathcal{G}$. In this work, we show that if one can learn such a Bayes net $\mathcal{G}$ (via some blackbox polynomial time algorithm Learner), then there is a polynomial time algorithm Reduction that correctly answers LEARN-DBFAS. Therefore, REALIZABLE-LEARN is also $\mathsf{NP}$-hard.

We begin by observing that one can easily compute the number of parameters of a Bayes net given its full description.

Lemma 4.1.

Given a Bayes net over $\mathcal{G}=(\bm{X},\bm{E})$ , one can compute the number of its parameters in polynomial time.

Proof.

Let $\Sigma_{X}$ denote the alphabet set of $X\in\bm{X}$ . Then, the number of parameters of $\mathcal{G}$ is

\sum_{X\in\bm{X}}\left((|\Sigma_{X}|-1)\prod_{U\in\mathbf{Pa}_{\mathcal{G}}(X)}|\Sigma_{U}|\right),

which can be computed in polynomial time. ∎
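The formula in the proof of Lemma 4.1 translates directly into a few lines of Python. In the sketch below, the dictionary encodings `parents` and `alphabet_size` and the function name `count_parameters` are illustrative assumptions; the example DAG is the left Bayes net of Figure 1 with Boolean variables.

```python
from math import prod


def count_parameters(parents, alphabet_size):
    """Number of free CPT parameters of a Bayes net.

    `parents[x]` lists the parents of node x and `alphabet_size[x]` is
    |Sigma_x|. Each node contributes (|Sigma_x| - 1) free numbers per
    joint assignment of its parents.
    """
    return sum(
        (alphabet_size[x] - 1) * prod(alphabet_size[u] for u in parents[x])
        for x in parents
    )


# Left DAG of Figure 1: X1 -> X2 -> X3 <- X5 and X3 -> X4.
example_parents = {
    "X1": [],
    "X2": ["X1"],
    "X3": ["X2", "X5"],
    "X4": ["X3"],
    "X5": [],
}
example_sizes = {x: 2 for x in example_parents}

# Per-node contributions are 1 + 2 + 4 + 2 + 1.
print(count_parameters(example_parents, example_sizes))
```

The computation touches each node and each edge once, which is the polynomial running time claimed by the lemma.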

The following notion of important edges will come in handy in the sequel.

Definition 4.2 (Important edges).

Let $\mathcal{G}=(\bm{V},\bm{E})$ be a DAG and $\mathbbm{P}$ be a distribution over $\bm{V}$ . Then, an edge $e\in\bm{E}$ is called $(\mathcal{G},\mathbbm{P})$ -important if $\mathbbm{P}$ is Markov with respect to $\mathcal{G}$ but is not Markov with respect to $\mathcal{G}^{\prime}=(\bm{V},\bm{E}\setminus\{e\})$ .

To check whether $\mathbbm{P}$ is Markov with respect to $\mathcal{G}$ , one could verify that any d-separation in $\mathcal{G}$ implies conditional independence in $\mathbbm{P}$ . However, this computation seems to be intractable. In contrast, Corollary 4.5 gives a polynomial time algorithm that checks this while using an independence oracle.

The correctness of Corollary 4.5 follows from Lemma 4.3 and Lemma 4.4.

Lemma 4.3.

Suppose $\mathbbm{P}$ on variables $\bm{X}$ is Markov with respect to $\mathcal{G}=(\bm{X},\bm{E})$ . Then, an edge $A\to B$ in $\bm{E}$ is not $(\mathcal{G},\mathbbm{P})$ -important if $A\perp\!\!\!\perp B\mid\mathbf{Pa}_{\mathcal{G}}(B)\setminus\{A\}$ .

Proof.

Consider an arbitrary edge $A\to B$ in $\bm{E}$ such that $A\perp\!\!\!\perp B\mid\mathbf{Pa}_{\mathcal{G}}(B)\setminus\{A\}$. Say, $A=X_{j}$ and $B=X_{k}$. Letting $\mathcal{G}^{\prime}=(\bm{X},\bm{E}\setminus\{(A,B)\})$ be the subgraph of $\mathcal{G}$ that does not contain the edge $A\to B$, we see that

\begin{aligned}
\mathbbm{P}(\bm{x})&=\prod_{i=1}^{n}\mathbbm{P}(x_{i}\mid\mathbf{pa}_{\mathcal{G}}(X_{i}))&&(\ast)\\
&=\mathbbm{P}(x_{k}\mid\mathbf{pa}_{\mathcal{G}}(X_{k}))\cdot\prod_{i\in[n]\setminus k}\mathbbm{P}(x_{i}\mid\mathbf{pa}_{\mathcal{G}}(X_{i}))\\
&=\mathbbm{P}(x_{k}\mid\mathbf{pa}_{\mathcal{G}}(X_{k})\setminus x_{j})\cdot\prod_{i\in[n]\setminus k}\mathbbm{P}(x_{i}\mid\mathbf{pa}_{\mathcal{G}}(X_{i}))&&({\dagger})\\
&=\mathbbm{P}(x_{k}\mid\mathbf{pa}_{\mathcal{G}^{\prime}}(X_{k}))\cdot\prod_{i\in[n]\setminus k}\mathbbm{P}(x_{i}\mid\mathbf{pa}_{\mathcal{G}^{\prime}}(X_{i}))&&({\ddagger})\\
&=\prod_{i=1}^{n}\mathbbm{P}(x_{i}\mid\mathbf{pa}_{\mathcal{G}^{\prime}}(X_{i})),
\end{aligned}

where $(\ast)$ is due to $\mathbbm{P}$ being Markov with respect to $\mathcal{G}$, $({\dagger})$ is due to $X_{j}\perp\!\!\!\perp X_{k}\mid\mathbf{Pa}_{\mathcal{G}}(X_{k})\setminus\{X_{j}\}$, and $({\ddagger})$ is due to the definition of $\mathcal{G}^{\prime}$. Since $\mathbbm{P}(\bm{x})=\prod_{i=1}^{n}\mathbbm{P}(x_{i}\mid\mathbf{pa}_{\mathcal{G}^{\prime}}(X_{i}))$, we see that $\mathbbm{P}$ is also Markov with respect to $\mathcal{G}^{\prime}$, and so the edge $A\to B$ is not $(\mathcal{G},\mathbbm{P})$-important. ∎

Lemma 4.4.

Suppose a distribution $\mathbbm{P}$ on variables $\bm{X}$ is Markov with respect to a DAG $\mathcal{G}=(\bm{X},\bm{E})$ . Let $\mathcal{G}^{\prime}=(\bm{X},\bm{E}^{\prime})$ be an edge-induced DAG of $\mathcal{G}$ with $\bm{E}^{\prime}\subseteq\bm{E}$ . Then, $\mathbbm{P}$ is Markov with respect to $\mathcal{G}^{\prime}$ if and only if $A\perp\!\!\!\perp B\mid\mathbf{Pa}_{\mathcal{G}}(B)\setminus\{A\}$ for all edges $A\to B$ in $\bm{E}\setminus\bm{E}^{\prime}$ .

Proof.

We prove each direction separately.

( $\Rightarrow$ )

Suppose that $\mathbbm{P}$ is Markov with respect to $\mathcal{G}^{\prime}$. Consider an arbitrary edge $A\to B\in\bm{E}\setminus\bm{E}^{\prime}$. Since $A$ is an ancestor of $B$ in $\mathcal{G}$ and $\mathcal{G}$ is acyclic, $A$ is a non-descendant of $B$ in $\mathcal{G}$, and it remains a non-descendant of $B$ in $\mathcal{G}^{\prime}$ after removing the edge $A\to B$. So, $A$ and $B$ are d-separated in $\mathcal{G}^{\prime}$ given $\mathbf{Pa}_{\mathcal{G}^{\prime}}(B)\setminus\{A\}$, and so $A\perp\!\!\!\perp B\mid\mathbf{Pa}_{\mathcal{G}^{\prime}}(B)$ by the Markov property. That is, $A\perp\!\!\!\perp B\mid\mathbf{Pa}_{\mathcal{G}}(B)\setminus\{A\}$.

( $\Leftarrow$ )

Suppose that $A\perp\!\!\!\perp B\mid\mathbf{Pa}_{\mathcal{G}}(B)\setminus\{A\}$ for all edges $A\to B$ in $\bm{E}\setminus\bm{E}^{\prime}$. Order the edges in $\bm{E}\setminus\bm{E}^{\prime}$ in an arbitrary sequence, say $e_{1},\ldots,e_{|\bm{E}\setminus\bm{E}^{\prime}|}$. Let us remove these edges sequentially, resulting in a sequence of edge-induced DAGs $\mathcal{G}=\mathcal{G}_{0},\mathcal{G}_{1},\ldots,\mathcal{G}_{|\bm{E}\setminus\bm{E}^{\prime}|}=\mathcal{G}^{\prime}$, where $\mathcal{G}_{i}$ is the edge-induced DAG obtained from removing edges $\{e_{1},\ldots,e_{i}\}$ from $\mathcal{G}$. Observe that non-descendant relationships are preserved as we remove edges, i.e., if $A$ is a non-descendant of $B$ in $\mathcal{G}$, then it is also a non-descendant of $B$ in $\mathcal{G}_{i}$ for any $i\in\{1,\ldots,|\bm{E}\setminus\bm{E}^{\prime}|\}$. So, we can apply Lemma 4.3 repeatedly: For any $i\in\{1,\ldots,|\bm{E}\setminus\bm{E}^{\prime}|\}$, the edge $e_{i}$ is not $(\mathcal{G}_{i-1},\mathbbm{P})$-important, so $\mathbbm{P}$ is Markov with respect to $\mathcal{G}_{i}$. That is, $\mathbbm{P}$ is Markov with respect to $\mathcal{G}^{\prime}$. ∎

Corollary 4.5.

Suppose $\mathbbm{P}$ is a distribution over $\bm{X}$ and $\mathcal{G}$ is a Bayes net over the same set of variables $\bm{X}$ . Then, there is a polynomial time algorithm that uses an independence oracle for $\mathbbm{P}$ to decide whether or not $\mathbbm{P}$ is Markov with respect to $\mathcal{G}$ .

Proof.

Consider the following algorithm Checker:

Given Bayes net $\mathcal{G}$ over a DAG $(\bm{X},\bm{E}^{\prime})$ , consider a DAG $\mathcal{K}=(\bm{X},\bm{E})$ , with $\bm{E}^{\prime}\subseteq\bm{E}$ , that is a complete supergraph of $(\bm{X},\bm{E}^{\prime})$ . If every edge $A\to B\in\bm{E}\setminus\bm{E}^{\prime}$ satisfies $A\perp\!\!\!\perp B\mid\mathbf{Pa}_{\mathcal{G}}(B)\setminus\{A\}$ , output YES. Otherwise, output NO.

Note that since any distribution is Markov with respect to the complete DAG (see Section 2.2), $\mathbbm{P}$ is Markov with respect to $\mathcal{K}$ .

The correctness of Checker follows from Lemma 4.4. Checker runs in polynomial time as $\mathcal{K}$ can be created in polynomial time with respect to the size of $\bm{X}$, and the number of edges in $\bm{E}\setminus\bm{E}^{\prime}$ to check is polynomial in the size of $\bm{X}$. ∎

We are now ready to formally prove our first main result.

See 1.3

Proof.

It suffices to show that the existence of a polynomial-time algorithm for REALIZABLE-LEARN implies that LEARN-DBFAS instances can be answered in polynomial time with access to an independence oracle; see Figure 2.

Suppose we have a polynomial time algorithm Learner for REALIZABLE-LEARN. Let us define a reduction algorithm Reduction as follows:

Given an instance $(\bm{X},\mathbbm{P},p)$ of LEARN-DBFAS, run Learner to obtain a Bayes net $\mathcal{G}$ (see Section 3.1; this is a natural assumption for algorithms solving a search promise problem). Compute the number of parameters of $\mathcal{G}$ . Run algorithm Checker of Corollary 4.5 on $\mathcal{G}$ to check whether or not $\mathbbm{P}$ is Markov with respect to $\mathcal{G}$ . Output YES if $\mathcal{G}$ has at most $p$ parameters and $\mathbbm{P}$ is Markov with respect to $\mathcal{G}$ ; else, output NO.

The correctness of Reduction follows from the assumption that Learner produces a Bayes net $\mathcal{G}$ with at most $p$ parameters such that $\mathbbm{P}$ is Markov with respect to $\mathcal{G}$ (if the underlying promise is satisfied; otherwise, the output is an arbitrary Bayes net), and the correctness of Checker. By assumption, Learner is a polynomial time algorithm. By Lemma 4.1, we can compute the number of parameters of $\mathcal{G}$ in polynomial time. By Corollary 4.5, Checker is also a polynomial time algorithm. Therefore, the overall running time for Reduction is polynomial. ∎
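The control flow of Reduction can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a DAG is encoded as a dictionary from each node to its list of parents, every node is assumed to take the same number of values, and `learner` and `checker` are hypothetical callables standing in for Learner and Checker (neither is specified at this level of detail in the text).

```python
from typing import Callable, Dict, List

# Assumed encoding: a DAG is a dict mapping each node to its list of parents.
Dag = Dict[str, List[str]]


def count_parameters(dag: Dag, alphabet_size: int) -> int:
    """Parameter count when every node takes `alphabet_size` values:
    (|Sigma| - 1) * |Sigma|^(in-degree), summed over all nodes."""
    return sum((alphabet_size - 1) * alphabet_size ** len(parents)
               for parents in dag.values())


def reduction(p: int,
              alphabet_size: int,
              learner: Callable[[], Dag],
              checker: Callable[[Dag], bool]) -> bool:
    """Answer a LEARN-DBFAS instance with parameter bound p.

    `learner` and `checker` are hypothetical stand-ins for Learner and
    Checker; the checker is assumed to query the independence oracle itself.
    """
    dag = learner()                       # arbitrary output if the promise fails
    within_budget = count_parameters(dag, alphabet_size) <= p
    is_markov = checker(dag)              # is P Markov with respect to dag?
    return within_budget and is_markov    # True means YES, False means NO
```

Because the answer is validated by the parameter count and by Checker, an arbitrary output of `learner` on a NO instance cannot cause a wrong YES.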

5 Approximating Bayes Nets

Our strategy for proving our finite sample complexity result (Theorem 1.4) follows that of [BCD20, Theorem 10], but we specialize the analysis to the setting where we are given a parameter bound instead of a degree bound. As discussed in Section 1, our result is a generalization of their result since an upper bound on the in-degrees implies a (possibly loose) parameter upper bound.

5.1 Some Graph Counting Arguments

To prove Theorem 1.4, we require an upper bound on the number of possible Bayes nets on $n$ nodes that have at most $p$ parameters (Lemma 5.2). To obtain such a result, we first relate the number of parameters $p$ to a given in-degree sequence $(d_{1}^{-},\ldots,d_{n}^{-})$ of a Bayes net, then we upper bound the total number of Bayes nets that have at most $p$ parameters by summing over all suitable in-degree sequences $\bm{d}^{-}=(d_{1}^{-},\ldots,d_{n}^{-})$.

Consider an arbitrary Bayes net $\mathcal{G}$ with in-degree sequence $(d_{1}^{-},\ldots,d_{n}^{-})$ and each node taking on $|\Sigma|$ values. The conditional distribution for vertex $X_{i}$ is fully described once we know $\mathbbm{P}(x_{i}\mid\mathbf{pa}_{\mathcal{G}}(X_{i}))$ for $|\Sigma|-1$ possible values of $x_{i}$, for each of the $\left|\Sigma\right|^{d_{i}^{-}}$ possible values of $\mathbf{pa}_{\mathcal{G}}(X_{i})$. Therefore, we see that the Bayes net has

\sum_{i=1}^{n}\left(\left(|\Sigma|-1\right)|\Sigma|^{d_{i}^{-}}\right)=\left(|\Sigma|-1\right)\left(\sum_{i=1}^{n}|\Sigma|^{d_{i}^{-}}\right)

parameters. Note that this is the exact same reasoning as in Lemma 4.1. So, if the Bayes net has at most $p$ parameters, then

\sum_{i=1}^{n}|\Sigma|^{d_{i}^{-}}=|\Sigma|^{d_{1}^{-}}+\ldots+|\Sigma|^{d_{n}^{-}}\leq\frac{p}{|\Sigma|-1}.

(2)

By the AM-GM inequality, we have that

\sum_{i=1}^{n}|\Sigma|^{d_{i}^{-}}\geq n\left(\prod_{i=1}^{n}|\Sigma|^{d_{i}^{-}}\right)^{\frac{1}{n}}=n|\Sigma|^{\frac{1}{n}\sum_{i=1}^{n}d_{i}^{-}}.

(3)

Combining Equations 2 and 3 together gives us

d_{1}^{-}+\ldots+d_{n}^{-}\leq n\frac{\log\left(\frac{p}{n\left(|\Sigma|-1\right)}\right)}{\log|\Sigma|}.

(4)
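As a quick numerical sanity check of the parameter count and of Equation 4, the following sketch evaluates both on the in-degree sequence $(0,1,2,1,0)$ of the five-variable Boolean example from the introduction. The function names are illustrative and not part of the paper.

```python
import math
from typing import Sequence


def num_parameters(in_degrees: Sequence[int], sigma: int) -> int:
    """(|Sigma| - 1) * sum_i |Sigma|^{d_i^-}, as derived above."""
    return (sigma - 1) * sum(sigma ** d for d in in_degrees)


def degree_sum_bound(p: int, n: int, sigma: int) -> float:
    """Right-hand side of Equation 4."""
    return n * math.log(p / (n * (sigma - 1))) / math.log(sigma)


# In-degrees of X1..X5 when X2 depends on X1, X3 on {X2, X5}, and X4 on X3.
in_degrees = [0, 1, 2, 1, 0]
p = num_parameters(in_degrees, sigma=2)            # 1 + 2 + 4 + 2 + 1 = 10
bound = degree_sum_bound(p, n=len(in_degrees), sigma=2)
print(p, sum(in_degrees), bound)
```

Here the Bayes net has 10 parameters, the in-degrees sum to 4, and Equation 4 permits a sum of at most $5\log_{2}(10/5)=5$.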

The following lemma is a combinatorial fact that upper bounds the number of DAGs realizing a given in-degree sequence; it may be of independent interest beyond its use in proving Lemma 5.2.

Lemma 5.1.

Given an in-degree sequence $\bm{d}^{-}=(d_{1}^{-},\ldots,d_{n}^{-})$ with non-negative integers $d_{1}^{-},\ldots,d_{n}^{-}$ , there are at most $\prod_{i=1}^{n}\binom{n-1}{d_{i}^{-}}$ DAGs that realize $\bm{d}^{-}$ .

Proof.

Fix an arbitrary labelling of vertices from $X_{1}$ to $X_{n}$ and consider the sequential process of adding incoming edges to $X_{1},\ldots,X_{n}$ in turn. For $X_{1}$, there are $\binom{n-1}{d_{1}^{-}}$ ways to add $d_{1}^{-}$ incoming edges that end at $X_{1}$. For each subsequent vertex $X_{i}$, there are at most $\binom{n-1}{d_{i}^{-}}$ possibilities; the count is only an upper bound because some of these choices are incompatible with earlier edge choices, as the newly added edges may create directed cycles. We repeat this edge-adding process until all vertices have added their incoming edges to the graph. So, the upper bound is $\prod_{i=1}^{n}\binom{n-1}{d_{i}^{-}}$. ∎
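For small $n$, Lemma 5.1 can be checked exhaustively. The sketch below enumerates all labelled DAGs on $n=4$ vertices, groups them by in-degree sequence, and compares each group size against $\prod_{i}\binom{n-1}{d_{i}^{-}}$. It is a brute-force illustration only; the helper names are hypothetical and the approach does not scale beyond very small $n$.

```python
from collections import Counter
from itertools import combinations, permutations
from math import comb, prod


def is_acyclic(n, edges):
    """Kahn's algorithm on nodes 0..n-1 with directed edges (u, v)."""
    indeg = [0] * n
    out = [[] for _ in range(n)]
    for u, v in edges:
        indeg[v] += 1
        out[u].append(v)
    stack = [v for v in range(n) if indeg[v] == 0]
    removed = 0
    while stack:
        u = stack.pop()
        removed += 1
        for v in out[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                stack.append(v)
    return removed == n   # every node removed iff there is no directed cycle


def dag_counts_by_in_degree(n):
    """Map each in-degree sequence to the number of labelled DAGs realizing it."""
    pairs = list(permutations(range(n), 2))   # all ordered pairs (u, v), u != v
    counts = Counter()
    for k in range(len(pairs) + 1):
        for edges in combinations(pairs, k):
            if is_acyclic(n, edges):
                seq = [0] * n
                for _, v in edges:
                    seq[v] += 1
                counts[tuple(seq)] += 1
    return counts


counts = dag_counts_by_in_degree(4)
# In-degree sequences whose DAG count exceeds the bound of Lemma 5.1 (expected: none).
violations = [seq for seq, c in counts.items()
              if c > prod(comb(3, d) for d in seq)]
print(sum(counts.values()), violations)
```

The total matches the known number of labelled DAGs on four vertices (543), and no in-degree sequence exceeds the product bound.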

Lemma 5.2.

Suppose that every node takes on at most $|\Sigma|$ values. Then, there are at most

\left(n-1\right)^{\frac{n\log\left(\frac{p}{n\left(|\Sigma|-1\right)}\right)}{\log\left|\Sigma\right|}}e^{n}\left(\frac{\log\left(\frac{p}{n\left(|\Sigma|-1\right)}\right)}{\log\left|\Sigma\right|}+1\right)^{n}

possible DAGs over $n$ nodes that may be used to define some Bayes net that has at most $p$ parameters.

Proof.

By Lemma 5.1, there are at most $\prod_{i=1}^{n}\binom{n-1}{d_{i}^{-}}$ possible DAGs realizing any fixed in-degree sequence $\bm{d}^{-}=(d_{1}^{-},\ldots,d_{n}^{-})$. Let $(\ast)$ denote the condition that an in-degree sequence $\bm{d}^{-}$ yields a graph that has at most $p$ parameters. Then,

\begin{aligned}
\sum_{\text{$\bm{d}^{-}$ satisfies $(\ast)$}}\prod_{i=1}^{n}\binom{n-1}{d_{i}^{-}}
&\leq\sum_{\text{$\bm{d}^{-}$ satisfies $(\ast)$}}\left(n-1\right)^{d_{1}^{-}+\ldots+d_{n}^{-}}\\
&\leq\left(n-1\right)^{n\frac{\log\left(\frac{p}{n\left(|\Sigma|-1\right)}\right)}{\log|\Sigma|}}\sum_{\text{$\bm{d}^{-}$ satisfies $(\ast)$}}1\\
&\leq\left(n-1\right)^{n\frac{\log\left(\frac{p}{n\left(|\Sigma|-1\right)}\right)}{\log|\Sigma|}}\binom{n\frac{\log\left(\frac{p}{n\left(|\Sigma|-1\right)}\right)}{\log|\Sigma|}+n}{n}\\
&\leq\left(n-1\right)^{n\frac{\log\left(\frac{p}{n\left(|\Sigma|-1\right)}\right)}{\log|\Sigma|}}\left(e\left(\frac{\log\left(\frac{p}{n\left(|\Sigma|-1\right)}\right)}{\log|\Sigma|}+1\right)\right)^{n}
\end{aligned}

where the first inequality is because $\binom{n}{k}\leq n^{k}$, the last inequality is because $\binom{n}{k}\leq\left(\frac{en}{k}\right)^{k}$, the second inequality is due to Equation 4, and the third inequality is obtained via standard “stars and bars” counting. That is, we introduce an auxiliary variable $d_{0}^{-}$ and count the number of non-negative integer solutions of

d_{0}^{-}+d_{1}^{-}+\ldots+d_{n}^{-}=n\frac{\log\left(\frac{p}{n\left(|\Sigma|-1\right)}\right)}{\log|\Sigma|}.

∎
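The “stars and bars” step can be illustrated numerically. The sketch below counts, by brute force, the sequences $(d_{1}^{-},\ldots,d_{n}^{-})$ of non-negative integers whose sum is at most an integer budget $D$, and compares the count with $\binom{D+n}{n}$ and with the relaxation $\left(e\left(\frac{D}{n}+1\right)\right)^{n}$ used in the last inequality. The integer budget and the variable names are assumptions made for the illustration; in the proof, $D$ is the right-hand side of Equation 4.

```python
from itertools import product
from math import comb, e


def count_sequences(n: int, total: int) -> int:
    """Brute-force count of (d_1, ..., d_n) with d_i >= 0 and sum <= total."""
    return sum(1 for seq in product(range(total + 1), repeat=n)
               if sum(seq) <= total)


n, total = 4, 6
brute = count_sequences(n, total)        # direct enumeration
closed_form = comb(total + n, n)         # stars and bars with the slack variable d_0
relaxed = (e * (total / n + 1)) ** n     # binom(N, k) <= (eN/k)^k with N = total + n, k = n
print(brute, closed_form, relaxed)
```

The slack variable $d_{0}^{-}$ absorbs the unused budget, which is why sequences with sum at most $D$ correspond to solutions of the equality with $n+1$ variables.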

5.2 Proof of Theorem 1.4

We are now ready to prove Theorem 1.4. Lemma 2.6 tells us that it suffices to approximate the local conditional distribution at each node well. So, we will consider an $\frac{\varepsilon}{n}$-net over all such distributions and then apply a tournament-style argument (Theorem 2.5) to pick a good candidate amongst the joint distributions obtained by combining such candidate local distributions.

See 1.4

Proof.

Fix a DAG $\mathcal{G}$ satisfying an arbitrary in-degree sequence $\bm{d}^{-}=\left(d_{1}^{-},\ldots,d_{n}^{-}\right)$ . Then, there are $|\Sigma|^{d_{1}^{-}}+\ldots+|\Sigma|^{d_{n}^{-}}$ local conditional distributions for any Bayes net over $\mathcal{G}$ . From Equation 2 above, we know that

\sum_{i=1}^{n}|\Sigma|^{d_{i}^{-}}=|\Sigma|^{d_{1}^{-}}+\ldots+|\Sigma|^{d_{n}^{-}}\leq\frac{p}{|\Sigma|-1}.

Now, consider an arbitrary local distribution over $k=|\Sigma|$ values and let us upper bound the number of points in an $\frac{\varepsilon}{n}$-net for this metric space. Observe that each possible distribution is essentially an element of the probability simplex $\Delta_{k}$. To get an $\frac{\varepsilon}{n}$-net of $\Delta_{k}$, we discretize vectors by rounding them to their nearest multiple of $\frac{\varepsilon}{n\left|\Sigma\right|}$. If $\pi$ is a probability vector, and $r_{\pi}$ is its rounding, then $\left\|\pi-r_{\pi}\right\|_{1}\leq\frac{\varepsilon}{n\left|\Sigma\right|}\left|\Sigma\right|=\frac{\varepsilon}{n}$. Therefore, the number of discretized vectors is at most $\mathcal{O}\left(\left({n\left|\Sigma\right|}/{\varepsilon}\right)^{|\Sigma|}\right)$.
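The rounding step can be made concrete with a minimal sketch. It rounds each coordinate of a probability vector to the nearest multiple of $\frac{\varepsilon}{n|\Sigma|}$ and measures the resulting $\ell_{1}$ error; the function names and the example vector are assumptions chosen for illustration.

```python
from typing import List


def round_to_grid(pi: List[float], eps: float, n: int) -> List[float]:
    """Round each coordinate to the nearest multiple of eps / (n * |Sigma|)."""
    step = eps / (n * len(pi))
    return [round(x / step) * step for x in pi]


def l1_distance(a: List[float], b: List[float]) -> float:
    return sum(abs(x - y) for x, y in zip(a, b))


pi = [0.503, 0.297, 0.2]      # a distribution over |Sigma| = 3 values
eps, n = 0.1, 5
r_pi = round_to_grid(pi, eps, n)
print(r_pi, l1_distance(pi, r_pi), eps / n)
```

Each coordinate moves by at most half a grid step, so the $\ell_{1}$ error stays within the $\frac{\varepsilon}{n}$ bound stated above. In general a rounded vector need not sum exactly to one; the sketch does not renormalize it.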

Therefore, for any fixed DAG $\mathcal{G}$ , there is a set of

m_{1}\in\mathcal{O}\left(\left({n\left|\Sigma\right|}/{\varepsilon}\right)^{\frac{p|\Sigma|}{|\Sigma|-1}}\right)

(5)

distributions that $\frac{\varepsilon}{n}$ -cover any possible joint distributions that can be Markov with respect to a Bayes net over $\mathcal{G}$ . Meanwhile, by Lemma 5.2, there are at most

\displaystyle m_{2}:=\left(n-1\right)^{\frac{n\log\left(\frac{p}{n\left(|\Sigma|-1\right)}\right)}{\log\left|\Sigma\right|}}e^{n}\left(\frac{\log\left(\frac{p}{n\left(|\Sigma|-1\right)}\right)}{\log\left|\Sigma\right|}+1\right)^{n}

(6)

possible DAGs that may be used to define a Bayes net on $n$ nodes that has at most $p$ parameters.

We can now define a set of distributions $\boldsymbol{\mathbb{Q}}$ over $n$ variables such that there exists $\mathbbm{Q}^{*}\in\boldsymbol{\mathbb{Q}}$ with $d_{\mathrm{TV}}(\mathbbm{P},\mathbbm{Q}^{*})\leq\varepsilon$. Let us denote $m=|\boldsymbol{\mathbb{Q}}|$. Putting together the above bounds, we see that $m\leq m_{1}\cdot m_{2}$ candidates suffice, where $m_{1}$ and $m_{2}$ are from Equations 5 and 6. Therefore, with

\displaystyle\mathcal{O}\!\left(\frac{\log\frac{1}{\delta}}{\varepsilon^{2}}\log m\right)\subseteq\mathcal{O}\!\left(\frac{\log\frac{1}{\delta}}{\varepsilon^{2}}\left(p\log\left(\frac{n\left|\Sigma\right|}{\varepsilon}\right)+\frac{n\log\left(\frac{p}{n\left(|\Sigma|-1\right)}\right)}{\log\left|\Sigma\right|}\log n\right)\right)

samples from $\mathbbm{P}$ , Theorem 2.5 chooses a distribution $\mathbbm{Q}$ amongst the $m$ candidates such that $d_{\mathrm{TV}}(\mathbbm{P},\mathbbm{Q})\leq\varepsilon$ with success probability at least $1-\delta$ . ∎
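To get a feel for how the bound scales, the following sketch evaluates $\log m_{1}+\log m_{2}$ and the resulting quantity $\frac{\log(1/\delta)}{\varepsilon^{2}}\log m$. All constants hidden by the $\mathcal{O}(\cdot)$ notation are set to 1, which is an assumption made purely for illustration, so the output indicates growth rates rather than actual sample counts. The sketch also assumes $n\geq 2$ and $p\geq n(|\Sigma|-1)$, the least number of parameters any Bayes net on $n$ nodes can have.

```python
import math


def log_m1(n: int, p: int, sigma: int, eps: float) -> float:
    """log of the cover size in Equation 5, with the hidden constant taken as 1."""
    return (p * sigma / (sigma - 1)) * math.log(n * sigma / eps)


def log_m2(n: int, p: int, sigma: int) -> float:
    """log of the DAG-count bound in Equation 6."""
    d = math.log(p / (n * (sigma - 1))) / math.log(sigma)   # bound on the average in-degree
    return n * d * math.log(n - 1) + n + n * math.log(d + 1)


def sample_scale(n: int, p: int, sigma: int, eps: float, delta: float) -> float:
    """log(1/delta) / eps^2 * log(m1 * m2), ignoring the O(.) constant."""
    log_m = log_m1(n, p, sigma, eps) + log_m2(n, p, sigma)
    return math.log(1 / delta) / eps ** 2 * log_m


print(sample_scale(n=5, p=10, sigma=2, eps=0.1, delta=0.05))
```

For fixed $n$, $|\Sigma|$, $\varepsilon$ and $\delta$, the dominant term grows linearly in the parameter bound $p$.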

6 Conclusion

In this work, we showed the hardness of finding a parameter-bounded Bayes net that represents some distribution $\mathbbm{P}$, given sample access to $\mathbbm{P}$, even under the promise that such a Bayes net exists. On a positive note, we gave a finite sample complexity bound sufficient to produce a Bayes net representing a probability distribution $\mathbbm{Q}$ that is close in TV distance to $\mathbbm{P}$. Our results generalize earlier known results of [CHM04] and [BCD20] respectively.

An intriguing open question is as follows:

Suppose we are given sample access to a distribution $\mathbbm{P}$ and are promised that there exists a Bayes net on $\mathcal{G}$ with at most $p$ parameters such that $\mathbbm{P}$ is Markov with respect to $\mathcal{G}$ . Is it hard to find a Bayes net $\mathcal{G}^{\prime}$ that has $\alpha\cdot p$ parameters such that $\mathbbm{P}$ is Markov with respect to $\mathcal{G}^{\prime}$ (where $\mathcal{G}^{\prime}$ may not be $\mathcal{G}$ ), for some constant $\alpha>1$ ?

Note that the hardness construction of [CHM04] only displayed an additive gap in the parameter bound. We conjecture that it is also hard to obtain such a multiplicative gap in the parameter bound, even in the promise setting.

Acknowledgements

We would like to thank Dimitris Zoros for helpful discussions. This research/project is supported by the National Research Foundation, Singapore under its AI Singapore Programme (AISG Award No: AISG-PhD/2021-08-013). The work of AB was supported in part by National Research Foundation Singapore under its NRF Fellowship Programme (NRF-NRFFAI-2019-0002) and an Amazon Faculty Research Award. SG’s work was partially supported by the SERB CRG Award CRG/2022/007985 and an IIT Kanpur initiation grant. Part of this work was done while the authors were visiting the Simons Institute for the Theory of Computing.

References

[BCD20] Johannes Brustle, Yang Cai, and Constantinos Daskalakis. Multi-item mechanisms without item-independence: Learnability via robustness. In Proceedings of the 21st ACM Conference on Economics and Computation, pages 715–761, 2020.
[CH92] Gregory F Cooper and Edward Herskovits. A Bayesian method for the induction of probabilistic networks from data. Machine learning, 9:309–347, 1992.
[Chi96] David Maxwell Chickering. Learning Bayesian networks is NP-complete. Learning from data: Artificial intelligence and statistics V, pages 121–130, 1996.
[Chi02] David Maxwell Chickering. Optimal structure identification with greedy search. Journal of machine learning research, 3(Nov):507–554, 2002.
[CHM04] Max Chickering, David Heckerman, and Chris Meek. Large-sample learning of Bayesian networks is NP-hard. Journal of Machine Learning Research, 5:1287–1330, 2004.
[Das99] Sanjoy Dasgupta. Learning polytrees. In Kathryn B. Laskey and Henri Prade, editors, UAI ’99: Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, Stockholm, Sweden, July 30 - August 1, 1999, pages 134–141. Morgan Kaufmann, 1999.
[DK14] Constantinos Daskalakis and Gautam Kamath. Faster and sample near-optimal algorithms for proper learning mixtures of gaussians. In Conference on Learning Theory, pages 1183–1213. PMLR, 2014.
[DL01] Luc Devroye and Gábor Lugosi. Combinatorial methods in density estimation. Springer Science & Business Media, 2001.
[EG08] Gal Elidan and Stephen Gould. Learning bounded treewidth Bayesian networks. In Daphne Koller, Dale Schuurmans, Yoshua Bengio, and Léon Bottou, editors, Advances in Neural Information Processing Systems 21, Proceedings of the Twenty-Second Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 8-11, 2008, pages 417–424. Curran Associates, Inc., 2008.
[FNP13] Nir Friedman, Iftach Nachman, and Dana Pe’er. Learning Bayesian network structure from massive datasets: The “sparse candidate” algorithm. CoRR, abs/1301.6696, 2013.
[Gav77] Fanica Gavril. Some $\mathsf{NP}$ -complete problems on graphs. In Proceedings of Conference on Information Sciences and Systems, pages 91–95, 1977.
[GK21] Robert Ganian and Viktoriia Korchemna. The complexity of Bayesian network learning: Revisiting the superstructure. In Marc’Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan, editors, Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pages 430–442, 2021.
[HGC95] David Heckerman, Dan Geiger, and David Maxwell Chickering. Learning Bayesian networks: The combination of knowledge and statistical data. Mach. Learn., 20(3):197–243, 1995.
[KSM22] Jack Kuipers, Polina Suter, and Giusi Moffa. Efficient sampling and structure learning of Bayesian networks. J. Comput. Graph. Stat., 31(3):639–650, 2022.
[Pea88] Judea Pearl. Probabilistic reasoning in intelligent systems: networks of plausible inference. Morgan Kaufmann, 1988.
[SDLC93] David J Spiegelhalter, A Philip Dawid, Steffen L Lauritzen, and Robert G Cowell. Bayesian analysis in expert systems. Statistical science, pages 219–247, 1993.
[SGS00] Peter Spirtes, Clark N Glymour, and Richard Scheines. Causation, prediction, and search. MIT press, 2000.
[TK05] Marc Teyssier and Daphne Koller. Ordering-based search: A simple and effective algorithm for learning Bayesian networks. In UAI ’05, Proceedings of the 21st Conference in Uncertainty in Artificial Intelligence, Edinburgh, Scotland, July 26-29, 2005, pages 548–549. AUAI Press, 2005.
[Ver18] Roman Vershynin. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge university press, 2018.
