Learnability of Parameter-Bounded Bayes Nets

Arnab Bhattacharyya
National University of Singapore
   Davin Choo
National University of Singapore
   Sutanu Gayen
Indian Institute of Technology Kanpur
   Dimitrios Myrisiotis
CNRS@CREATE LTD
Abstract

Bayes nets are extensively used in practice to efficiently represent joint probability distributions over a set of random variables and capture dependency relations. In a seminal paper, Chickering et al. (JMLR 2004) showed that given a distribution $\mathbbm{P}$, that is defined as the marginal distribution of a Bayes net, it is $\mathsf{NP}$-hard to decide whether there is a parameter-bounded Bayes net that represents $\mathbbm{P}$. They called this problem LEARN. In this work, we extend the $\mathsf{NP}$-hardness result of LEARN and prove the $\mathsf{NP}$-hardness of a promise search variant of LEARN, whereby the Bayes net in question is guaranteed to exist and one is asked to find such a Bayes net. We complement our hardness result with a positive result about the sample complexity that is sufficient to recover a parameter-bounded Bayes net that is close (in TV distance) to a given distribution $\mathbbm{P}$, that is represented by some parameter-bounded Bayes net, generalizing a degree-bounded sample complexity result of Brustle et al. (EC 2020).

1 Introduction

Bayesian networks [Pea88], or simply Bayes nets, are directed acyclic graphs (DAGs), accompanied by a collection of conditional probability distributions (that is, one for each vertex), that are used to represent joint probability distributions over dependent random variables in an elegant and succinct manner. As an example, consider a distribution $\mathbbm{P}$ over five Boolean variables $\bm{X}=\{X_{1},X_{2},X_{3},X_{4},X_{5}\}$. Regardless of the dependencies between the variables, $\mathbbm{P}$ can always be represented by a lookup table that has $2^{5}-1=31$ entries, with one entry for each possible Boolean outcome except one. However, known dependencies between variables may induce a sparser representation. If $X_{2}$ depends on $X_{1}$, $X_{3}$ depends on $\{X_{2},X_{5}\}$, and $X_{4}$ depends on $X_{3}$, then the joint distribution $\mathbbm{P}(x_{1},\ldots,x_{5})$ decomposes as

$$\mathbbm{P}(x_{1})\cdot\mathbbm{P}(x_{2}\mid x_{1})\cdot\mathbbm{P}(x_{3}\mid x_{2},x_{5})\cdot\mathbbm{P}(x_{4}\mid x_{3})\cdot\mathbbm{P}(x_{5}).$$

In fact, one can represent $\mathbbm{P}$ with a relatively sparse Bayes net $\mathcal{G}$ (see Figure 1) with conditional probability tables (CPTs) associated with each vertex. Observe that $10<31$ numbers suffice to describe the CPTs: one for each of the Bernoulli distributions of $X_{1}$ and $X_{5}$, two for each of the conditional probability distributions of $X_{2}$ and $X_{4}$ (one per value of the single parent), and four for that of $X_{3}$ (one per joint value of its two parents). In the rest of this work, we refer to the numbers used in defining the CPTs above as parameters.
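The per-vertex counting can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `count_parameters` and the dictionary encoding of parent sets are assumptions made here, and the parent sets are read off the factorization of the running example.

```python
def count_parameters(parents, alphabet_size=2):
    """Return the number of free CPT entries needed at each vertex.

    A vertex with k parents needs (alphabet_size - 1) free numbers for
    each of the alphabet_size ** k joint values of its parents.
    """
    return {
        vertex: (alphabet_size - 1) * alphabet_size ** len(pa)
        for vertex, pa in parents.items()
    }


# Parent sets of the running example (hypothetical encoding).
example_parents = {
    "X1": [],
    "X2": ["X1"],
    "X3": ["X2", "X5"],
    "X4": ["X3"],
    "X5": [],
}

per_vertex = count_parameters(example_parents)
total = sum(per_vertex.values())
# Size of the unstructured lookup table over five Boolean variables.
full_table = 2 ** len(example_parents) - 1

print(per_vertex)
print(total, full_table)
```

Passing a different `alphabet_size` gives the analogous count for variables over a larger common alphabet.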

Figure 1: Left: A Bayes net $\mathcal{G}$ (on vertices $X_{1},\ldots,X_{5}$) such that the distribution $\mathbbm{P}$ of our example is represented by $\mathcal{G}$. Right: A Bayes net $\mathcal{H}$ (on vertices $X_{1},X_{2},X_{4},X_{5}$) such that the distribution that arises from the distribution $\mathbbm{P}$ after marginalizing out $X_{3}$ is represented by $\mathcal{H}$.

It is a standard result [Pea88, CHM04] that there exists a Bayes net $\mathcal{G}$ of $p$ parameters that represents a probability distribution $\mathbbm{P}$ if and only if $\mathbbm{P}$ is Markov with respect to some Bayes net $\mathcal{H}$ of $p$ parameters that has the same underlying DAG as $\mathcal{G}$. (It could be that $\mathcal{G}=\mathcal{H}$, but this is not necessary.) Here, the property of a distribution $\mathbbm{P}$ being Markov with respect to a Bayes net $\mathcal{G}$ means that a certain graphical separation condition in the underlying DAG of $\mathcal{G}$, known as d-separation, implies conditional independence in $\mathbbm{P}$ (see Section 2.2 for formal definitions). We shall make use of this equivalence in the sequel.

A series of works studied the problem of learning the underlying DAG of a Bayes net from data by maximizing a certain scoring criterion over candidate DAGs; see, e.g., [CH92, SDLC93, HGC95]. This task was later shown to be $\mathsf{NP}$-hard by [Chi96], which then raised the following natural fundamental question:

Given a succinct description of a distribution $\mathbbm{P}$ (that is not in terms of a Bayes net), how easy is it to find a Bayes net $\mathcal{G}$ such that $\mathbbm{P}$ is Markov with respect to $\mathcal{G}$?

Unfortunately, [CHM04] showed that deciding whether or not a given distribution $\mathbbm{P}$ is Markov with respect to some Bayes net of at most $p\in\mathbb{N}$ parameters is $\mathsf{NP}$-hard.

Remark 1.1.

In [CHM04], the distribution $\mathbbm{P}$ is described as a certain marginal of a Bayes net of small in-degree, i.e., a succinct Bayes net description over variables $\bm{X}$, along with a subset of variables $\bm{S}\subseteq\bm{X}$ to marginalize out. For example, any distribution $\mathbbm{P}^{\prime}$ that is Markov with respect to the left Bayes net in Figure 1 is Markov with respect to the right Bayes net in Figure 1 after marginalizing out $X_{3}$. Note that all possible distributions over $\bm{X}$ are Markov with respect to some Bayes net over a clique, but such a Bayes net requires $2^{|\bm{X}|}-1$ parameters.

Regarding upper bounds, there are well-known algorithms for learning the underlying DAG of a Bayes net from distributional samples, such as the PC [SGS00] and GES [Chi02] algorithms. Recently, [BCD20] also gave finite sample guarantees for learning Bayes nets (that have $n$ nodes, each taking values over an alphabet $\Sigma$) from samples. When given the promise that the underlying DAG has in-degree bounded by $d$, [BCD20, Theorem 10] asserts that using

$$\mathcal{O}\left(\frac{\log\frac{1}{\delta}}{\varepsilon^{2}}\left(n\left|\Sigma\right|^{d+1}\log\left(\frac{n\left|\Sigma\right|}{\varepsilon}\right)+n\cdot d\cdot\log n\right)\right) \tag{1}$$

samples from the underlying distribution $\mathbbm{P}$, one can learn $\mathbbm{P}$ up to total variation (TV) distance $\varepsilon$ with probability at least $1-\delta$.

One standard way to measure the complexity of a Bayes net is by imposing an upper bound on any node’s in-degree. In this work, we measure the complexity in terms of the number of parameters $p$, which is a more fine-grained measure than that of maximum in-degree $d$. For instance, a Bayes net on $n$ Boolean variables with maximum in-degree $d$ could have as few as $O(n+2^{d})$ parameters (e.g., a star graph where $d$ leaves point towards the center of the star) and as many as $\Omega(n\cdot 2^{d})$ parameters (e.g., a complete graph).

1.1 Our Contributions

Our two main contributions are that we extend the hardness result of [CHM04] and generalize the finite sample complexity result of [BCD20].

Contribution 1.

We extend the hardness result of [CHM04] to the setting where the Bayes net in question is promised to have a small number of parameters. In computational complexity theory, this is known as a promise problem, which generalizes a decision problem in that the input is promised to belong to a certain subset of possible inputs. Our new hardness result confirms the common intuition that it is hard to search for a Bayes net $\mathcal{G}$ with respect to which a given probability distribution is Markov, even if it is known that the distribution in question is Markov with respect to a Bayes net that has a small number of parameters.

Definition 1.2 (The REALIZABLE-LEARN problem).

Given as input variables $\bm{X}=(X_{1},\ldots,X_{n})$, a probability distribution $\mathbbm{P}$ on $\bm{X}$, a parameter bound $p\in\mathbb{N}$, and the promise that there exists a Bayes net $\mathcal{G}$ with at most $p$ parameters such that $\mathbbm{P}$ is Markov with respect to $\mathcal{G}$, output a Bayes net $\mathcal{G}^{\prime}$ with at most $p$ parameters such that $\mathbbm{P}$ is Markov with respect to $\mathcal{G}^{\prime}$.

Theorem 1.3.

REALIZABLE-LEARN is $\mathsf{NP}$-hard.

Technically speaking, REALIZABLE-LEARN is a promise search problem. While $\mathsf{NP}$-hardness results usually revolve around decision problems, $\mathsf{NP}$-hardness naturally extends to the more general case of search problems when Turing reductions are considered. (Turing reductions comprise a very broad class of reductions, whereby an efficient algorithm for a problem yields an efficient algorithm for another.)

Contribution 2.

We generalize the finite sample result of [BCD20] from the degree-bounded setting to the parameter-bounded setting.

Theorem 1.4 (Approximating parameter-bounded Bayes nets using samples).

Fix any accuracy parameter $\varepsilon>0$ and confidence parameter $\delta>0$. Given sample access to a distribution $\mathbbm{P}$ over $n$ variables, each defined on the alphabet $\Sigma$, and the promise that $\mathbbm{P}$ is Markov with respect to a Bayes net with at most $p$ parameters,

$$\mathcal{O}\left(\frac{\log\frac{1}{\delta}}{\varepsilon^{2}}\left(p\log\left(\frac{n\left|\Sigma\right|}{\varepsilon}\right)+n\,\frac{\log\left(\frac{p}{n\left(|\Sigma|-1\right)}\right)}{\log\left|\Sigma\right|}\,\log n\right)\right)$$

samples from $\mathbbm{P}$ suffice to learn the underlying DAG of a Bayes net $\mathcal{G}$ with at most $p$ parameters and define a distribution $\mathbbm{Q}$ that is Markov with respect to $\mathcal{G}$ such that $d_{\mathrm{TV}}(\mathbbm{P},\mathbbm{Q})\leq\varepsilon$ with success probability at least $1-\delta$.

Notice that when all in-degrees are at most $d$, we have $p\leq(\left|\Sigma\right|-1)\cdot n\cdot\left|\Sigma\right|^{d}$, so our result generalizes the bound of [BCD20] given in Equation (1). (Footnote 1: One can further generalize Theorem 1.4 to the case where each node has a different alphabet size, e.g., $X_{i}$ has alphabet $\Sigma_{i}$, but this is a straightforward extension.)
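The specialization can be checked term by term. The following short calculation is a sketch that assumes only that the bound of Theorem 1.4 is nondecreasing in $p$, so that it suffices to substitute the largest admissible value $p=(|\Sigma|-1)\cdot n\cdot|\Sigma|^{d}$ into its two inner terms:

```latex
\begin{align*}
p\log\left(\frac{n|\Sigma|}{\varepsilon}\right)
  &= (|\Sigma|-1)\, n\, |\Sigma|^{d}\log\left(\frac{n|\Sigma|}{\varepsilon}\right)
   \le n\,|\Sigma|^{d+1}\log\left(\frac{n|\Sigma|}{\varepsilon}\right), \\
n\,\frac{\log\left(\frac{p}{n(|\Sigma|-1)}\right)}{\log|\Sigma|}\,\log n
  &= n\,\frac{\log\left(|\Sigma|^{d}\right)}{\log|\Sigma|}\,\log n
   = n\cdot d\cdot\log n .
\end{align*}
```

Both right-hand sides are exactly the two inner terms of Equation (1).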

Finally, we note that while the algorithm behind Theorem 1.4 runs in time polynomial in $1/\delta$ and $1/\varepsilon^{2}$, it has an exponential dependency on the number of samples from $\mathbbm{P}$, similar to [BCD20].

1.2 Paper Outline

After preliminaries in Section 2, we give a high-level overview of the techniques behind our results in Section 3. We then formally prove Theorem 1.3 and Theorem 1.4 in Section 4 and Section 5, respectively. Finally, we conclude with an open problem in Section 6.

2 Preliminaries

2.1 Notation

We denote the set of natural numbers by $\mathbb{N}$, and all logarithms refer to the natural $\log$. Distributions are written as $\mathbbm{P}$, $\mathbbm{Q}$ and graphs in calligraphic letters, e.g., $\mathcal{G},\mathcal{H},\mathcal{K}$. For variables/nodes, we use capital letters, small letters for the values taken by them, and boldface versions for a collection of variables, e.g., $X=x$ and $\bm{X}=\bm{x}$. As shorthands, we write $[n]$ for $\{1,\ldots,n\}$ and $\mathbbm{P}(\bm{x})$ for $\mathbbm{P}(\bm{X}=\bm{x})$. We will often represent the same set of variables of distributions as nodes in a graph.

Problems and algorithms are named in the typewriter font in full caps and capitalized, respectively, e.g., PROBLEM and Algorithm.

We will also often use $\Sigma$ to denote the alphabet set of a variable and write $\Delta_{|\Sigma|}$ to denote the corresponding (conditional) probability simplex.

2.2 Graph-Theoretical Notions

Let $\mathcal{G}=(\bm{X},\bm{E})$ be a fully directed graph on $|\bm{X}|=n$ vertices and $|\bm{E}|$ edges where adjacencies are denoted with dashes, e.g., $X-Y$, and arc directions are denoted with arrows, e.g., $X\to Y$. For any node $X\in\bm{X}$, we write $\mathbf{Pa}_{\mathcal{G}}(X)\subseteq\bm{X}$ to denote its parents and $\mathbf{pa}_{\mathcal{G}}(X)$ to denote the values they take.

A degree sequence of a graph $\mathcal{G}$ on vertex set $\bm{X}=\{X_{1},\ldots,X_{n}\}$ is a list of degrees $\bm{d}=(d_{1},\ldots,d_{n})$ of all vertices in the graph, i.e., vertex $X_{i}$ has degree $d_{i}$. A graph $\mathcal{H}=(\bm{X},\bm{E})$ is said to realize degree sequence $\bm{d}$ if the degrees of $\bm{X}$ in $\mathcal{H}$ agree with $\bm{d}$. Realizability is defined in a similar fashion for in-degree sequences $\bm{d}^{-}=(d_{1}^{-},\ldots,d_{n}^{-})$.

The graph $\mathcal{G}$ is called a directed acyclic graph (DAG) if it does not contain any directed cycles and is said to be complete if for every two of its nodes $U,V\in\bm{X}$ either there is an edge $V\to U$ or an edge $U\to V$, i.e., the underlying undirected graph is a clique. A vertex $V_{i}$ on any simple path $V_{1}-\ldots-V_{k}$ is called a collider if the arcs are such that $V_{i-1}\to V_{i}\leftarrow V_{i+1}$.

A Bayesian network (or Bayes net) $\mathcal{G}$ for a set of $n$ variables $X_{1},\ldots,X_{n}$ is described by a DAG $(\bm{X},\bm{E})$ and $n$ corresponding conditional probability tables (CPTs), e.g., the CPT for $X_{i}\in\bm{X}$ describes $\mathbbm{P}(x_{i}\mid\mathbf{pa}_{\mathcal{G}}(X_{i}))$ for all possible values of $x_{i}$ and $\mathbf{pa}_{\mathcal{G}}(X_{i})$. The joint distribution $\mathbbm{P}$ factorizes as

$$\mathbbm{P}(\bm{x})=\prod_{i=1}^{n}\mathbbm{P}(x_{i}\mid\mathbf{pa}_{\mathcal{G}}(X_{i})),$$

and we say that $\mathcal{G}$ represents $\mathbbm{P}$.

All independence constraints that hold in the joint distribution of a Bayes net that has underlying DAG $\mathcal{G}$ are exactly captured by the d-separation criterion [Pea88, Section 3.3.1]. Two nodes $X,Y\in\bm{X}$ are said to be d-separated in a DAG $\mathcal{G}=(\bm{X},\bm{E})$ given a set $\bm{Z}\subseteq\bm{X}\setminus\{X,Y\}$ if and only if there is no $\bm{Z}$-active path in $\mathcal{G}$ between $X$ and $Y$; a $\bm{Z}$-active path is a simple path $Q$ such that any vertex from $\bm{Z}$ on $Q$ occurs as a collider and any vertex from $\bm{X}\setminus\bm{Z}$ appears as a non-collider. Two nodes are d-connected if they are not d-separated. It is known that $X$ is d-separated from its non-descendants given its parents [Pea88, Section 3.3.1, Corollary 4].
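For small DAGs, d-separation queries can be answered mechanically. The sketch below does not enumerate active paths; instead it swaps in the standard ancestral moral graph test for d-separation (restrict to the ancestors of the queried nodes, connect co-parents, drop directions, and check whether the conditioning set disconnects the two nodes). The function names and the dictionary encoding of parent sets are assumptions made for illustration, and the example DAG is the one from the running example of Section 1.

```python
from itertools import combinations


def ancestral_set(parents, nodes):
    """Return `nodes` together with all of their ancestors."""
    seen = set(nodes)
    stack = list(nodes)
    while stack:
        v = stack.pop()
        for u in parents.get(v, ()):
            if u not in seen:
                seen.add(u)
                stack.append(u)
    return seen


def d_separated(parents, x, y, z):
    """Test whether x and y are d-separated given z (x, y not in z)."""
    z = set(z)
    keep = ancestral_set(parents, {x, y} | z)
    # Moralize the ancestral subgraph: link each child to its parents and
    # link every pair of parents that share a child, ignoring directions.
    adj = {v: set() for v in keep}
    for v in keep:
        pa = list(parents.get(v, ()))
        for u in pa:
            adj[u].add(v)
            adj[v].add(u)
        for a, b in combinations(pa, 2):
            adj[a].add(b)
            adj[b].add(a)
    # x and y are d-separated iff every undirected path passes through z.
    seen = {x}
    stack = [x]
    while stack:
        v = stack.pop()
        if v == y:
            return False
        for w in adj[v]:
            if w not in seen and w not in z:
                seen.add(w)
                stack.append(w)
    return True


# DAG of the running example (hypothetical encoding of its parent sets).
dag = {"X1": [], "X2": ["X1"], "X3": ["X2", "X5"], "X4": ["X3"], "X5": []}

print(d_separated(dag, "X1", "X3", {"X2"}))  # chain blocked by X2
print(d_separated(dag, "X2", "X5", set()))   # collider X3 not conditioned on
print(d_separated(dag, "X2", "X5", {"X3"}))  # conditioning on the collider
```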

A probability distribution $\mathbbm{P}$ is said to be Markov with respect to a DAG $\mathcal{G}$ if d-separation in $\mathcal{G}$ implies conditional independence in $\mathbbm{P}$. Note that any distribution is Markov with respect to the complete DAG, since there are no d-separations implied by this kind of DAG. (Moreover, any Bayes net over a complete DAG requires $2^{|\bm{X}|}-1$ parameters to describe.) A probability distribution $\mathbbm{P}$ is said to be Markov with respect to a Bayes net $\mathcal{G}$ if $\mathbbm{P}$ is Markov with respect to the underlying DAG of $\mathcal{G}$.

2.3 Some Problems of Interest

[CHM04] introduced a decision problem about learning Bayes nets from data, called LEARN, and proved that LEARN is $\mathsf{NP}$-hard by showing a reduction from the $\mathsf{NP}$-hard decision problem degree-bounded feedback arc set (DBFAS). See Definition 2.1 and Definition 2.2.

Definition 2.1 (The DBFAS decision problem).

Given a directed graph $\mathcal{G}=(\bm{X},\bm{E})$ with maximum vertex degree of $3$, and a positive integer $k\leq|\bm{E}|$, determine whether there is a subset of edges $\bm{E}^{\prime}\subseteq\bm{E}$ of size $|\bm{E}^{\prime}|\leq k$ such that $\bm{E}^{\prime}$ contains at least one directed edge from every directed cycle in $\mathcal{G}$.
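To make the definition concrete, the following sketch checks whether a given edge set hits every directed cycle (equivalently, whether deleting it leaves an acyclic graph) and then decides a tiny DBFAS instance by exhaustive search. The function names and the example graph are hypothetical, and the brute-force search is exponential in $k$; it is meant only as an illustration, not as a practical algorithm.

```python
from collections import deque
from itertools import combinations


def is_feedback_arc_set(vertices, edges, removed):
    """True if deleting `removed` from `edges` leaves no directed cycle."""
    removed = set(removed)
    indegree = {v: 0 for v in vertices}
    successors = {v: [] for v in vertices}
    for u, v in edges:
        if (u, v) not in removed:
            successors[u].append(v)
            indegree[v] += 1
    # Kahn's algorithm: the graph is acyclic iff all vertices get processed.
    queue = deque(v for v in vertices if indegree[v] == 0)
    processed = 0
    while queue:
        u = queue.popleft()
        processed += 1
        for v in successors[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)
    return processed == len(vertices)


def dbfas_brute_force(vertices, edges, k):
    """Decide DBFAS by trying every edge subset of size at most k."""
    for size in range(k + 1):
        for removed in combinations(edges, size):
            if is_feedback_arc_set(vertices, edges, removed):
                return True
    return False


# Two directed cycles a->b->c->a and b->c->d->b sharing the edge (b, c);
# every vertex has total degree at most 3.
V = ["a", "b", "c", "d"]
E = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d"), ("d", "b")]

print(is_feedback_arc_set(V, E, [("b", "c")]))
print(is_feedback_arc_set(V, E, [("a", "b")]))
print(dbfas_brute_force(V, E, 1))
```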

Definition 2.2 (The LEARN decision problem).

Given variables $\bm{X}=(X_{1},\ldots,X_{n})$, a probability distribution $\mathbbm{P}$ over $\bm{X}$, and a parameter bound $p\in\mathbb{N}$, determine whether there exists a Bayes net $\mathcal{G}$ with at most $p$ parameters such that $\mathbbm{P}$ is Markov with respect to $\mathcal{G}$.

In our work, we focus on the particular LEARN instances $(\bm{X},\mathbbm{P},p)$ used in [CHM04], namely LEARN-DBFAS.

Definition 2.3 (The LEARN-DBFAS decision problem).

Let $R$ denote the reduction of [CHM04] from DBFAS to LEARN. We define as LEARN-DBFAS the set of instances of LEARN that are in the range of $R$.

That is, for any instance $I_{L}$ of LEARN-DBFAS, there is some instance $I_{D}$ of DBFAS such that $R\left(I_{D}\right)=I_{L}$.

An independence oracle for a distribution $\mathbbm{P}$ is an oracle that can determine, in constant time, whether or not $U\perp\!\!\!\perp V\mid\bm{Z}$ for any $U,V\in\bm{X}$ and $\bm{Z}\subseteq\bm{X}\setminus\{U,V\}$. We will use the following result by [CHM04].

Theorem 2.4 ([CHM04]).

LEARN-DBFAS is $\mathsf{NP}$-hard even when one is given access to an independence oracle.

2.4 Selecting a Close Distribution With Finite Samples

The classic method to select an approximate distribution amongst a set of candidate distributions is via the Scheffé tournament of [DL01], which provides a logarithmic dependency on the number of candidates.

In our work, we will use the Scheffé-based algorithm of [DK14], which, given sample access to an input distribution and explicit access to some candidate distributions, outputs with high probability a candidate distribution that is sufficiently close, in total variation (TV) distance ($d_{\mathrm{TV}}$), to the input distribution. (Footnote 2: Their result is actually more general than what we state here. For instance, they only require sample access to the distributions in $\boldsymbol{\mathbb{Q}}=\{\mathbbm{Q}_{1},\ldots,\mathbbm{Q}_{m}\}$, while our setting is simpler as we have explicit descriptions of each of these distributions.)

Theorem 2.5 ([DK14]).

Fix any accuracy parameter $\varepsilon>0$ and confidence parameter $\delta>0$. Suppose there is a distribution $\mathbbm{P}$ over variables $\bm{X}$ and a collection of explicit distributions $\boldsymbol{\mathbb{Q}}=\{\mathbbm{Q}_{1},\ldots,\mathbbm{Q}_{m}\}$, where each distribution $\mathbbm{Q}_{i}$ is defined over the same set $\bm{X}$ and there exists some $\mathbbm{Q}^{*}\in\boldsymbol{\mathbb{Q}}$ such that $d_{\mathrm{TV}}(\mathbbm{P},\mathbbm{Q}^{*})\leq\varepsilon$. Then, there is an algorithm that uses $\mathcal{O}\left(\frac{\log 1/\delta}{\varepsilon^{2}}\log m\right)$ samples from $\mathbbm{P}$ and returns some $\mathbbm{Q}\in\boldsymbol{\mathbb{Q}}$ such that $d_{\mathrm{TV}}(\mathbbm{P},\mathbbm{Q})\leq 10\varepsilon$ with success probability at least $1-\delta$ and running time $\mathrm{poly}(m,1/\delta,1/\varepsilon^{2})$.
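To convey the flavor of Scheffé-based selection, here is a minimal round-robin sketch over explicit candidates on a finite domain. It is not the algorithm of [DK14] and carries none of the guarantees of Theorem 2.5; it only illustrates the pairwise Scheffé test, in which two candidates are compared on the set where the first assigns more mass than the second. All names, the dictionary encoding of distributions, and the toy data are assumptions made for illustration.

```python
def scheffe_test(q1, q2, samples):
    """Return 0 if q1 wins the pairwise Scheffé test, else 1.

    The Scheffé set is A = {x : q1(x) > q2(x)}; the winner is the
    candidate whose mass on A is closer to the empirical mass of A.
    """
    support = set(q1) | set(q2)
    scheffe_set = {x for x in support if q1.get(x, 0.0) > q2.get(x, 0.0)}
    empirical = sum(1 for s in samples if s in scheffe_set) / len(samples)
    mass1 = sum(q1.get(x, 0.0) for x in scheffe_set)
    mass2 = sum(q2.get(x, 0.0) for x in scheffe_set)
    return 0 if abs(mass1 - empirical) <= abs(mass2 - empirical) else 1


def scheffe_tournament(candidates, samples):
    """Return the index of the candidate with the most pairwise wins."""
    wins = [0] * len(candidates)
    for i in range(len(candidates)):
        for j in range(i + 1, len(candidates)):
            winner = scheffe_test(candidates[i], candidates[j], samples)
            wins[i if winner == 0 else j] += 1
    return max(range(len(candidates)), key=wins.__getitem__)


# Toy candidates over a single Boolean variable (hypothetical data).
candidates = [
    {0: 0.5, 1: 0.5},
    {0: 0.9, 1: 0.1},
    {0: 0.1, 1: 0.9},
]
# A fixed sample in which the outcome 0 appears 90% of the time.
samples = [0] * 90 + [1] * 10

best = scheffe_tournament(candidates, samples)
print(best, candidates[best])
```

The round-robin uses a number of comparisons quadratic in the number of candidates, which is acceptable for a sketch but is one reason more refined tournaments exist.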

To curate a set of candidates $\boldsymbol{\mathbb{Q}}$, we rely on the following lemma of [BCD20], which states that any distribution $\mathbbm{Q}$ that approximately agrees with $\mathbbm{P}$ on the local conditional distribution at each node will be close in TV distance to $\mathbbm{P}$ on the entire domain.

Lemma 2.6 ([BCD20]).

Suppose $\mathbbm{P}$ and $\mathbbm{Q}$ are Bayes nets on the same DAG $\mathcal{G}=(\bm{X},\bm{E})$ with $n$ nodes. If

$$d_{\mathrm{TV}}\Big(\mathbbm{P}(X\mid\mathbf{Pa}_{\mathcal{G}}(X)=\sigma),\ \mathbbm{Q}(X\mid\mathbf{Pa}_{\mathcal{G}}(X)=\sigma)\Big)\leq\frac{\varepsilon}{n}$$

for all nodes $X\in\bm{X}$ and possible parent values $\sigma\in\Sigma^{|\mathbf{Pa}_{\mathcal{G}}(X)|}$, then $d_{\mathrm{TV}}(\mathbbm{P},\mathbbm{Q})\leq\varepsilon$.

Although there are infinitely many possible distributions, since we are satisfied with an approximately close distribution, one can discretize the space via an $\varepsilon$-net.

Definition 2.7 ($\varepsilon$-nets; [Ver18]).

Fix a metric space $(\bm{T},d)$. For any subset $\bm{K}\subseteq\bm{T}$ and $\varepsilon>0$, a subset $\bm{N}\subseteq\bm{K}$ is called an $\varepsilon$-net of $\bm{K}$ if every point in $\bm{K}$ is within distance $\varepsilon$ of some point in $\bm{N}$. That is, $\forall x\in\bm{K},\exists x_{0}\in\bm{N}$ such that $d(x,x_{0})\leq\varepsilon$. We say that $\bm{N}$ $\varepsilon$-covers $\bm{K}$.
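As a concrete illustration of the definition, consider Bernoulli distributions, for which the TV distance between $\mathrm{Bern}(a)$ and $\mathrm{Bern}(b)$ equals $|a-b|$. The Python sketch below builds a uniform grid on $[0,1]$ and checks numerically that it is an $\varepsilon$-net; the function names and the grid-based check are illustrative assumptions, not part of the construction used later.

```python
import math
from typing import List


def bernoulli_eps_net(eps: float) -> List[float]:
    """Grid of Bernoulli parameters spaced eps apart, covering [0, 1]."""
    steps = math.ceil(1.0 / eps)
    return [min(1.0, k * eps) for k in range(steps + 1)]


def covers(net: List[float], eps: float, grid_size: int = 1000) -> bool:
    """Check on a fine grid that every parameter is within eps of the net."""
    return all(
        min(abs(i / grid_size - y) for y in net) <= eps + 1e-12
        for i in range(grid_size + 1)
    )


net = bernoulli_eps_net(0.25)
```

The net has roughly $1/\varepsilon$ points, so its size grows only polynomially in $1/\varepsilon$ for a single Bernoulli parameter.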

As we shall see in Section 5, the candidate set $\boldsymbol{\mathbb{Q}}$ will be created by computing an $\frac{\varepsilon}{n}$-net with respect to the TV distance and then applying Lemma 2.6 suitably.

2.5 Other Related Work

We have already referred to some papers that are relevant to our work. We resume this discussion here. [Das99] considers the task of learning the maximum-likelihood polytree from data. The main result of this paper is that the optimal branching (or Chow-Liu tree) is a good approximation to the best polytree. This result is then complemented by the observation that this learning problem is $\mathsf{NP}$-hard, even to approximately solve within some constant factor.

[TK05] propose a simple heuristic method for addressing the task of learning Bayes nets. Their approach is based on the fact that the best network (of bounded in-degree) consistent with a given node ordering can be found efficiently.

[EG08] present a method for learning Bayes nets of bounded treewidth that employs global structure modifications and that is polynomial both in the size of the graph and the treewidth bound. At the heart of their method is a dynamic triangulation, that they update in a way which facilitates the addition of chain structures that increase the bound on the model’s treewidth by at most one.

[FNP13] introduce an algorithm that achieves learning by restricting the search space. Their iterative algorithm restricts the parents of each variable to belong to a small subset of candidates. They then search for a network that satisfies these constraints and the learned network is then used for selecting better candidates for the next iteration.

[GK21] investigate the parameterized complexity of Bayesian Network Structure Learning (BNSL). They show that parameterizing BNSL by the size of a feedback edge set yields fixed-parameter tractability.

[KSM22] combine constraint-based methods with greedy or Markov chain Monte Carlo (MCMC) schemes in a method which reduces the complexity of MCMC approaches to that of a constraint-based method.

3 Technical Overview

Here, we give a brief high-level overview of the techniques used in our results of Theorem 1.3 and Theorem 1.4.

3.1 $\mathsf{NP}$-Hardness of the Realizable Case

By Theorem 2.4, it would suffice to prove that the existence of a polynomial time algorithm for REALIZABLE-LEARN implies that LEARN-DBFAS instances can be solved in polynomial time if one has access to an independence oracle. The desired result will then follow from the facts that we can efficiently $(a)$ compute the number of parameters of a Bayes net and $(b)$ decide whether a given distribution is Markov with respect to a given Bayes net (when given access to an independence oracle).

Suppose we have a polynomial time algorithm Learner for REALIZABLE-LEARN. Note that it is conventional to assume that such an algorithm always halts within some polynomial-time bound, and outputs some Bayes net, even when the respective promise is violated. We define and analyze the following reduction:

Given an arbitrary instance $(\bm{X},\mathbbm{P},p)$ of LEARN-DBFAS, run Learner on it to obtain a Bayes net $\mathcal{G}$. Then check whether $\mathcal{G}$ has at most $p$ parameters and (while using an independence oracle) check whether or not $\mathbbm{P}$ is Markov with respect to $\mathcal{G}$. If both of these checks are positive, then output YES. Otherwise, output NO. See Section 4 for the formal proof of Theorem 1.3.

3.2 Approximately Learning Parameter-Bounded Bayes Networks

The main idea is to construct an $\varepsilon$-net over all possible DAGs that satisfy the parameter upper bound $p$, and then apply a well-known bound from the density estimation literature.

For this purpose, we need to count all possible Bayes nets that satisfy the parameter upper bound $p$. By a counting argument, we see that there are not many possible DAGs that give rise to some Bayes net of at most $p$ parameters. Then, by a counting argument again, we see that there are only a few conditional distributions that are Markov with respect to a Bayes net $\mathcal{G}$ over a DAG that realizes a given in-degree sequence. Thus we are able to bound the number of distributions that cover all possible conditional distributions which are Markov with respect to $\mathcal{G}$. See Section 5 for the formal proof of Theorem 1.4.

4 REALIZABLE-LEARN is $\mathsf{NP}$-hard

To show that REALIZABLE-LEARN is hard, we reduce LEARN-DBFAS to REALIZABLE-LEARN by making polynomially-many calls to an independence oracle. Given any polynomial time algorithm Learner that solves REALIZABLE-LEARN, we will forward the LEARN-DBFAS instance to Learner and examine the produced Bayes net $\mathcal{G}$. We will describe a polynomial time procedure Reduction that uses an independence oracle to determine whether we should correctly output YES or NO for the given LEARN-DBFAS instance. See Figure 2 for a pictorial illustration of our reduction strategy.

[Figure 2 diagram. Nodes: DBFAS instance; LEARN-DBFAS instance; REALIZABLE-LEARN instance; Learner; Reduction. Edge labels: $(\bm{X},\mathbbm{P},p)$; promise; additional promise; Bayes net $\mathcal{G}$; YES or NO.]
Figure 2: [Gav77] showed that DBFAS is $\mathsf{NP}$-hard and [CHM04] showed that LEARN-DBFAS is $\mathsf{NP}$-hard, even when given access to an independence oracle for $\mathbbm{P}$. REALIZABLE-LEARN is a variant of LEARN-DBFAS with the additional promise that there exists a Bayes net $\mathcal{G}$ with at most $p$ parameters such that $\mathbbm{P}$ is Markov with respect to $\mathcal{G}$. In this work, we show that if one can learn such a Bayes net $\mathcal{G}$ (via some blackbox polynomial time algorithm Learner), then there is a polynomial time algorithm Reduction that correctly answers LEARN-DBFAS. Therefore, REALIZABLE-LEARN is also $\mathsf{NP}$-hard.

We begin by observing that one can easily compute the number of parameters of a Bayes net given its full description.

Lemma 4.1.

Given a Bayes net over $\mathcal{G}=(\bm{X},\bm{E})$, one can compute the number of its parameters in polynomial time.

Proof.

Let $\Sigma_{X}$ denote the alphabet set of $X\in\bm{X}$. Then, the number of parameters of $\mathcal{G}$ is

\[ \sum_{X\in\bm{X}}\left((|\Sigma_{X}|-1)\prod_{U\in\mathbf{Pa}_{\mathcal{G}}(X)}|\Sigma_{U}|\right), \]

which can be computed in polynomial time. ∎
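The formula above translates directly into a few lines of Python. The sketch below is one possible encoding, assuming the DAG is given as a mapping from each node to its list of parents; the names `count_parameters`, `alphabet_size`, and `parents` are hypothetical. The example reuses the five Boolean variables from the Introduction.

```python
from math import prod
from typing import Dict, Hashable, List


def count_parameters(
    alphabet_size: Dict[Hashable, int],
    parents: Dict[Hashable, List[Hashable]],
) -> int:
    """Sum over nodes X of (|Sigma_X| - 1) * product of |Sigma_U| over parents U."""
    return sum(
        (alphabet_size[x] - 1) * prod(alphabet_size[u] for u in parents.get(x, ()))
        for x in alphabet_size
    )


# Five Boolean variables: X2 depends on X1, X3 on {X2, X5}, X4 on X3.
sizes = {f"X{i}": 2 for i in range(1, 6)}
sparse = {"X2": ["X1"], "X3": ["X2", "X5"], "X4": ["X3"]}
# Complete DAG on the same variables, ordered X1, ..., X5.
complete = {f"X{i}": [f"X{j}" for j in range(1, i)] for i in range(1, 6)}

n_sparse = count_parameters(sizes, sparse)      # 1 + 2 + 4 + 2 + 1
n_complete = count_parameters(sizes, complete)  # 1 + 2 + 4 + 8 + 16
```

The computation touches each node and each edge once, so it runs in time linear in the size of the DAG description.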

The following notion of important edges will come in handy in the sequel.

Definition 4.2 (Important edges).

Let $\mathcal{G}=(\bm{V},\bm{E})$ be a DAG and $\mathbbm{P}$ be a distribution over $\bm{V}$. Then, an edge $e\in\bm{E}$ is called $(\mathcal{G},\mathbbm{P})$-important if $\mathbbm{P}$ is Markov with respect to $\mathcal{G}$ but is not Markov with respect to $\mathcal{G}'=(\bm{V},\bm{E}\setminus\{e\})$.

To check whether $\mathbbm{P}$ is Markov with respect to $\mathcal{G}$, one could verify that any d-separation in $\mathcal{G}$ implies conditional independence in $\mathbbm{P}$. However, this computation seems to be intractable. In contrast, Corollary 4.5 gives a polynomial time algorithm that checks this while using an independence oracle.

The correctness of Corollary 4.5 follows from Lemma 4.3 and Lemma 4.4.

Lemma 4.3.

Suppose $\mathbbm{P}$ on variables $\bm{X}$ is Markov with respect to $\mathcal{G}=(\bm{X},\bm{E})$. Then, an edge $A\to B$ in $\bm{E}$ is not $(\mathcal{G},\mathbbm{P})$-important if $A\perp\!\!\!\perp B\mid\mathbf{Pa}_{\mathcal{G}}(B)\setminus\{A\}$.

Proof.

Consider an arbitrary edge $A\to B$ in $\bm{E}$ such that $A\perp\!\!\!\perp B\mid\mathbf{Pa}_{\mathcal{G}}(B)\setminus\{A\}$. Say, $A=X_{j}$ and $B=X_{k}$. Letting $\mathcal{G}'=(\bm{X},\bm{E}\setminus\{(A,B)\})$ be the subgraph of $\mathcal{G}$ that does not contain the edge $A\to B$, we see that

\begin{align*}
\mathbbm{P}(\bm{x}) &= \prod_{i=1}^{n}\mathbbm{P}(x_{i}\mid\mathbf{pa}_{\mathcal{G}}(X_{i})) && (\ast)\\
&= \mathbbm{P}(x_{k}\mid\mathbf{pa}_{\mathcal{G}}(X_{k}))\cdot\prod_{i\in[n]\setminus k}\mathbbm{P}(x_{i}\mid\mathbf{pa}_{\mathcal{G}}(X_{i}))\\
&= \mathbbm{P}(x_{k}\mid\mathbf{pa}_{\mathcal{G}}(X_{k})\setminus x_{j})\cdot\prod_{i\in[n]\setminus k}\mathbbm{P}(x_{i}\mid\mathbf{pa}_{\mathcal{G}}(X_{i})) && (\dagger)\\
&= \mathbbm{P}(x_{k}\mid\mathbf{pa}_{\mathcal{G}'}(X_{k}))\cdot\prod_{i\in[n]\setminus k}\mathbbm{P}(x_{i}\mid\mathbf{pa}_{\mathcal{G}'}(X_{i})) && (\ddagger)\\
&= \prod_{i=1}^{n}\mathbbm{P}(x_{i}\mid\mathbf{pa}_{\mathcal{G}'}(X_{i})),
\end{align*}

where $(\ast)$ is due to $\mathbbm{P}$ being Markov with respect to $\mathcal{G}$, $(\dagger)$ is due to $X_{j}\perp\!\!\!\perp X_{k}\mid\mathbf{Pa}_{\mathcal{G}}(X_{k})\setminus\{X_{j}\}$, and $(\ddagger)$ is due to the definition of $\mathcal{G}'$. Since $\mathbbm{P}(\bm{x})=\prod_{i=1}^{n}\mathbbm{P}(x_{i}\mid\mathbf{pa}_{\mathcal{G}'}(X_{i}))$, we see that $\mathbbm{P}$ is also Markov with respect to $\mathcal{G}'$, and so the edge $A\to B$ is not $(\mathcal{G},\mathbbm{P})$-important. ∎

Lemma 4.4.

Suppose a distribution $\mathbbm{P}$ on variables $\bm{X}$ is Markov with respect to a DAG $\mathcal{G}=(\bm{X},\bm{E})$. Let $\mathcal{G}'=(\bm{X},\bm{E}')$ be an edge-induced DAG of $\mathcal{G}$ with $\bm{E}'\subseteq\bm{E}$. Then, $\mathbbm{P}$ is Markov with respect to $\mathcal{G}'$ if and only if $A\perp\!\!\!\perp B\mid\mathbf{Pa}_{\mathcal{G}}(B)\setminus\{A\}$ for all edges $A\to B$ in $\bm{E}\setminus\bm{E}'$.

Proof.

We prove each direction separately.

($\Rightarrow$)

Suppose that $\mathbbm{P}$ is Markov with respect to $\mathcal{G}'$. Consider an arbitrary edge $A\to B\in\bm{E}\setminus\bm{E}'$. Since $A$ is an ancestor of $B$ in $\mathcal{G}$ and $\mathcal{G}$ is acyclic, $A$ is a non-descendant of $B$ in $\mathcal{G}$, and it remains a non-descendant of $B$ in $\mathcal{G}'$ after removing the edge $A\to B$. So, $A$ and $B$ are d-separated in $\mathcal{G}'$ given $\mathbf{Pa}_{\mathcal{G}'}(B)\setminus\{A\}$, and so $A\perp\!\!\!\perp B\mid\mathbf{Pa}_{\mathcal{G}'}(B)$ by the Markov property. That is, $A\perp\!\!\!\perp B\mid\mathbf{Pa}_{\mathcal{G}}(B)\setminus\{A\}$.

($\Leftarrow$)

Suppose that $A\perp\!\!\!\perp B\mid\mathbf{Pa}_{\mathcal{G}}(B)\setminus\{A\}$ for all edges $A\to B$ in $\bm{E}\setminus\bm{E}'$. Order the edges in $\bm{E}\setminus\bm{E}'$ in an arbitrary sequence, say $e_{1},\ldots,e_{|\bm{E}\setminus\bm{E}'|}$. Let us remove these edges sequentially, resulting in a sequence of edge-induced DAGs $\mathcal{G}=\mathcal{G}_{0},\mathcal{G}_{1},\ldots,\mathcal{G}_{|\bm{E}\setminus\bm{E}'|}=\mathcal{G}'$, where $\mathcal{G}_{i}$ is the edge-induced DAG obtained from removing edges $\{e_{1},\ldots,e_{i}\}$ from $\mathcal{G}$.
Observe that non-descendant relationships are preserved as we remove edges, i.e., if $A$ is a non-descendant of $B$ in $\mathcal{G}$, then it is also a non-descendant of $B$ in $\mathcal{G}_{i}$ for any $i\in\{1,\ldots,|\bm{E}\setminus\bm{E}'|\}$. So, we can apply Lemma 4.3 repeatedly: For any $i\in\{1,\ldots,|\bm{E}\setminus\bm{E}'|\}$, the edge $e_{i}$ is not $(\mathcal{G}_{i-1},\mathbbm{P})$-important, so $\mathbbm{P}$ is Markov with respect to $\mathcal{G}_{i}$. That is, $\mathbbm{P}$ is Markov with respect to $\mathcal{G}'$. ∎

Corollary 4.5.

Suppose $\mathbbm{P}$ is a distribution over $\bm{X}$ and $\mathcal{G}$ is a Bayes net over the same set of variables $\bm{X}$. Then, there is a polynomial time algorithm that uses an independence oracle for $\mathbbm{P}$ to decide whether or not $\mathbbm{P}$ is Markov with respect to $\mathcal{G}$.

Proof.

Consider the following algorithm Checker:

Given a Bayes net $\mathcal{G}$ over a DAG $(\bm{X},\bm{E}')$, consider a DAG $\mathcal{K}=(\bm{X},\bm{E})$, with $\bm{E}'\subseteq\bm{E}$, that is a complete supergraph of $(\bm{X},\bm{E}')$. If every edge $A\to B\in\bm{E}\setminus\bm{E}'$ satisfies $A\perp\!\!\!\perp B\mid\mathbf{Pa}_{\mathcal{G}}(B)\setminus\{A\}$, output YES. Otherwise, output NO.

Note that since any distribution is Markov with respect to the complete DAG (see Section 2.2), $\mathbbm{P}$ is Markov with respect to $\mathcal{K}$.

The correctness of Checker follows from Lemma 4.4. Checker runs in polynomial time as $\mathcal{K}$ can be created in polynomial time with respect to the size of $\bm{X}$, and the number of edges in $\bm{E}\setminus\bm{E}'$ to check is polynomial in the size of $\bm{X}$. ∎
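The Checker procedure can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the DAG is given as a mapping from nodes to parent lists, the complete supergraph $\mathcal{K}$ is obtained by orienting all missing edges along a topological order of the input DAG, and the independence oracle is modeled as a hypothetical callable `oracle(a, b, cond)`. The toy oracle at the end is hand-coded for a chain-structured distribution and is not derived from actual data.

```python
from typing import Callable, Dict, FrozenSet, Hashable, List

Oracle = Callable[[Hashable, Hashable, FrozenSet[Hashable]], bool]


def topological_order(nodes: List[Hashable],
                      parents: Dict[Hashable, List[Hashable]]) -> List[Hashable]:
    """Repeatedly emit the nodes whose parents have all been emitted."""
    order: List[Hashable] = []
    placed = set()
    remaining = list(nodes)
    while remaining:
        ready = [x for x in remaining if set(parents.get(x, ())) <= placed]
        if not ready:
            raise ValueError("graph has a directed cycle")
        order.extend(ready)
        placed.update(ready)
        remaining = [x for x in remaining if x not in placed]
    return order


def checker(nodes: List[Hashable],
            parents: Dict[Hashable, List[Hashable]],
            oracle: Oracle) -> bool:
    """Query the oracle once per edge of K that is missing from the input DAG."""
    order = topological_order(nodes, parents)
    for idx, b in enumerate(order):
        pa_b = frozenset(parents.get(b, ()))
        for a in order[:idx]:
            if a in pa_b:
                continue  # a -> b is already an edge of the input DAG
            # a -> b is an edge of the complete supergraph only.
            if not oracle(a, b, pa_b - {a}):
                return False
    return True


# Hand-coded oracle for a chain X1 -> X2 -> X3 (illustrative assumption):
# the only independence that holds is X1 _||_ X3 given {X2}.
facts = {("X1", "X3", frozenset({"X2"}))}
toy_oracle = lambda a, b, s: (a, b, s) in facts or (b, a, s) in facts

nodes = ["X1", "X2", "X3"]
chain_ok = checker(nodes, {"X2": ["X1"], "X3": ["X2"]}, toy_oracle)
empty_ok = checker(nodes, {}, toy_oracle)
```

The number of oracle calls is at most $\binom{n}{2}$, matching the polynomial bound argued in the proof.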

We are now ready to formally prove our first main result.

See 1.3

Proof.

It suffices to show that the existence of a polynomial time algorithm for REALIZABLE-LEARN implies that LEARN-DBFAS instances can be answered in polynomial time with access to an independence oracle; see Figure 2.

Suppose we have a polynomial time algorithm Learner for REALIZABLE-LEARN. Let us define a reduction algorithm Reduction as follows:

Given an instance $(\bm{X},\mathbbm{P},p)$ of LEARN-DBFAS, run Learner to obtain a Bayes net $\mathcal{G}$ (see Section 3.1; this is a natural assumption for algorithms solving a search promise problem). Compute the number of parameters of $\mathcal{G}$. Run algorithm Checker of Corollary 4.5 on $\mathcal{G}$ to check whether or not $\mathbbm{P}$ is Markov with respect to $\mathcal{G}$. Output YES if $\mathcal{G}$ has at most $p$ parameters and $\mathbbm{P}$ is Markov with respect to $\mathcal{G}$; else, output NO.

The correctness of Reduction follows from the assumption that Learner produces a Bayes net $\mathcal{G}$ with at most $p$ parameters such that $\mathbbm{P}$ is Markov with respect to $\mathcal{G}$ (if the underlying promise is satisfied; otherwise, the output is an arbitrary Bayes net), and the correctness of Checker. By assumption, Learner is a polynomial time algorithm. By Lemma 4.1, we can compute the number of parameters of $\mathcal{G}$ in polynomial time. By Corollary 4.5, Checker is also a polynomial time algorithm. Therefore, the overall running time for Reduction is polynomial. ∎
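The control flow of Reduction is short enough to write out. In the Python sketch below, `learner`, `count_parameters`, and `is_markov` are hypothetical callables standing in for Learner, the computation of Lemma 4.1, and Checker, respectively; the stubs in the usage example are placeholders and do not implement those procedures.

```python
from typing import Any, Callable, Tuple

Instance = Tuple[Any, Any, int]  # (variables X, access to P, parameter bound p)


def reduction(instance: Instance,
              learner: Callable[[Any, Any, int], Any],
              count_parameters: Callable[[Any], int],
              is_markov: Callable[[Any, Any], bool]) -> str:
    """Answer a LEARN-DBFAS instance using a black-box learner."""
    variables, dist_access, p = instance
    # The learner always returns some Bayes net, even if the promise fails.
    net = learner(variables, dist_access, p)
    # Accept only if the returned net is a valid witness for the instance.
    if count_parameters(net) <= p and is_markov(net, dist_access):
        return "YES"
    return "NO"


# Placeholder stubs: the learner returns a fixed net with 10 parameters
# with respect to which the distribution is Markov.
stub_learner = lambda xs, dist, p: "some-net"
stub_count = lambda net: 10
stub_markov = lambda net, dist: True

answer_loose = reduction((["X1"], None, 10), stub_learner, stub_count, stub_markov)
answer_tight = reduction((["X1"], None, 9), stub_learner, stub_count, stub_markov)
```

Because the output of the learner is verified rather than trusted, a NO answer is returned whenever the learner fails to produce a valid witness.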

5 Approximating Bayes Nets

Our strategy for proving our finite sample complexity result (Theorem 1.4) follows that of [BCD20, Theorem 10], but we specialize the analysis to the setting where we are given a parameter bound instead of a degree bound. As discussed in Section 1, our result is a generalization of their result since an upper bound on the in-degrees implies a (possibly loose) parameter upper bound.

5.1 Some Graph Counting Arguments

To prove Theorem 1.4, we require an upper bound on the number of possible Bayes nets on $n$ nodes that have at most $p$ parameters (Lemma 5.2). To obtain such a result, we first relate the number of parameters $p$ with a specific given in-degree sequence $(d_{1}^{-},\ldots,d_{n}^{-})$ of a Bayes net, then we upper bound the total number of Bayes nets that have at most $p$ parameters by summing over all suitable in-degree sequences $\bm{d}^{-}=(d_{1}^{-},\ldots,d_{n}^{-})$.

Consider an arbitrary Bayes net $\mathcal{G}$ with in-degree sequence $(d_{1}^{-},\ldots,d_{n}^{-})$ and each node taking on $|\Sigma|$ values. The conditional distribution for vertex $X_{i}$ is fully described once we know $\mathbbm{P}(x_{i}\mid\mathbf{pa}_{\mathcal{G}}(X_{i}))$ for $|\Sigma|-1$ possible values of $x_{i}$, for each of the $|\Sigma|^{d_{i}^{-}}$ possible values of $\mathbf{pa}_{\mathcal{G}}(X_{i})$. Therefore, we see that the Bayes net has

\[ \sum_{i=1}^{n}\left(\left(|\Sigma|-1\right)|\Sigma|^{d_{i}^{-}}\right)=\left(|\Sigma|-1\right)\left(\sum_{i=1}^{n}|\Sigma|^{d_{i}^{-}}\right) \]

parameters. Note that this is the exact same reasoning as in Lemma 4.1. So, if the Bayes net has at most $p$ parameters, then

\[ \sum_{i=1}^{n}|\Sigma|^{d_{i}^{-}}=|\Sigma|^{d_{1}^{-}}+\ldots+|\Sigma|^{d_{n}^{-}}\leq\frac{p}{|\Sigma|-1}. \tag{2} \]

By the AM-GM inequality, we have that

\[ \sum_{i=1}^{n}|\Sigma|^{d_{i}^{-}}\geq n\left(\prod_{i=1}^{n}|\Sigma|^{d_{i}^{-}}\right)^{\frac{1}{n}}=n|\Sigma|^{\frac{1}{n}\sum_{i=1}^{n}d_{i}^{-}}. \tag{3} \]

Combining Equations 2 and 3 gives us

\[ d_1^- + \ldots + d_n^- \leq n \frac{\log\left(\frac{p}{n\left(|\Sigma| - 1\right)}\right)}{\log|\Sigma|}. \tag{4} \]
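As a sanity check of Equation 4, the following minimal Python sketch enumerates in-degree sequences for small instances and compares the largest feasible degree sum against the right-hand side of Equation 4. The function names are hypothetical, and the parameter count $(|\Sigma|-1)\sum_i |\Sigma|^{d_i^-}$ assumes every node takes exactly $|\Sigma|$ values, which is how Equation 2 is read here.

```python
import math
from itertools import product


def parameter_count(in_degrees, sigma):
    # Assumption: each node has (sigma - 1) free parameters per parent configuration.
    return sum((sigma - 1) * sigma ** d for d in in_degrees)


def degree_sum_bound(n, p, sigma):
    # Right-hand side of Equation 4.
    return n * math.log(p / (n * (sigma - 1))) / math.log(sigma)


def max_degree_sum(n, p, sigma):
    # Brute force over all sequences in {0, ..., n-1}^n; sequences need not be
    # realizable by a DAG, which only makes the check more demanding.
    best = 0
    for seq in product(range(n), repeat=n):
        if parameter_count(seq, sigma) <= p:
            best = max(best, sum(seq))
    return best


if __name__ == "__main__":
    n, p, sigma = 4, 10, 2
    print(max_degree_sum(n, p, sigma), degree_sum_bound(n, p, sigma))
```

The enumeration is exponential in $n$ and is only meant for very small instances.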

The following lemma is a combinatorial fact that upper bounds the number of DAGs that realize a given in-degree sequence; it may be of independent interest beyond its use in proving Lemma 5.2.

Lemma 5.1.

Given an in-degree sequence $\bm{d}^- = (d_1^-, \ldots, d_n^-)$ with non-negative integers $d_1^-, \ldots, d_n^-$, there are at most $\prod_{i=1}^{n} \binom{n-1}{d_i^-}$ DAGs that realize $\bm{d}^-$.

Proof.

Fix an arbitrary labelling of vertices from $X_1$ to $X_n$ and consider the sequential process of adding incoming edges to $X_1, \ldots, X_n$. For $X_1$, there are $\binom{n-1}{d_1^-}$ ways to add $d_1^-$ incoming edges that end at $X_1$. For $X_2$, there are at most $\binom{n-1}{d_2^-}$ possibilities. For $X_3$, there are at most $\binom{n-1}{d_3^-}$ possibilities. Note that some of these choices would be incompatible with earlier edge choices, as the newly added edges may cause directed cycles to be formed; ignoring this constraint only overcounts. We repeat this edge-adding process until all vertices have added their incoming edges to the graph.
So, the upper bound is $\prod_{i=1}^{n} \binom{n-1}{d_i^-}$. ∎
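To illustrate Lemma 5.1 on tiny instances, the sketch below counts, by brute force, the labelled DAGs that realize a given in-degree sequence and compares the count with the product of binomials. The helper names are hypothetical and the enumeration is exponential, so this is only a check for small $n$, not part of the proof.

```python
from itertools import combinations, product
from math import comb


def is_acyclic(parents):
    # parents[v] is the tuple of parents of vertex v. Repeatedly remove every
    # vertex whose parents have all been removed; a cycle blocks this process.
    n = len(parents)
    removed = set()
    while len(removed) < n:
        ready = [v for v in range(n)
                 if v not in removed and set(parents[v]) <= removed]
        if not ready:
            return False
        removed.update(ready)
    return True


def count_dags_with_in_degrees(in_degrees):
    # Enumerate every choice of d_v parents for each vertex v, keep acyclic ones.
    n = len(in_degrees)
    choices = [
        list(combinations([u for u in range(n) if u != v], d))
        for v, d in enumerate(in_degrees)
    ]
    return sum(1 for parents in product(*choices) if is_acyclic(parents))


def lemma_5_1_bound(in_degrees):
    # Product of binomial coefficients from Lemma 5.1.
    n = len(in_degrees)
    bound = 1
    for d in in_degrees:
        bound *= comb(n - 1, d)
    return bound


if __name__ == "__main__":
    seq = (0, 1, 2)
    print(count_dags_with_in_degrees(seq), lemma_5_1_bound(seq))
```

For the sequence $(1, 1, 1)$ on three vertices no DAG exists (every DAG has a source), while the bound evaluates to 8, which shows how much slack the cycle-agnostic count can have.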

Lemma 5.2.

Suppose that every node takes on at most $|\Sigma|$ values. Then, there are at most

\[ \left(n-1\right)^{\frac{n \log\left(\frac{p}{n\left(|\Sigma|-1\right)}\right)}{\log|\Sigma|}} e^{n} \left( \frac{\log\left(\frac{p}{n\left(|\Sigma|-1\right)}\right)}{\log|\Sigma|} + 1 \right)^{n} \]

possible DAGs over $n$ nodes that may be used to define some Bayes net that has at most $p$ parameters.

Proof.

By Lemma 5.1, there are at most $\prod_{i=1}^{n} \binom{n-1}{d_i^-}$ possible DAGs realizing any fixed in-degree sequence $\bm{d}^- = (d_1^-, \ldots, d_n^-)$. Let $(\ast)$ denote the condition that an in-degree sequence $\bm{d}^-$ yields a graph that has at most $p$ parameters. Then,

\begin{align*}
\sum_{\bm{d}^- \text{ satisfies } (\ast)} \prod_{i=1}^{n} \binom{n-1}{d_i^-}
&\leq \sum_{\bm{d}^- \text{ satisfies } (\ast)} \left(n-1\right)^{d_1^- + \ldots + d_n^-} \\
&\leq \left(n-1\right)^{n \frac{\log\left(\frac{p}{n\left(|\Sigma|-1\right)}\right)}{\log|\Sigma|}} \sum_{\bm{d}^- \text{ satisfies } (\ast)} 1 \\
&\leq \left(n-1\right)^{n \frac{\log\left(\frac{p}{n\left(|\Sigma|-1\right)}\right)}{\log|\Sigma|}} \binom{n \frac{\log\left(\frac{p}{n\left(|\Sigma|-1\right)}\right)}{\log|\Sigma|} + n}{n} \\
&\leq \left(n-1\right)^{n \frac{\log\left(\frac{p}{n\left(|\Sigma|-1\right)}\right)}{\log|\Sigma|}} \left( e \left( \frac{\log\left(\frac{p}{n\left(|\Sigma|-1\right)}\right)}{\log|\Sigma|} + 1 \right) \right)^{n}
\end{align*}

where the first inequality holds because $\binom{n}{k} \leq n^{k}$, the last inequality holds because $\binom{n}{k} \leq \left(\frac{en}{k}\right)^{k}$, the second inequality is due to Equation 4, and the third inequality is obtained via standard “stars and bars” counting. That is, we introduce an auxiliary variable $d_0^-$ and count the number of non-negative integer solutions of

\[ d_0^- + d_1^- + \ldots + d_n^- = n \frac{\log\left(\frac{p}{n\left(|\Sigma|-1\right)}\right)}{\log|\Sigma|}. \] ∎
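The bound of Lemma 5.2 can be compared with exact counts on very small instances. The sketch below, with hypothetical function names, enumerates all labelled DAGs on $n$ nodes, keeps those whose parameter count $(|\Sigma|-1)\sum_i |\Sigma|^{d_i^-}$ is at most $p$ (an assumption matching Equation 2 when every node takes exactly $|\Sigma|$ values), and evaluates the closed-form bound. It assumes $p \geq n(|\Sigma|-1)$ so that the logarithm is non-negative.

```python
import math
from itertools import combinations, product


def is_acyclic(parents):
    # Remove vertices whose parents are all removed; a cycle blocks the process.
    n = len(parents)
    removed = set()
    while len(removed) < n:
        ready = [v for v in range(n)
                 if v not in removed and set(parents[v]) <= removed]
        if not ready:
            return False
        removed.update(ready)
    return True


def count_dags_with_at_most_p_parameters(n, p, sigma):
    def parent_sets(v):
        # All subsets of the other n - 1 vertices.
        others = [u for u in range(n) if u != v]
        return [c for d in range(n) for c in combinations(others, d)]

    count = 0
    for parents in product(*(parent_sets(v) for v in range(n))):
        params = sum((sigma - 1) * sigma ** len(ps) for ps in parents)
        if params <= p and is_acyclic(parents):
            count += 1
    return count


def lemma_5_2_bound(n, p, sigma):
    # Closed-form expression from Lemma 5.2; assumes p >= n * (sigma - 1).
    ratio = math.log(p / (n * (sigma - 1))) / math.log(sigma)
    return (n - 1) ** (n * ratio) * math.e ** n * (ratio + 1) ** n


if __name__ == "__main__":
    for p in range(3, 8):
        print(p, count_dags_with_at_most_p_parameters(3, p, 2),
              lemma_5_2_bound(3, p, 2))
```

With $n = 3$, $|\Sigma| = 2$, and $p = 7$, every one of the 25 labelled DAGs on three nodes is admissible, so the exact count saturates while the bound keeps growing with $p$.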

5.2 Proof of Theorem 1.4

We are now ready to prove Theorem 1.4. Lemma 2.6 tells us that it suffices to approximate each local conditional distribution at each node well. So, we will consider an $\frac{\varepsilon}{n}$-net over all such distributions and then apply a tournament-style argument (Theorem 2.5) to pick a good candidate amongst the joint distributions obtained by combining such candidate local distributions.

See Theorem 1.4.

Proof.

Fix a DAG $\mathcal{G}$ satisfying an arbitrary in-degree sequence $\bm{d}^- = \left(d_1^-, \ldots, d_n^-\right)$. Then, there are $|\Sigma|^{d_1^-} + \ldots + |\Sigma|^{d_n^-}$ local conditional distributions for any Bayes net over $\mathcal{G}$. From Equation 2 above, we know that

\[ \sum_{i=1}^{n} |\Sigma|^{d_i^-} = |\Sigma|^{d_1^-} + \ldots + |\Sigma|^{d_n^-} \leq \frac{p}{|\Sigma| - 1}. \]

Now, consider an arbitrary local distribution over $k = |\Sigma|$ values and let us upper bound the number of points in an $\frac{\varepsilon}{n}$-net for this metric space. Observe that each possible distribution is essentially an element of the probability simplex $\Delta_k$. To get an $\frac{\varepsilon}{n}$-net of $\Delta_k$, we discretize vectors by rounding them to their nearest multiple of $\frac{\varepsilon}{n|\Sigma|}$. If $\pi$ is a probability vector, and $r_\pi$ is its rounding, then $\left\|\pi - r_\pi\right\|_1 \leq \frac{\varepsilon}{n|\Sigma|} |\Sigma| = \frac{\varepsilon}{n}$. Therefore the number of discretized vectors is at most $\mathcal{O}\left(\left(n|\Sigma|/\varepsilon\right)^{|\Sigma|}\right)$.
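The rounding step above can be sketched in a few lines of Python. The function names are hypothetical; the sketch rounds each coordinate to the nearest multiple of $\frac{\varepsilon}{n|\Sigma|}$ and measures the resulting $\ell_1$ error. Note that, as in the text, the rounded vector is a grid point and is not renormalized, so it need not sum to exactly one.

```python
def round_to_grid(pi, n, eps):
    # Round every coordinate of pi to the nearest multiple of eps / (n * |Sigma|),
    # where |Sigma| is taken to be the length of pi.
    step = eps / (n * len(pi))
    return [round(x / step) * step for x in pi]


def l1_distance(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


if __name__ == "__main__":
    pi = [0.5, 0.3, 0.2]
    n, eps = 5, 0.1
    r = round_to_grid(pi, n, eps)
    # The l1 error should be at most eps / n.
    print(l1_distance(pi, r), eps / n)
```

Since each coordinate moves by at most half a grid step, the actual error is at most $\frac{\varepsilon}{2n}$, which is within the $\frac{\varepsilon}{n}$ stated in the proof.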

Therefore, for any fixed DAG $\mathcal{G}$, there is a set of

\[ m_1 \in \mathcal{O}\left(\left(n|\Sigma|/\varepsilon\right)^{\frac{p|\Sigma|}{|\Sigma|-1}}\right) \tag{5} \]

distributions that $\frac{\varepsilon}{n}$-cover any possible joint distributions that can be Markov with respect to a Bayes net over $\mathcal{G}$. Meanwhile, by Lemma 5.2, there are at most

\[ m_2 := \left(n-1\right)^{\frac{n \log\left(\frac{p}{n\left(|\Sigma|-1\right)}\right)}{\log|\Sigma|}} e^{n} \left( \frac{\log\left(\frac{p}{n\left(|\Sigma|-1\right)}\right)}{\log|\Sigma|} + 1 \right)^{n} \tag{6} \]

possible DAGs that may be used to define a Bayes net on $n$ nodes that has at most $p$ parameters.

We can now define a set of distributions $\boldsymbol{\mathbb{Q}}$ over $n$ variables such that there exists $\mathbbm{Q}^* \in \boldsymbol{\mathbb{Q}}$ with $d_{\mathrm{TV}}(\mathbbm{P}, \mathbbm{Q}^*) \leq \varepsilon$. Let us denote $m = |\boldsymbol{\mathbb{Q}}|$. Putting together the above bounds, we see that $m = m_1 \cdot m_2$ candidates suffice, where $m_1$ and $m_2$ are from Equations 5 and 6. Therefore, with

\[ \mathcal{O}\!\left(\frac{\log\frac{1}{\delta}}{\varepsilon^{2}} \log m\right) \subseteq \mathcal{O}\!\left(\frac{\log\frac{1}{\delta}}{\varepsilon^{2}} \left( p \log\left(\frac{n|\Sigma|}{\varepsilon}\right) + \frac{n \log\left(\frac{p}{n\left(|\Sigma|-1\right)}\right)}{\log|\Sigma|} \log n \right)\right) \]

samples from $\mathbbm{P}$, Theorem 2.5 chooses a distribution $\mathbbm{Q}$ amongst the $m$ candidates such that $d_{\mathrm{TV}}(\mathbbm{P}, \mathbbm{Q}) \leq \varepsilon$ with success probability at least $1 - \delta$. ∎
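To get a feel for how the sample bound scales, the sketch below evaluates $\log m = \log m_1 + \log m_2$ directly from Equations 5 and 6 and plugs it into $\frac{\log(1/\delta)}{\varepsilon^2} \log m$. The function names are hypothetical, and all constants hidden by the $\mathcal{O}(\cdot)$ notation are set to 1 (exposed as the `constant` argument), so the output is a scaling illustration rather than an actual sample size. It assumes $n \geq 2$ and $p \geq n(|\Sigma|-1)$.

```python
import math


def log_candidate_count(n, p, sigma, eps):
    # log m1 from Equation 5, with the hidden constant taken to be 1 (assumption).
    log_m1 = (p * sigma / (sigma - 1)) * math.log(n * sigma / eps)
    # log m2 from Equation 6.
    ratio = math.log(p / (n * (sigma - 1))) / math.log(sigma)
    log_m2 = n * ratio * math.log(n - 1) + n + n * math.log(ratio + 1)
    return log_m1 + log_m2


def sample_bound(n, p, sigma, eps, delta, constant=1.0):
    # (log(1/delta) / eps^2) * log m, up to the unspecified constant.
    return constant * math.log(1 / delta) / eps ** 2 * log_candidate_count(n, p, sigma, eps)


if __name__ == "__main__":
    print(sample_bound(n=10, p=40, sigma=2, eps=0.1, delta=0.05))
```

Halving $\varepsilon$ increases the value by slightly more than a factor of four, reflecting the $\varepsilon^{-2}$ term together with the $\log(1/\varepsilon)$ dependence inside $\log m_1$.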

6 Conclusion

In this work, we showed the hardness of finding a parameter-bounded Bayes net that represents some distribution $\mathbbm{P}$, given sample access to $\mathbbm{P}$, even under the promise that such a Bayes net exists. On a positive note, we gave a finite sample complexity bound sufficient to produce a Bayes net representing a probability distribution $\mathbbm{Q}$ that is close in TV distance to $\mathbbm{P}$. Our results generalize earlier known results of [CHM04] and [BCD20] respectively.

An intriguing open question is as follows:

Suppose we are given sample access to a distribution $\mathbbm{P}$ and are promised that there exists a Bayes net on $\mathcal{G}$ with at most $p$ parameters such that $\mathbbm{P}$ is Markov with respect to $\mathcal{G}$. Is it hard to find a Bayes net $\mathcal{G}'$ that has $\alpha \cdot p$ parameters such that $\mathbbm{P}$ is Markov with respect to $\mathcal{G}'$ (where $\mathcal{G}'$ may not be $\mathcal{G}$), for some constant $\alpha > 1$?

Note that the hardness construction of [CHM04] only displayed an additive gap in the parameter bound. We conjecture that it is also hard to obtain such a multiplicative gap in the parameter bound, even in the promise setting.

Acknowledgements

We would like to thank Dimitris Zoros for helpful discussions. This research/project is supported by the National Research Foundation, Singapore under its AI Singapore Programme (AISG Award No: AISG-PhD/2021-08-013). The work of AB was supported in part by National Research Foundation Singapore under its NRF Fellowship Programme (NRF-NRFFAI-2019-0002) and an Amazon Faculty Research Award. SG’s work was partially supported by the SERB CRG Award CRG/2022/007985 and an IIT Kanpur initiation grant. Part of this work was done while the authors were visiting the Simons Institute for the Theory of Computing.

References

  • [BCD20] Johannes Brustle, Yang Cai, and Constantinos Daskalakis. Multi-item mechanisms without item-independence: Learnability via robustness. In Proceedings of the 21st ACM Conference on Economics and Computation, pages 715–761, 2020.
  • [CH92] Gregory F Cooper and Edward Herskovits. A Bayesian method for the induction of probabilistic networks from data. Machine learning, 9:309–347, 1992.
  • [Chi96] David Maxwell Chickering. Learning Bayesian networks is NP-complete. Learning from data: Artificial intelligence and statistics V, pages 121–130, 1996.
  • [Chi02] David Maxwell Chickering. Optimal structure identification with greedy search. Journal of machine learning research, 3(Nov):507–554, 2002.
  • [CHM04] Max Chickering, David Heckerman, and Chris Meek. Large-sample learning of Bayesian networks is NP-hard. Journal of Machine Learning Research, 5:1287–1330, 2004.
  • [Das99] Sanjoy Dasgupta. Learning polytrees. In Kathryn B. Laskey and Henri Prade, editors, UAI ’99: Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, Stockholm, Sweden, July 30 - August 1, 1999, pages 134–141. Morgan Kaufmann, 1999.
  • [DK14] Constantinos Daskalakis and Gautam Kamath. Faster and sample near-optimal algorithms for proper learning mixtures of gaussians. In Conference on Learning Theory, pages 1183–1213. PMLR, 2014.
  • [DL01] Luc Devroye and Gábor Lugosi. Combinatorial methods in density estimation. Springer Science & Business Media, 2001.
  • [EG08] Gal Elidan and Stephen Gould. Learning bounded treewidth Bayesian networks. In Daphne Koller, Dale Schuurmans, Yoshua Bengio, and Léon Bottou, editors, Advances in Neural Information Processing Systems 21, Proceedings of the Twenty-Second Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 8-11, 2008, pages 417–424. Curran Associates, Inc., 2008.
  • [FNP13] Nir Friedman, Iftach Nachman, and Dana Pe’er. Learning Bayesian network structure from massive datasets: The “sparse candidate” algorithm. CoRR, abs/1301.6696, 2013.
  • [Gav77] Fanica Gavril. Some $\mathsf{NP}$-complete problems on graphs. In Proceedings of Conference on Information Sciences and Systems, pages 91–95, 1977.
  • [GK21] Robert Ganian and Viktoriia Korchemna. The complexity of Bayesian network learning: Revisiting the superstructure. In Marc’Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan, editors, Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pages 430–442, 2021.
  • [HGC95] David Heckerman, Dan Geiger, and David Maxwell Chickering. Learning Bayesian networks: The combination of knowledge and statistical data. Mach. Learn., 20(3):197–243, 1995.
  • [KSM22] Jack Kuipers, Polina Suter, and Giusi Moffa. Efficient sampling and structure learning of Bayesian networks. J. Comput. Graph. Stat., 31(3):639–650, 2022.
  • [Pea88] Judea Pearl. Probabilistic reasoning in intelligent systems: networks of plausible inference. Morgan Kaufmann, 1988.
  • [SDLC93] David J Spiegelhalter, A Philip Dawid, Steffen L Lauritzen, and Robert G Cowell. Bayesian analysis in expert systems. Statistical science, pages 219–247, 1993.
  • [SGS00] Peter Spirtes, Clark N Glymour, and Richard Scheines. Causation, prediction, and search. MIT press, 2000.
  • [TK05] Marc Teyssier and Daphne Koller. Ordering-based search: A simple and effective algorithm for learning Bayesian networks. In UAI ’05, Proceedings of the 21st Conference in Uncertainty in Artificial Intelligence, Edinburgh, Scotland, July 26-29, 2005, pages 548–549. AUAI Press, 2005.
  • [Ver18] Roman Vershynin. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge university press, 2018.