Subtractive Mixture Models via Squaring:
Representation and Learning

Lorenzo Loconte1  Aleksanteri M. Sladek2  Stefan Mengel3
Martin Trapp2  Arno Solin2  Nicolas Gillis4  Antonio Vergari1
1 School of Informatics, University of Edinburgh, UK
2 Department of Computer Science, Aalto University, Finland
3 University of Artois, CNRS, Centre de Recherche en Informatique de Lens (CRIL), France
4 Department of Mathematics and Operational Research, Université de Mons, Belgium
Corresponding author, [email protected]
Abstract

Mixture models are traditionally represented and learned by adding several distributions as components. Allowing mixtures to subtract probability mass or density can drastically reduce the number of components needed to model complex distributions. However, learning such subtractive mixtures while ensuring they still encode a non-negative function is challenging. We investigate how to learn and perform inference on deep subtractive mixtures by squaring them. We do this in the framework of probabilistic circuits, which enable us to represent tensorized mixtures and generalize several other subtractive models. We theoretically prove that the class of squared circuits allowing subtractions can be exponentially more expressive than traditional additive mixtures; and, we empirically show this increased expressiveness on a series of real-world distribution estimation tasks.

1 Introduction

Finite mixture models (MMs) are a staple in probabilistic machine learning, as they offer a simple and elegant solution to model complex distributions by blending simpler ones in a linear combination (McLachlan et al., 2019). The classical recipe to design MMs is to compute a convex combination over input components. That is, a MM representing a probability distribution $p$ over a set of random variables $\mathbf{X}=\{X_1,X_2,\ldots,X_D\}$ is usually defined as

$$p(\mathbf{X})=\sum_{i=1}^{K} w_i\, p_i(\mathbf{X}),\quad\text{with}\quad w_i\geq 0,\quad \sum_{i=1}^{K} w_i=1, \tag{1}$$

where $w_i$ are the mixture parameters and each component $p_i$ is a mass or density function. This is the case for widely-used MMs such as Gaussian mixture models (GMMs) and hidden Markov models (HMMs) but also mixtures of generative models such as normalizing flows (Papamakarios et al., 2021) and deep mixture models such as probabilistic circuits (PCs, Vergari et al., 2019b).

[Uncaptioned images: 2-dimensional ring distribution (ground truth) and its approximations by a GMM and an NGMM]

The convexity constraint in Eq. 1 is the simplest sufficient condition to ensure $p$ is a non-negative function integrating to 1 (see footnote 1), i.e., is a valid probability distribution, and is often assumed in practice. However, this implies that the components $p_i$ can only be combined in an additive manner, which can greatly impact their ability to estimate a distribution efficiently. For instance, consider approximating distributions having “holes” in their domain, such as the simple 2-dimensional ring distribution on the left (ground truth). A classical additive MM such as a GMM would ultimately recover it, as it is a universal approximator of density functions (Nguyen et al., 2019), but only by using an unnecessarily high number of components (depicted as red ellipsoids). A MM allowing negative mixture weights, i.e., $w_i<0$, would instead require only two components, as it can subtract one inner Gaussian density from an outer one (NGMM, see dotted ellipsoid). We call these MMs subtractive or non-monotonic MMs (NMMs), as opposed to their classical additive counterpart, called monotonic MMs (Shpilka & Yehudayoff, 2010).
[Footnote 1: Across the paper we will abuse the term integration to also refer to summation in case of discrete variables.]

The challenge with NMMs is ensuring that the modeled $p(\mathbf{X})$ is a valid distribution, as the convexity constraint does not hold anymore. This problem has been investigated in the past in a number of ways, in its simplest form by imposing ad-hoc constraints over the mixture parameters $w_i$, derived for simple components such as Gaussian and Weibull distributions (Zhang & Zhang, 2005; Rabusseau & Denis, 2014; Jiang et al., 1999). However, different families of components would require formulating different constraints, whose closed-form existence is not guaranteed.

In this paper, we study a more general principle to design NMMs that circumvents the aforementioned limitation while ensuring non-negativity of the modeled function: squaring the encoded linear combination. For example, the NGMM above is a squared combination of Gaussian densities with negative mixture parameters. We theoretically investigate the expressive efficiency of squared NMMs, i.e., their expressiveness w.r.t. their model size, and show how to effectively represent and learn them in practice. Specifically, we do so in the framework of PCs, tractable models generalizing classical shallow MMs into deep MMs represented as structured neural networks. Deep PCs are already more expressive efficient than shallow MMs as they compactly encode a mixture with an exponential number of components (Jaini et al., 2018; Vergari et al., 2019b). However, they are classically represented with non-negative parameters, hence being restricted to encode deep but additive MMs. Instead, as a main theoretical contribution we prove that our squared non-monotonic PCs (NPC$^2$s) can be exponentially more parameter-efficient than their monotonic counterparts.

Contributions. i) We introduce a general framework to represent NMMs via squaring (Sec. 2), within the language of tensorized PCs (Mari et al., 2023), and show how NPC$^2$s can be effectively learned and used for tractable inference (Sec. 3). ii) We show how NPC$^2$s generalize not only monotonic PCs but other apparently different models allowing negative parameters that have emerged in different literatures, such as square root of density models in signal processing (Pinheiro & Vidakovic, 1997), positive semi-definite (PSD) models in kernel methods (Rudi & Ciliberto, 2021), and Born machines from quantum mechanics (Orús, 2013; Glasser et al., 2019) (Sec. 4). This allows us to understand why they lead to tractable inference via the property-oriented framework of PCs. iii) We derive an exponential lower bound over the size of monotonic PCs to represent functions that can be compactly encoded by one NPC$^2$ (Sec. 4.1), hence showing that NPC$^2$s (and thus the aforementioned models) can be much more expressive for a given size. Finally, iv) we provide empirical evidence (Sec. 5) that NPC$^2$s can approximate distributions better than monotonic PCs for a variety of experimental settings involving learning from real-world data and distilling intractable models such as large language models to unlock tractable inference (Zhang et al., 2023).

2 Subtractive Mixtures via Squaring

We start by formalizing how to represent shallow NMMs by squaring non-convex combinations of $K$ simple functions. Like exponentiation in energy-based models (LeCun et al., 2006), squaring ensures the non-negativity of our models but, unlike exponentiation, it allows us to tractably renormalize them. A squared NMM encodes a (possibly unnormalized) distribution $c^2(\mathbf{X})$ over variables $\mathbf{X}$ as

$$c^2(\mathbf{X})=\left(\sum_{i=1}^{K} w_i\, c_i(\mathbf{X})\right)^{2}=\sum_{i=1}^{K}\sum_{j=1}^{K} w_i w_j\, c_i(\mathbf{X})\, c_j(\mathbf{X}), \tag{2}$$

where $c_i$ are the learnable components and the mixture parameters $w_i\in\mathbb{R}$ are unconstrained, as opposed to Eq. 1. Squared NMMs can therefore represent $\binom{K+1}{2}$ components within the same parameter budget of $K$ components of an additive MM. Each component of a squared NMM computes a product of experts $c_i(\mathbf{X})c_j(\mathbf{X})$ (Hinton, 2002) allowing negative parameters $2w_iw_j$ if $i\neq j$, and $c_i^2(\mathbf{X})$ with $w_i^2$ otherwise. Fig. 1 shows a concrete example of this construction, which constitutes the simplest NPC$^2$ we can build (see Sec. 3), i.e., comprising a single layer and having depth one.
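As a quick numerical illustration of Eq. 2, the sketch below squares a signed combination of three univariate Gaussian components and compares it with the expanded double sum. The weights, means and standard deviations are hypothetical values chosen for illustration and are not taken from the paper.

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    """Density of a univariate Gaussian evaluated at x."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

# Hypothetical parameters: K = 3 components, one with a negative weight.
weights = np.array([1.0, -0.8, 0.5])
means = np.array([0.0, 0.0, 2.0])
stds = np.array([2.0, 0.7, 1.0])

xs = np.linspace(-6.0, 6.0, 241)
# comps[i, n] = c_i(x_n)
comps = np.stack([gaussian_pdf(xs, m, s) for m, s in zip(means, stds)])

# Left-hand side of Eq. 2: square the signed linear combination.
c = weights @ comps
squared = c ** 2

# Right-hand side of Eq. 2: double sum over all K^2 ordered pairs (i, j).
double_sum = np.einsum("i,j,in,jn->n", weights, weights, comps, comps)

# Merging the symmetric pairs (i, j) and (j, i) leaves K(K+1)/2 distinct products.
K = len(weights)
num_distinct = len([(i, j) for i in range(K) for j in range(i, K)])
```

In this example the unsquared combination `c` takes negative values near the origin, while `squared` is non-negative everywhere on the grid, which is the property that squaring is meant to guarantee.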

Tractable marginalization. Analogously to traditional MMs, squared NMMs support tractable marginalization and conditioning, if their component distributions do as well. The distribution encoded by $c^2(\mathbf{X})$ can be normalized to compute a valid probability distribution $p(\mathbf{X})=c^2(\mathbf{X})/Z$, by computing its partition function $Z$ as

$$Z=\int c^2(\mathbf{x})\,\mathrm{d}\mathbf{x}=\sum_{i=1}^{K}\sum_{j=1}^{K} w_i w_j \int c_i(\mathbf{x})\, c_j(\mathbf{x})\,\mathrm{d}\mathbf{x}. \tag{3}$$

Computing $Z$ translates to evaluating $\binom{K+1}{2}$ integrals over products of components $c_i(\mathbf{X})c_j(\mathbf{X})$. More generally, marginalizing any subset of variables in $\mathbf{X}$ can be done in $\mathcal{O}(K^2)$. This, however, requires that the components $c_i$ are chosen from a family of functions such that their product $c_i(\mathbf{X})c_j(\mathbf{X})$ can be tractably integrated, and that $Z$ is non-zero and finite. This is true for many parametric families, including exponential families (Seeger, 2005). For instance, the product of two Gaussian or two categorical distributions is another Gaussian (Rasmussen & Williams, 2005) or categorical up to a multiplicative factor, which can be computed in polynomial time.
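To make Eq. 3 concrete for Gaussian components, the following sketch computes $Z$ in closed form and checks it against numerical integration. It relies on the standard identity $\int \mathcal{N}(x;\mu_i,\sigma_i^2)\,\mathcal{N}(x;\mu_j,\sigma_j^2)\,\mathrm{d}x=\mathcal{N}(\mu_i;\mu_j,\sigma_i^2+\sigma_j^2)$ for univariate Gaussians. The function name `partition_function` and all parameter values are assumptions made for this illustration.

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    """Density of a univariate Gaussian with the given mean and variance."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def partition_function(weights, means, variances):
    """Z = sum_ij w_i w_j * integral of c_i(x) c_j(x) dx for Gaussian c_i."""
    # overlap[i, j] = N(mean_i; mean_j, var_i + var_j), a K x K symmetric matrix.
    overlap = gaussian_pdf(means[:, None], means[None, :],
                           variances[:, None] + variances[None, :])
    return weights @ overlap @ weights

# Hypothetical parameters with one negative weight.
weights = np.array([1.0, -0.8, 0.5])
means = np.array([0.0, 0.0, 2.0])
variances = np.array([4.0, 0.49, 1.0])

Z = partition_function(weights, means, variances)

# Numerical check: Riemann sum of c(x)^2 on a wide, fine grid.
xs = np.linspace(-30.0, 30.0, 200001)
c = sum(w * gaussian_pdf(xs, m, v) for w, m, v in zip(weights, means, variances))
Z_numeric = np.sum(c ** 2) * (xs[1] - xs[0])
```

Because the overlap matrix is symmetric, only its upper triangle (including the diagonal) needs to be computed, which corresponds to the $\binom{K+1}{2}$ integrals mentioned above.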

A wider choice of components. Note that we do not require each $c_i$ to model a probability distribution, e.g., we might have $c_i(\mathbf{x})<0$. This allows us to employ more expressive tractable functions as base components in squared NMMs such as splines (see App. E for details) or potentially small neural networks (see discussion in App. G). However, if the components are already flexible enough there might not be an increase in expressiveness when mixing them in a linear combination or squaring them. E.g., a simple categorical distribution can already capture any discrete distribution with finite support and a (subtractive) mixture thereof might not yield additional benefits besides being easier to learn. An additive mixture of Binomials is instead more expressive than a single Binomial, but expected to be less expressive than its subtractive version (as illustrated in Sec. 5).

Learning squared NMMs. The canonical way to learn traditional MMs (Eq. 1) is by maximum-likelihood estimation (MLE), i.e., by maximizing $\sum_{\mathbf{x}\in\mathcal{D}}\log p(\mathbf{x})$ where $\mathcal{D}$ is a set of independent and identically distributed (i.i.d.) samples. For squared NMMs, the MLE objective is

$$\sum_{\mathbf{x}\in\mathcal{D}}\log\left(c^2(\mathbf{x})/Z\right)=-|\mathcal{D}|\log Z+2\sum_{\mathbf{x}\in\mathcal{D}}\log|c(\mathbf{x})|, \tag{4}$$

where $c(\mathbf{x})=\sum_{i=1}^{K}w_ic_i(\mathbf{x})$. Unlike other NMMs mentioned in Sec. 1, we do not need to derive additional closed-form constraints for the parameters to preserve non-negativity. Although materializing the squared mixture having $\binom{K+1}{2}$ components is required to compute $Z$ as in Eq. 3, evaluating $\log|c(\mathbf{x})|$ is linear in $K$. Hence, we can perform batched stochastic gradient-based optimization and compute $Z$ just once per batch, which makes NMMs efficient to learn (see App. C).
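The objective in Eq. 4 can be sketched for univariate Gaussian components as follows. This is a minimal illustration under stated assumptions rather than the training code used in the paper: the function name `squared_nmm_log_likelihood`, the synthetic data and the parameter values are all hypothetical, and the closed-form Gaussian overlap integral is the standard identity for products of Gaussian densities.

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    """Density of a univariate Gaussian with the given mean and variance."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def squared_nmm_log_likelihood(data, weights, means, variances):
    """Eq. 4: -|D| log Z + 2 * sum over x in D of log|c(x)|."""
    # c(x) for every data point; the cost per point is linear in K.
    comps = gaussian_pdf(data[:, None], means[None, :], variances[None, :])
    c = comps @ weights
    # Z is computed once for the whole batch; the cost is quadratic in K.
    overlap = gaussian_pdf(means[:, None], means[None, :],
                           variances[:, None] + variances[None, :])
    Z = weights @ overlap @ weights
    return -len(data) * np.log(Z) + 2.0 * np.sum(np.log(np.abs(c)))

# Hypothetical batch of data and parameters.
rng = np.random.default_rng(0)
data = rng.normal(size=64)
weights = np.array([1.0, -0.8, 0.5])
means = np.array([0.0, 0.0, 2.0])
variances = np.array([4.0, 0.49, 1.0])

ll = squared_nmm_log_likelihood(data, weights, means, variances)
# Rescaling (or flipping the sign of) all weights leaves the normalized model unchanged.
ll_rescaled = squared_nmm_log_likelihood(data, -3.0 * weights, means, variances)
```

The two values `ll` and `ll_rescaled` agree because multiplying all weights by a constant $a$ scales $c^2$ and $Z$ by the same factor $a^2$, which is one reason no constraint on the sign or scale of the weights is needed.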

Figure 1: Shallow MMs and squared NMMs represented as PCs, mapped to a computational graph having input components and a weighted sum unit as output. Squaring a mixture with $K=3$ components (left) can yield more components that share parameters (right).

3 Squaring Deep Mixture Models

So far, we dealt with mixtures that are shallow, i.e., that can be represented as simple computational graphs with a single weighted sum unit (e.g., Fig. 1). We now generalize them in the framework of PCs (Vergari et al., 2019b; Choi et al., 2020; Darwiche, 2001) as they offer a property-driven language to model structured neural networks which allow tractable inference. PCs enable us to encode an exponential number of mixture components in a compact but deep computational graph.

PCs are usually defined in terms of scalar computational units: sum, product and input (see App. A). Following Vergari et al. (2019a); Mari et al. (2023), we instead formalize them as tensorized computational graphs. That is, we group several computational units together in layers, whose advantage is twofold. First, we are able to derive a simplified tractable algorithm for squaring that requires only linear algebra operations and benefits from GPU acceleration (Alg. 1). Second, we can more easily generalize many recent PC architectures (Peharz et al., 2020b; a; Liu & Van den Broeck, 2021), as well as other tractable tensor representations (Sec. 4). Fig. A.1 illustrates how scalar computational units are mapped to tensorized layers. We start by defining deep computational graphs that can model possibly negative functions, simply named circuits (Vergari et al., 2021).

Definition 1 (Tensorized circuit).

A tensorized circuit $c$ is a parameterized computational graph encoding a function $c(\mathbf{X})$ and comprising three kinds of layers: input, product and sum. Each layer comprises computational units defined over the same set of variables, also called its scope, and every non-input layer receives input from one or more layers. The scope of each non-input layer is the union of the scope of its inputs, and the scope of the output layer computing $c(\mathbf{X})$ is $\mathbf{X}$. Each input layer $\boldsymbol{\ell}$ has scope $\mathbf{Y}\subseteq\mathbf{X}$ and computes a collection of $K$ functions $f_i(\mathbf{Y})\in\mathbb{R}$, i.e., $\boldsymbol{\ell}$ outputs a $K$-dimensional vector. Each product layer $\boldsymbol{\ell}$ computes a Hadamard (or element-wise) product over the $N$ layers it receives as input, i.e., $\boldsymbol{\ell}=\odot_{i=1}^{N}\boldsymbol{\ell}_i$. A sum layer with $S$ sum units and receiving input from a previous layer $\boldsymbol{\ell}\in\mathbb{R}^{K}$ is parameterized by $\mathbf{W}\in\mathbb{R}^{S\times K}$ and computes $\mathbf{W}\boldsymbol{\ell}$.
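The three layer types of Def. 1 can be sketched in a few lines of NumPy. The example below assumes two discrete variables whose input layers are stored as real-valued lookup tables; the helper names (`input_layer`, `product_layer`, `sum_layer`) and the sizes `K` and `V` are hypothetical choices for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
K, V = 4, 3  # hypothetical: K units per layer, V values per discrete variable

# Input layers as lookup tables: F[i, v] = f_i(v), possibly negative.
F1 = rng.normal(size=(K, V))
F2 = rng.normal(size=(K, V))
# Sum layer with S = 1 unit and unconstrained real parameters.
W = rng.normal(size=(1, K))

def input_layer(table, value):
    """Return the K-dimensional vector of input functions evaluated at `value`."""
    return table[:, value]

def product_layer(*inputs):
    """Hadamard (element-wise) product of the input layers."""
    out = inputs[0]
    for layer in inputs[1:]:
        out = out * layer
    return out

def sum_layer(weight_matrix, layer):
    """Linear combination W @ l of the units of a single input layer."""
    return weight_matrix @ layer

def circuit(x1, x2):
    """Evaluate c(x1, x2) with one feed-forward pass."""
    l1 = input_layer(F1, x1)
    l2 = input_layer(F2, x2)
    return sum_layer(W, product_layer(l1, l2))
```

Calling `circuit(0, 1)` returns a vector with a single entry, since the output sum layer here has $S=1$ unit; with real-valued `W` and tables that entry may be negative, which is why the result is a circuit rather than a PC.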

Fig. 2b shows a deep circuit in tensorized form. To model a distribution via circuits we first require that the output of the computational graph is non-negative. We call such a circuit a PC. Similarly to shallow additive MMs (Eq. 1), a sufficient condition to ensure non-negativity of the output is to make the PC monotonic, i.e., to parameterize all sum layers with non-negative matrices and to restrict input layers to encode non-negative functions (e.g., probability mass or density functions). So far, monotonic PCs have been the canonical way to represent and learn PCs (App. G). In Def. 1 we presented product layers computing Hadamard products only, to simplify notation and because this implementation choice is commonly used in many existing PC architectures (Darwiche, 2009; Liu & Van den Broeck, 2021; Mari et al., 2023). We generalize our treatment of PCs in Def. A.6 to deal with another popular product layer implementation: Kronecker products (Peharz et al., 2020b; a; Mari et al., 2023). Our results hold for both kinds of product layers, unless specified otherwise.

Figure 2: Squaring tensorized structured-decomposable circuits reduces to squaring layers, depicted as colored boxes (input, product, and sum). Connections to a sum layer are labeled by the matrix parameterizing the layer, while connections to product layers are labeled by the Hadamard product sign (see also Fig. A.1). A tensorized structured-decomposable circuit (b) over three variables defined from the RG in (a) is squared in (c) by recursively squaring each layer via Alg. 1. Squared layers contain a quadratic number of units, but still output vectors.

3.1 Building Tractable Circuits for Marginalization

Deep PCs can be renormalized and can marginalize out any subset of $\mathbf{X}$ in a single feed-forward pass if they are smooth and decomposable, i.e., each sum layer receives inputs from layers whose units are defined over the same scopes, and each product layer receives inputs from layers whose scopes are pairwise disjoint, respectively (Darwiche, 2001; Choi et al., 2020). See Prop. A.1 for more background. Sum layers in our Def. 1 guarantee smoothness by design as they have exactly one input. A simple way to ensure decomposability is to create a circuit that follows a hierarchical scope partitioning of variables $\mathbf{X}$, also called a region graph, which is formalized next.
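The single-pass marginalization described above can be illustrated on a small smooth and decomposable monotonic PC over two discrete variables: summing out a variable amounts to replacing its input layer by the vector of per-unit sums and running the usual forward pass. The tables, sizes and function names below are hypothetical and only serve as a sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
K, V = 4, 3  # hypothetical: K units per layer, V values per discrete variable

# Non-negative input tables and sum weights, i.e., a monotonic PC.
F1 = rng.random((K, V))
F2 = rng.random((K, V))
W = rng.random((1, K))

def forward(in1, in2):
    """One feed-forward pass: sum layer applied to a Hadamard product layer."""
    return (W @ (in1 * in2)).item()

# Partition function in a single pass: each input layer is summed over its variable.
Z_pass = forward(F1.sum(axis=1), F2.sum(axis=1))
# Brute-force check: enumerate all joint assignments.
Z_brute = sum(forward(F1[:, a], F2[:, b]) for a in range(V) for b in range(V))

def marginal_x1(a):
    """p(X1 = a): only the input layer of X2 is summed out."""
    return forward(F1[:, a], F2.sum(axis=1)) / Z_pass
```

The sum can be pushed down to the input layers because the product layer multiplies functions over disjoint scopes, and the sum layer is linear.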

Definition 2 (Region graph (Dennis & Ventura, 2012)).

Given a set of variables $\mathbf{X}$, a region graph (RG) is a bipartite and rooted graph whose nodes are either regions, denoting subsets $\mathcal{R}$ of $\mathbf{X}$, or partitions specifying how a region is partitioned into other regions.

Fig. 2a shows an example of a RG. Given a RG, we can build a smooth and decomposable tensorized circuit as follows. First, we parameterize regions $\mathcal{R}\subseteq\mathbf{X}$ that are not further partitioned with an input layer encoding some functions over variables in $\mathcal{R}$. Then, we parameterize each partitioning $\{\mathcal{R}_i\}_{i=1}^{N}$ with a product layer having as inputs one layer for each $\mathcal{R}_i$. Each product layer is then followed by a sum layer. Figs. 2a and 2b illustrate such a construction by color-coding regions and corresponding layers. As we will show in Sec. 3.2, this also provides us with a clean recipe to efficiently square a deep circuit. The literature on PCs provides several ways to build RGs (Peharz et al., 2020b; a; Mari et al., 2023). In our experiments (Sec. 5), we recursively partition sets of variables randomly until no further partitioning is possible (Peharz et al., 2020b). Moreover, we experiment with RGs that partition variables one by one (e.g., the one in Fig. 2a), as they are related to other classes of models (see Sec. 4). App. F further details how to construct these RGs.
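The two kinds of tree RGs mentioned above can be sketched as nested dictionaries. This is only an illustrative data structure: the function names, the dictionary layout and the choice of splitting a shuffled region into two roughly equal halves are assumptions, and the actual constructions are the ones detailed in App. F.

```python
import random

def random_binary_region_graph(variables, rng):
    """Recursively split a region into two random parts until singletons remain."""
    region = tuple(sorted(variables))
    if len(region) == 1:
        return {"region": region, "partition": None}
    shuffled = list(region)
    rng.shuffle(shuffled)
    mid = len(shuffled) // 2  # assumption: roughly balanced binary splits
    return {
        "region": region,
        "partition": [
            random_binary_region_graph(shuffled[:mid], rng),
            random_binary_region_graph(shuffled[mid:], rng),
        ],
    }

def linear_region_graph(variables):
    """Split off one variable at a time, following the given variable order."""
    region = tuple(variables)
    if len(region) == 1:
        return {"region": region, "partition": None}
    return {
        "region": region,
        "partition": [
            linear_region_graph(region[:1]),
            linear_region_graph(region[1:]),
        ],
    }

def leaves(node):
    """Collect the regions that are not further partitioned (input regions)."""
    if node["partition"] is None:
        return [node["region"]]
    return [leaf for child in node["partition"] for leaf in leaves(child)]
```

For instance, `leaves(linear_region_graph([0, 1, 2, 3]))` lists one singleton region per variable. Since every region has exactly one partition and the two children of a partition are disjoint, both constructions yield binary tree RGs.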

3.2 Squaring Deep Tensorized Circuits

(Squared negative) MMs as circuits. It is easy to see that traditional shallow MMs (Eq. 1) can be readily represented as tensorized smooth and decomposable PCs consisting of an input layer encoding $K$ components $p_i$ followed by a sum layer, which is parameterized by a non-negative row-vector $\mathbf{W}\in\mathbb{R}_{+}^{1\times K}$ whose entries sum up to one. Squared NMMs (Eq. 2) can be represented in a similar way, as they can be viewed as mixtures over an increased number of components (see Fig. 1 and Fig. A.1), where the sum layer is instead parameterized by a vector with real entries. Next, we discuss how to square deep tensorized circuits so as to retrieve our NPC$^2$s model class.

Squaring (and renormalizing) tensorized circuits. The challenge of squaring a tensorized non-monotonic circuit $c$ (potentially encoding a negative function) is guaranteeing $c^2$ to be representable as a smooth and decomposable PC with polynomial size, as these two properties are necessary conditions to being able to renormalize $c^2$ efficiently and in a single feed-forward pass (Choi et al., 2020). In general, even squaring a decomposable circuit while preserving decomposability of the squared circuit is a #P-hard problem (Shen et al., 2016; Vergari et al., 2021). Fortunately, it is possible to obtain a decomposable representation of $c^2$ efficiently for circuits $c$ that are structured-decomposable (Pipatsrisawat & Darwiche, 2008; Vergari et al., 2021). Intuitively, in a tensorized structured-decomposable circuit all product layers having the same scope $\mathbf{Y}\subseteq\mathbf{X}$ decompose $\mathbf{Y}$ over their input layers in the exact same way. We formalize this property in the Appendix in Def. A.3.

Tensorized circuits satisfying this property by design can be easily constructed by stacking layers conforming to a RG, as discussed before, and requiring that such a RG is a tree, i.e., in which there is a single way to partition each region, and whose input regions do not have overlapping scopes. E.g., the RG in Fig. 2a is a tree RG. From here on, w.l.o.g. we assume our tree RGs to be binary trees, i.e., they partition each region into two other regions only. Given a tensorized structured-decomposable circuit $c$ defined on such a tree RG, Alg. 1 efficiently constructs a smooth and decomposable tensorized circuit $c^2$. Moreover, let $L$ be the number of layers and $M$ the maximum time required to evaluate one layer in $c$; then the following proposition holds.

Proposition 1 (Tractable marginalization of squared circuits).

Let $c$ be a tensorized structured-decomposable circuit where the products of functions computed by each input layer can be tractably integrated. Any marginalization of $c^2$ obtained via Alg. 1 requires time and space $\mathcal{O}(L\cdot M^2)$.

See Sec. B.2 for a proof. In a nutshell, this is possible because Alg. 1 recursively squares each layer $\boldsymbol{\ell}$ in $c$ such that $\boldsymbol{\ell}^2=\boldsymbol{\ell}\otimes\boldsymbol{\ell}$ in $c^2$, where $\otimes$ denotes the Kronecker product of two vectors (see footnote 2). Our tensorized treatment of circuits allows for a much more compact version of the more general algorithm proposed in Vergari et al. (2021), which was defined in terms of squaring scalar computational units. At the same time, it lets us derive a tighter worst-case upper bound than the one usually reported for squaring structured-decomposable circuits (Pipatsrisawat & Darwiche, 2008; Choi et al., 2015; Vergari et al., 2021), which is the squared number of computations in the whole computational graph, or $\mathcal{O}(L^2\cdot M^2)$. Note that materializing $c^2$ is needed when we want to efficiently compute the normalization constant $Z$ of $c^2$ or marginalize any subset of variables. As such, when learning by MLE (Eq. 4) and by batched gradient descent, we need to evaluate $c^2$ only once per batch, thus greatly amortizing its cost. In App. C, we investigate the time and memory costs of learning NPC$^2$s of different sizes and on different data set dimensionalities.
Finally, tractable marginalization enables tractable sampling from the distribution modeled by NPC$^2$s, as we discuss in Sec. A.2.
[Footnote 2: In Alg. B.1 we provide a generalization of Alg. 1 to square Kronecker product layers (Peharz et al., 2020b).]
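The layer-wise squaring $\boldsymbol{\ell}^2=\boldsymbol{\ell}\otimes\boldsymbol{\ell}$ can be checked numerically on a small circuit with one sum layer over a Hadamard product of two input layers. The sketch below is an illustration under assumptions, not a reproduction of Alg. 1: it squares the sum layer with the mixed-product property of the Kronecker product, $(\mathbf{W}\boldsymbol{\ell})\otimes(\mathbf{W}\boldsymbol{\ell})=(\mathbf{W}\otimes\mathbf{W})(\boldsymbol{\ell}\otimes\boldsymbol{\ell})$, which is a standard linear algebra identity, and all tables and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
K, V = 3, 4  # hypothetical: K units per layer, V values per discrete variable

# Real-valued (possibly negative) input tables and sum-layer parameters.
F1 = rng.normal(size=(K, V))
F2 = rng.normal(size=(K, V))
W = rng.normal(size=(1, K))

def circuit(a, b):
    """c(a, b): sum layer applied to a Hadamard product of two input layers."""
    return (W @ (F1[:, a] * F2[:, b])).item()

def squared_circuit(a, b):
    """c^2(a, b), evaluated layer by layer with K^2-dimensional squared layers."""
    l1_sq = np.kron(F1[:, a], F1[:, a])  # squared input layer for X1
    l2_sq = np.kron(F2[:, b], F2[:, b])  # squared input layer for X2
    return (np.kron(W, W) @ (l1_sq * l2_sq)).item()

# Identity behind squaring a sum layer with S > 1 units (mixed-product property).
W2 = rng.normal(size=(2, K))
l = rng.normal(size=K)
lhs = np.kron(W2 @ l, W2 @ l)
rhs = np.kron(W2, W2) @ np.kron(l, l)

# Partition function of c^2 in one pass: sum each squared input layer over its variable.
G1 = (F1 @ F1.T).reshape(-1)  # sum over a of kron(F1[:, a], F1[:, a])
G2 = (F2 @ F2.T).reshape(-1)
Z_pass = (np.kron(W, W) @ (G1 * G2)).item()
Z_brute = sum(circuit(a, b) ** 2 for a in range(V) for b in range(V))
```

Each squared layer has $K^2$ units but is still a vector, so the squared circuit keeps the same layered structure as the original one and its partition function is obtained with a single forward pass.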

Algorithm 1 $\textsf{squareTensorizedCircuit}(\bm{\ell}, \mathcal{R})$

Input: A tensorized circuit having output layer $\bm{\ell}$ and defined on a tree RG rooted by $\mathcal{R}$.
Output: The tensorized squared circuit defined on the same tree RG having $\bm{\ell}^2$ as output layer computing $\bm{\ell} \otimes \bm{\ell}$.

1: if $\bm{\ell}$ is an input layer then
2:     $\bm{\ell}$ computes $K$ functions $f_i(\mathcal{R})$
3:     return An input layer $\bm{\ell}^2$ computing all $K^2$
4:          product combinations $f_i(\mathcal{R}) f_j(\mathcal{R})$
5: else if $\bm{\ell}$ is a product layer then
6:     $\{(\bm{\ell}_{\textsf{i}}, \mathcal{R}_{\textsf{i}}),\ (\bm{\ell}_{\textsf{ii}}, \mathcal{R}_{\textsf{ii}})\} \leftarrow \textsf{getInputs}(\bm{\ell}, \mathcal{R})$
7:     $\bm{\ell}_{\textsf{i}}^2 \leftarrow \textsf{squareTensorizedCircuit}(\bm{\ell}_{\textsf{i}}, \mathcal{R}_{\textsf{i}})$
8:     $\bm{\ell}_{\textsf{ii}}^2 \leftarrow \textsf{squareTensorizedCircuit}(\bm{\ell}_{\textsf{ii}}, \mathcal{R}_{\textsf{ii}})$
9:     return $\bm{\ell}_{\textsf{i}}^2 \odot \bm{\ell}_{\textsf{ii}}^2$
10: else    $\triangleright$ $\bm{\ell}$ is a sum layer
11:     $\{(\bm{\ell}_{\textsf{i}}, \mathcal{R})\} \leftarrow \textsf{getInputs}(\bm{\ell}, \mathcal{R})$
12:     $\bm{\ell}_{\textsf{i}}^2 \leftarrow \textsf{squareTensorizedCircuit}(\bm{\ell}_{\textsf{i}}, \mathcal{R})$
13:     $\mathbf{W} \in \mathbb{R}^{S \times K} \leftarrow \textsf{getParameters}(\bm{\ell})$
14:     $\mathbf{W}' \in \mathbb{R}^{S^2 \times K^2} \leftarrow \mathbf{W} \otimes \mathbf{W}$
15:     return $\mathbf{W}' \bm{\ell}_{\textsf{i}}^2$
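The recursion of Alg. 1 can be illustrated with a minimal, standard-library Python sketch. It is not the authors' implementation: the tuple-based layer encoding and the names `kron_mat`, `evaluate`, `square` and `bump` are hypothetical choices made for this illustration, and the squared input layer is represented by plain closures rather than by functions whose products can be integrated in closed form.

```python
import math

def kron_mat(A, B):
    """Kronecker product of two matrices given as lists of rows."""
    return [[a * b for a in row_a for b in row_b]
            for row_a in A for row_b in B]

def evaluate(layer, x):
    """Evaluate a layer on an assignment x (dict: variable name -> value)."""
    kind = layer[0]
    if kind == "input":
        _, var, funcs = layer
        return [f(x[var]) for f in funcs]
    if kind == "product":  # Hadamard product of the two input layers
        _, left, right = layer
        return [a * b for a, b in zip(evaluate(left, x), evaluate(right, x))]
    _, W, child = layer  # sum layer: matrix-vector product W @ child(x)
    h = evaluate(child, x)
    return [sum(w * v for w, v in zip(row, h)) for row in W]

def square(layer):
    """Return a layer computing layer (x) layer, following the cases of Alg. 1."""
    kind = layer[0]
    if kind == "input":
        _, var, funcs = layer
        # All K^2 products f_i * f_j, ordered as in the Kronecker product.
        prods = [lambda v, fi=fi, fj=fj: fi(v) * fj(v)
                 for fi in funcs for fj in funcs]
        return ("input", var, prods)
    if kind == "product":
        _, left, right = layer
        return ("product", square(left), square(right))
    _, W, child = layer
    return ("sum", kron_mat(W, W), square(child))

def bump(mu):
    """Hypothetical unnormalized input function centred at mu."""
    return lambda v: math.exp(-(v - mu) ** 2)

# A small non-monotonic circuit over two variables with K = 2 units per layer.
in1 = ("input", "x1", [bump(-1.0), bump(1.0)])
in2 = ("input", "x2", [bump(0.0), bump(2.0)])
c = ("sum", [[1.0, -0.7]],
     ("product",
      ("sum", [[0.5, -1.2], [0.3, 0.9]], in1),
      ("sum", [[-0.4, 1.1], [0.8, 0.2]], in2)))
c2 = square(c)

x = {"x1": 0.3, "x2": 1.4}
value = evaluate(c, x)[0]
squared_value = evaluate(c2, x)[0]
```

In this sketch the squared circuit keeps the tree structure of `c`, each layer grows from `K` to `K * K` outputs, and `squared_value` matches `value ** 2` up to floating-point error.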

3.3 Numerically Stable Inference and Learning

Renormalizing deep PCs can easily lead to underflows and/or overflows. In monotonic PCs, this is usually addressed by performing computations in log-space and utilizing the log-sum-exp trick (Blanchard et al., 2021). However, this is not applicable to non-monotonic PCs, as intermediate layers can compute negative values. Therefore, we instead evaluate circuits by propagating, for the outputs of each layer, both the logarithm of their absolute values and their signs. Sum layers are then evaluated with a sign-aware version of the log-sum-exp trick. A similar idea has already been applied to evaluate expectations of negative functions with monotonic PCs (Mauá et al., 2018; Correia & de Campos, 2019). App. D extends it to tensorized non-monotonic circuits.
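As a rough illustration of the idea, the following Python sketch evaluates a single sum unit from the log-absolute values and signs of its inputs. The function name `signed_logsumexp` and its interface are assumptions made for this example; the tensorized formulation in App. D may differ in its details.

```python
import math

def signed_logsumexp(log_abs, signs, weights):
    """Return (log|y|, sign(y)) for y = sum_i weights[i] * signs[i] * exp(log_abs[i]).

    The largest log-magnitude is factored out before exponentiating, so the
    sum is computed on rescaled terms and does not underflow or overflow.
    """
    terms = [(la, s * w) for la, s, w in zip(log_abs, signs, weights)
             if w != 0.0 and la != -math.inf]
    if not terms:
        return (-math.inf, 0.0)
    m = max(la for la, _ in terms)
    total = sum(coef * math.exp(la - m) for la, coef in terms)
    if total == 0.0:  # exact cancellation between positive and negative terms
        return (-math.inf, 0.0)
    return (m + math.log(abs(total)), math.copysign(1.0, total))

# Inputs whose magnitudes underflow in linear space: exp(-1000) == 0.0 in floats.
log_y, sign_y = signed_logsumexp([-1000.0, -1001.0], [1.0, -1.0], [1.0, 2.0])
log_z, sign_z = signed_logsumexp([-1000.0, -1001.0], [1.0, -1.0], [1.0, 3.0])
```

Here `log_y` is the logarithm of `exp(-1000) * (1 - 2/e)` with a positive sign, while the second call yields a negative sign, a case that the standard log-sum-exp trick cannot represent.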

4 Expressiveness of NPC2s and Relationship to Other Models

Circuits have been used as the “lingua franca” to represent apparently different tractable model representations (Darwiche & Marquis, 2002; Shpilka & Yehudayoff, 2010), and to investigate their ability to exactly represent certain function families with only a polynomial increase in model size – also called the expressive efficiency (Martens & Medabalimi, 2014), or succinctness (de Colnet & Mengel, 2021) of a model class. This is because the size of circuits directly translates to the computational complexity of performing inference. As we extend the language of monotonic PCs to include negative parameters, here we provide polytime reductions from tractable probabilistic model classes emerging from different application fields that can encode subtractions, to (deep) non-monotonic PCs. By doing so, we not only shed light on why they are tractable, by explicitly stating their structural properties as circuits, but also on why they can be more expressive than classical additive MMs, as we prove that NPC2s can be exponentially more compact in Sec. 4.1.

Simple shallow NMMs have been investigated for a limited set of component families, as discussed in Sec. 1. Notably, this can also be done by directly learning to approximate the square root of a density function, as done in signal processing with wavelet functions as components (Daubechies, 1992; Pinheiro & Vidakovic, 1997) or RBF kernels, i.e., unnormalized Gaussians centered over data points (Schölkopf & Smola, 2001), as in Hong & Gao (2021). As discussed in Sec. 3, we can readily represent these NMMs as simple NPC2s where kernel functions are computed by input layers.

Positive semi-definite (PSD) models (Rudi & Ciliberto, 2021; Marteau-Ferey et al., 2020) are a recent class of models from the kernel and optimization literature. Given a kernel function $\kappa$ (e.g., an RBF kernel as in Rudi & Ciliberto (2021)) and a set of $d$ data points $\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(d)}$ with $\bm{\kappa}(\mathbf{x}) = [\kappa(\mathbf{x}, \mathbf{x}^{(1)}), \ldots, \kappa(\mathbf{x}, \mathbf{x}^{(d)})]^\top \in \mathbb{R}^d$, and a real $d \times d$ PSD matrix $\mathbf{A}$, they define an unnormalized distribution as the non-negative function $f(\mathbf{x}; \mathbf{A}, \bm{\kappa}) = \bm{\kappa}(\mathbf{x})^\top \mathbf{A} \bm{\kappa}(\mathbf{x})$. Although apparently different, they can be translated to NPC2s in polynomial time.

Proposition 2 (Reduction from PSD models).

A PSD model with kernel function $\kappa$, defined over $d$ data points, and parameterized by a PSD matrix $\mathbf{A}$, can be represented as a mixture of squared NMMs (hence NPC2s) in time $\mathcal{O}(d^3)$.

We prove this in Sec. B.3. Note that while PSD models are shallow non-monotonic PCs, we can stack them into deeper NPC2s that support tractable marginalization via structured-decomposability.
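To give some intuition for why such a reduction exists, the sketch below rewrites a small PSD model as a sum of squared subtractive mixtures. It is an illustration under assumptions and not necessarily the construction of Sec. B.3: it uses a Cholesky factorization $\mathbf{A} = \mathbf{L}\mathbf{L}^\top$, which also costs $\mathcal{O}(d^3)$ but requires $\mathbf{A}$ to be strictly positive definite, and the one-dimensional data points, the bandwidth and all function names are hypothetical.

```python
import math

def cholesky(A):
    """Lower-triangular L with A = L L^T; assumes A is symmetric positive definite."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - s)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L

def rbf(x, centre, bandwidth=1.0):
    """Unnormalized Gaussian (RBF) kernel on scalars; bandwidth is hypothetical."""
    return math.exp(-(x - centre) ** 2 / (2.0 * bandwidth ** 2))

points = [-1.0, 0.0, 1.5]                  # d = 3 hypothetical data points
A = [[2.0, -1.0, 0.0],
     [-1.0, 2.0, -1.0],
     [0.0, -1.0, 2.0]]                     # positive definite, negative entries
L = cholesky(A)

def psd_model(x):
    """f(x) = kappa(x)^T A kappa(x)."""
    k = [rbf(x, p) for p in points]
    return sum(k[i] * A[i][j] * k[j] for i in range(3) for j in range(3))

def sum_of_squared_nmms(x):
    """sum_j (sum_i L[i][j] * kappa_i(x))^2: one squared NMM per column of L."""
    k = [rbf(x, p) for p in points]
    return sum(sum(L[i][j] * k[i] for i in range(3)) ** 2 for j in range(3))

xs = [-2.0, -0.3, 0.0, 0.7, 2.5]
pairs = [(psd_model(x), sum_of_squared_nmms(x)) for x in xs]
```

Each column of `L` supplies the (possibly negative) weights of one squared NMM over the kernel functions, and the two evaluations in `pairs` agree up to floating-point error.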

Tensor networks and the Born rule. Squaring a possibly negative function to retrieve an unnormalized distribution is related to the Born rule in quantum mechanics (Dirac, 1930), used to characterize the distribution of particles by squaring their wave function (Schollwoeck, 2010; Orús, 2013). These functions can be represented as a large $D$-dimensional tensor $\mathcal{T}$ over discrete variables $\mathbf{X} = \{X_1, \ldots, X_D\}$ taking values in $\{1, \ldots, m\}$, compactly factorized in a tensor network (TN) such as a matrix-product state (MPS) (Pérez-García et al., 2007), also called tensor-train. Given an assignment $\mathbf{x} = \langle x_1, \ldots, x_D \rangle$ to $\mathbf{X}$, a rank $r$ MPS compactly represents $\mathcal{T}$ as

$$\mathcal{T}[x_1, \ldots, x_D] = \sum\nolimits_{i_1=1}^{r} \sum\nolimits_{i_2=1}^{r} \cdots \sum\nolimits_{i_{D-1}=1}^{r} \mathbf{A}_1[x_1, i_1]\, \mathbf{A}_2[x_2, i_1, i_2] \cdots \mathbf{A}_D[x_D, i_{D-1}], \qquad (5)$$

where $\mathbf{A}_1, \mathbf{A}_D \in \mathbb{R}^{m \times r}$, $\mathbf{A}_j \in \mathbb{R}^{m \times r \times r}$ with $1 < j < D$, for indices $\{i_1, \ldots, i_{D-1}\}$, and denoting indexing with square brackets. To encode a distribution $p(\mathbf{X})$, one can reparameterize tensors $\mathbf{A}_j$ to be non-negative (Glasser et al., 2019) or apply the Born rule and square $\mathcal{T}$ to model $p(\mathbf{x}) \propto (\mathcal{T}[x_1, \ldots, x_D])^2$. Such a TN is called a Born machine (BM) (Glasser et al., 2019). Besides modeling complex quantum states, TNs such as BMs have also been explored as classical ML models to learn discrete distributions (Stoudenmire & Schwab, 2016; Han et al., 2018; Glasser et al., 2019; Cheng et al., 2019), in quantum ML (Liu & Wang, 2018; Huggins et al., 2018), and more recently extended to continuous domains by introducing sets of basis functions, called TTDE (Novikov et al., 2021). Next, we show they are a special case of NPC2s.
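Eq. 5 and the Born rule can be made concrete with a tiny example. The sketch below is only illustrative: the core values are arbitrary, states are 0-based rather than ranging over $\{1, \ldots, m\}$, the names `mps_value` and `born_probabilities` are hypothetical, and the normalization constant is obtained by brute-force enumeration, which is feasible only for toy sizes and is not the tractable procedure discussed in this paper.

```python
import itertools

def mps_value(cores, x):
    """Evaluate T[x] of Eq. 5 by contracting the cores from left to right."""
    first, *middle, last = cores
    v = list(first[x[0]])                    # vector indexed by i_1
    for core, xj in zip(middle, x[1:-1]):
        M = core[xj]                         # r x r matrix indexed by (i_{j-1}, i_j)
        v = [sum(v[a] * M[a][b] for a in range(len(v)))
             for b in range(len(M[0]))]
    return sum(a * b for a, b in zip(v, last[x[-1]]))

def born_probabilities(cores, num_states):
    """p(x) proportional to T[x]^2, normalized by enumerating all assignments."""
    assignments = itertools.product(range(num_states), repeat=len(cores))
    squared = {x: mps_value(cores, x) ** 2 for x in assignments}
    Z = sum(squared.values())
    return {x: s / Z for x, s in squared.items()}

# D = 3 variables, m = 2 states, rank r = 2, with real (possibly negative) entries.
A1 = [[1.0, -0.5], [0.3, 0.8]]                       # m x r
A2 = [[[0.2, -1.0], [0.7, 0.4]],
      [[-0.6, 0.1], [0.5, -0.9]]]                    # m x r x r
A3 = [[0.9, -0.3], [-0.4, 1.2]]                      # m x r
cores = [A1, A2, A3]

negative_entry = mps_value(cores, (0, 0, 1))         # the tensor entry is negative
probs = born_probabilities(cores, num_states=2)      # but its square is a valid mass
```

The left-to-right contraction touches each core once per evaluation, whereas the nested sums of Eq. 5 written out naively would enumerate all $r^{D-1}$ index combinations.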

Proposition 3 (Reduction from BMs).

A BM encoding a $D$-dimensional tensor with $m$ states by squaring a rank $r$ MPS can be exactly represented as a structured-decomposable NPC2 in $\mathcal{O}(D \cdot k^4)$ time and space, with $k \leq \min\{r^2, mr\}$.

We prove this in Sec. B.4 by showing an equivalent NPC2 defined on a linear tree RG (e.g., the one in Fig. 2). This connection highlights how tractable marginalization in BMs is possible thanks to structured-decomposability (1), a condition that to the best of our knowledge was not previously studied for TNs. Furthermore, by casting BMs as NPC2s we can now design more flexible tree RGs, e.g., randomized tree structures (Peharz et al., 2020b; Di Mauro et al., 2017; Di Mauro et al., 2021) or densely tensorized structures heavily exploiting GPU parallelization (Peharz et al., 2020a; Mari et al., 2023), or heuristically learn them from data (Liu & Van den Broeck, 2021).

4.1 Exponential separation of NPC2s and Structured Monotonic PCs

Squaring via Alg. 1 can already make a tensorized (monotonic) PC more expressive, but only by a polynomial factor, as we quadratically increase the size of each layer, while keeping the same number of learnable parameters (similarly to the increased number of components of squared NMMs (Sec. 2)). On the other hand, allowing negative parameters can provide an exponential advantage, as proven for certain circuits (Valiant, 1979), but understanding if this advantage carries over to our squared circuits is not immediate. In fact, we observe there cannot be any expressiveness advantage in squaring certain classes of non-monotonic structured-decomposable circuits. These are the circuits that support tractable maximum-a-posteriori inference (Choi et al., 2020) and satisfy an additional property known as determinism (see Darwiche (2001), Def. A.5). Squaring these circuits outputs a PC of the same size and that is monotonic, as formalized next and proven in Sec. B.6.

Proposition 4 (Squaring deterministic circuits).

Let $c$ be a smooth, decomposable and deterministic circuit, possibly computing a negative function. Then, the squared circuit $c^2$ is monotonic and has the same structure (and hence size) as $c$.
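The intuition can be seen on a single deterministic sum unit; the complete argument is the one in Sec. B.6. As a sketch, assume a sum unit $c(\mathbf{x}) = \sum_{i=1}^{K} w_i c_i(\mathbf{x})$ with real weights $w_i$ (notation introduced here for illustration only) whose inputs $c_i$ have pairwise disjoint supports, which is what determinism requires. All cross terms of the square then vanish:

```latex
\begin{align*}
c(\mathbf{x})^2
  &= \Big(\sum_{i=1}^{K} w_i\, c_i(\mathbf{x})\Big)^2
   = \sum_{i=1}^{K} w_i^2\, c_i(\mathbf{x})^2
     + \sum_{i \neq j} w_i w_j\, c_i(\mathbf{x})\, c_j(\mathbf{x}) \\
  &= \sum_{i=1}^{K} w_i^2\, c_i(\mathbf{x})^2
   \qquad \text{since } c_i(\mathbf{x})\, c_j(\mathbf{x}) = 0 \text{ for } i \neq j .
\end{align*}
```

The squared unit therefore has the same $K$ inputs, now squared, and non-negative weights $w_i^2$, so squaring neither enlarges the unit nor leaves room for subtractions.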

The NPC2s we considered so far, as constructed in Sec. 3, are not deterministic. Here we prove that some non-negative functions (hence probability distributions up to renormalization) can be computed by NPC2s that are exponentially smaller than any structured-decomposable monotonic PC.

Theorem 1 (Expressive efficiency of NPC2s).

There is a class of non-negative functions $\mathcal{F}$ over variables $\mathbf{X}$ that can be compactly represented as a shallow squared NMM (hence NPC2s), but for which the smallest structured-decomposable monotonic PC computing any $F \in \mathcal{F}$ has size $2^{\Omega(|\mathbf{X}|)}$.

We prove this in Sec. B.5 by showing a non-trivial lower bound on the size of structured-decomposable monotonic PCs for a variant of the unique disjointness problem (Fiorini et al., 2015). Intuitively, this tells us that, given a fixed number of parameters, NPC2s can potentially be much more expressive than structured-decomposable monotonic PCs (and hence shallow additive MMs). We conjecture that an analogous lower bound can be devised for decomposable monotonic PCs. Furthermore, as this result directly extends to PSD and BM models (Sec. 4), it opens up interesting theoretical connections in the research fields of kernel-based and tensor network models.
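For intuition about the kind of function involved, the sketch below evaluates the classical unique disjointness function $(1 - \sum_i x_i y_i)^2$ over binary vectors, written as a squared subtractive mixture with $n + 1$ fully factorized components. This is an assumption-laden illustration: the proof in Sec. B.5 uses a variant of this problem whose exact form is not restated here, and the name `udisj` is hypothetical.

```python
import itertools

def udisj(x, y):
    """(1 - sum_i x_i * y_i)^2 for binary tuples x and y of equal length.

    Written as a squared mixture: a constant component with weight +1 and
    one component x_i * y_i per index i, each with weight -1.
    """
    components = [1.0] + [float(xi * yi) for xi, yi in zip(x, y)]
    weights = [1.0] + [-1.0] * len(x)
    mixture = sum(w * c for w, c in zip(weights, components))
    return mixture ** 2

n = 3
binary_vectors = list(itertools.product([0, 1], repeat=n))
table = {(x, y): udisj(x, y) for x in binary_vectors for y in binary_vectors}
```

The mixture has only $n + 1$ components and relies on negative weights; the value is 1 for disjoint supports, 0 when the supports share exactly one index, and non-negative everywhere because of the square.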

[Figure 3: image panels omitted. Rows: continuous data (spline input layers) and discrete data (categorical and Binomial input layers). Panels: (a) GT, (b) MPC, (c) MPC2, (d) NPC2, (e) MPC, (f) MPC2, (g) NPC2.]
Figure 3: NPC2s are better estimators, especially with parameter-efficient input layers. Distributions estimated by monotonic PCs (MPC), squared monotonic PCs (MPC2) and NPC2s on 2D continuous (above) and discrete (below) data. On continuous data, input layers compute splines (Eq. 11), while on discrete data they compute either categoricals (for MPC and MPC2), embeddings (for NPC2s) or Binomials. Secs. H.1 and H.2 also show log-likelihoods on additional data.

5 Experiments

We aim to answer the following questions: (A) are NPC2s better distribution estimators than monotonic PCs? (B) how do the increased model size given by squaring and the presence of negative parameters independently influence the expressiveness of NPC2s? (C) how does the choice of input layers and the RG affect the performance of NPC2s? We perform several distribution estimation experiments on both synthetic and real-world data, and label the following paragraphs with letters denoting relevance to the above questions. Moreover, note that our comparisons between NPC2s and monotonic PCs are based on models having the same number of learnable parameters.

(A, B) Synthetic continuous data. Following Wenliang et al. (2019), we evaluate monotonic PCs and NPC2s on 2D density estimation tasks, as this allows us to gain insight into the learned density functions. To disentangle the effect of squaring from that of negative parameters, we also experiment with squared monotonic PCs. We build circuit structures from a trivial tree RG (see Sec. H.1 for details). We experiment with splines as input layers for NPC2s, and enforce their non-negativity for monotonic PCs (see App. E). Fig. 3 shows that, while squaring benefits monotonic PCs, negative parameters in NPC2s are needed to better capture complex target densities.

[Figure 4 (left): scatter plots omitted.]

Model         Power     Gas   Hepmass   M.BooNE   BSDS300
MADE          -3.08    3.56    -20.98    -15.59    148.85
RealNVP        0.17    8.33    -18.71    -13.84    153.28
MAF            0.24   10.08    -17.73    -12.24    154.93
NSF            0.66   13.09    -14.01     -9.22    157.31
Gaussian      -7.74   -3.58    -27.93    -37.24     96.67
EiNet-LRS      0.36    4.79    -22.46    -34.21         -
TTDE           0.46    8.93    -21.34    -28.77    143.30
MPC (LT)       0.51    6.73    -22.06    -32.47    116.90
NPC2 (LT)      0.53    9.00    -20.66    -26.68    118.58
MPC (BT)       0.57    5.56    -22.45    -32.11    123.30
NPC2 (BT)      0.62   10.98    -20.41    -26.92    128.38
Figure 4: NPC2s can be more expressive than monotonic PCs (MPCs). Best average log-likelihoods achieved by monotonic PCs ($+$) and NPC2s ($\pm^2$), built either from randomized linear tree (LT) or binary tree (BT) RGs (see Sec. H.3). The scatter plots (left) pair log-likelihoods based on the number of units per layer $K$ (the higher the darker), differentiating PCs with Gaussian (G/blue) and spline (S/red) input layers. Both axes of each scatter plot are on the same scale, thus the results above the diagonal are of NPC2s achieving higher log-likelihoods than MPCs at parity of model size. The table (right) shows our models’ best average test log-likelihoods and puts them in context with intractable (above) and tractable (below) models w.r.t. variable marginalization.

(C) Synthetic discrete data. We estimate the probability mass of the previous 2D data sets, now finitely-discretized (see Sec. H.2), to better understand when negative parameters might bring little to no advantage if input layers are already expressive enough. First, we experiment with (squared) monotonic PCs (resp. NPC2s) having input layers computing categoricals (resp. real-valued embeddings). Second, we employ the less flexible but more parameter-efficient Binomials instead. Sec. H.2 reports the hyperparameters. Fig. 3 shows that, while there is no clear advantage for NPC2s equipped with embeddings instead of MPC2 with categoricals, in case of Binomials they can better capture the target distribution. This is because categoricals (and embeddings) already have enough parameters to capture “holes” in the probability mass function. However, Binomials introduce a strong inductive bias that might hinder learning. We believe this is the reason why, according to some preliminary results, we did not observe an improvement of NPC2s with respect to monotonic PCs on estimating image distributions.

(A, B, C) Multi-variate continuous data. Following Papamakarios et al. (2017), we evaluate deeper PCs for density estimation on five multivariate data sets (see Table H.1). We evaluate monotonic PCs and NPC2s in tensorized form built out of randomized linear tree RGs. That is, for some variable permutation, we construct a tree RG where each partition splits a region into a set of only one variable and recursively factorizes the rest. By doing so, we recover an architecture similar to a BM or TTDE (see Sec. 4). Following Peharz et al. (2020b), we also experiment with binary tree RGs whose regions are randomly split in half. Sec. H.3 details these RGs, as well as the hyperparameters used. We compare against: a full covariance Gaussian, normalizing flows RealNVP (Dinh et al., 2017), MADE (Germain et al., 2015), MAF (Papamakarios et al., 2017) and NSF (Durkan et al., 2019), a monotonic PC with input layers encoding flows (EiNet-LRS) (Sidheekh et al., 2023), and TTDE (Novikov et al., 2021). Fig. 4 shows that NPC2s with Gaussian input layers generally achieve higher log-likelihoods than monotonic PCs on four data sets. Fig. H.3 shows similar results when comparing to squared monotonic PCs, thus providing evidence that negative parameters, in addition to squaring, contribute to the expressiveness of NPC2s. Binary tree RGs generally deliver better likelihoods than linear tree ones, especially on Gas, where NPC2s using them outperform TTDE.
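The two families of region graphs described above can be sketched as follows. This is a guess at the split rules based only on the description in this paragraph; the RGs actually used are detailed in Sec. H.3, and the function names and the triple-based encoding of partitions are hypothetical.

```python
import random

def linear_tree_rg(variables, rng):
    """Partitions (region, single variable, rest) of a randomized linear tree RG."""
    order = list(variables)
    rng.shuffle(order)                       # random variable permutation
    partitions = []
    for i in range(len(order) - 1):
        region = frozenset(order[i:])
        single = frozenset(order[i:i + 1])   # split off exactly one variable
        rest = frozenset(order[i + 1:])      # factorized recursively
        partitions.append((region, single, rest))
    return partitions

def binary_tree_rg(variables, rng):
    """Partitions (region, left, right) obtained by random splits in half."""
    region = list(variables)
    if len(region) <= 1:
        return []                            # single variables are leaves
    rng.shuffle(region)
    half = len(region) // 2
    left, right = region[:half], region[half:]
    return ([(frozenset(region), frozenset(left), frozenset(right))]
            + binary_tree_rg(left, rng)
            + binary_tree_rg(right, rng))

rng = random.Random(0)                       # fixed seed for reproducibility
lt_partitions = linear_tree_rg(range(6), rng)
bt_partitions = binary_tree_rg(range(6), rng)
```

Both constructions yield $D - 1$ partitions over $D$ variables; the linear tree has depth $D - 1$, while the balanced binary tree has depth logarithmic in $D$.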

(A) Distilling intractable models. Monotonic PCs have been used to approximate intractable models such as LLMs and perform exact inference in the presence of logical constraints, such as for constrained text generation (Zhang et al., 2023). As generation performance is correlated with how well the LLM is approximated by a tractable model, we are interested in whether NPC2s are a better distillation target than monotonic PCs for an LLM such as GPT2. Following Zhang et al. (2023), we minimize the KL divergence between GPT2 and our PCs on a data set of sampled sentences (details in Sec. H.4). Since sentences are sequences of token variables, the architecture of tensorized circuits is built from a linear tree RG, thus corresponding to an inhomogeneous HMM in the case of monotonic PCs (see Sec. B.4.1) while resembling a BM for NPC2s. Fig. 5 shows that NPC2s can distill GPT2 and scale better than monotonic PCs, as they achieve log-likelihoods closer to the ones computed by GPT2. We observe that NPC2s fit the training data much better than the test data, even though the results on test data are generally significant (see Table H.7). While this is further evidence of their increased expressiveness, regularizing NPC2s deserves future investigation.
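The link between this distillation objective and the log-likelihoods reported for it can be sketched as follows, under the assumption (not stated explicitly here) that the divergence is taken from GPT2, denoted $p$, to the circuit distribution $q_\theta$, and that it is estimated on $N$ sentences $\mathbf{x}^{(n)}$ sampled from $p$. The symbols $q_\theta$ and $N$ are introduced only for this illustration.

```latex
\begin{align*}
\mathrm{KL}(p \,\|\, q_\theta)
  &= \mathbb{E}_{\mathbf{x} \sim p}\big[\log p(\mathbf{x})\big]
   - \mathbb{E}_{\mathbf{x} \sim p}\big[\log q_\theta(\mathbf{x})\big] \\
  &\approx \text{const} - \frac{1}{N} \sum_{n=1}^{N} \log q_\theta(\mathbf{x}^{(n)}),
  \qquad \mathbf{x}^{(n)} \sim p, \\
\log q_\theta(\mathbf{x}) &= \log c(\mathbf{x})^2 - \log Z
  \qquad \text{for an NPC2 with normalization constant } Z .
\end{align*}
```

Under this assumption the first term does not depend on $\theta$, so minimizing the estimated divergence amounts to maximizing the average log-likelihood of the sampled sentences.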

Figure 5: NPC2s ($\pm^2$) achieve higher log-likelihoods than monotonic PCs ($+$) on data sampled by GPT2. We report the median and the area including 90% of runs by varying the size of layers $K$ and other hyperparameters (Sec. H.4). For comparison, the log-likelihood of GPT2 on the same training data is about $-52$. The difference on the test data is significant for most values of $K$ (see p-values in Table H.7).

6 Discussion & Conclusion

With this work, we hope to popularize subtractive MMs via squaring as a simple and effective model class in the toolkit of tractable probabilistic modeling and reasoning that can rival traditional additive MMs. By casting them in the framework of circuits, we presented how to effectively represent and learn deep subtractive MMs such as NPC2s (Sec. 3) while showing how they can generalize other model classes such as PSD and tensor network models (Sec. 4). Our main theoretical result (Sec. 4.1) applies also to these models and justifies the increased performance we found in practice (Sec. 5). This work is the first to rigorously address representing and learning non-monotonic PCs in a general way, and opens up a number of future research directions. The first one is to retrieve a latent variable interpretation for NPC2s, as negative parameters in a non-monotonic PC invalidate the probabilistic interpretation of its sub-circuits (Peharz et al., 2017), making it not possible to learn its structure and parameters in classical ways (see App. G). Better ways to learn NPC2s, in turn, can benefit all applications in which PCs are widely used – from causal discovery (Wang et al., 2022) to variational inference (Shih & Ermon, 2020) and neuro-symbolic AI (Ahmed et al., 2022) – by making more compact and expressive distributions accessible. Finally, by formally connecting circuits with tensor networks, we hope to inspire works that carry over the advancements of one community to the other, such as better learning schemes (Stoudenmire & Schwab, 2016; Novikov et al., 2021), and more flexible ways to factorize high-dimensional tensors (Mari et al., 2023).

Reproducibility Statement

In App. H we include all the details about the experiments we showed in Sec. 5. The source code, documentation, data sets and scripts needed to reproduce the results and figures, are available at https://github.com/april-tools/squared-npcs.

Acknowledgments

AV was supported by the “UNREAL: Unified Reasoning Layer for Trustworthy ML” project (EP/Y023838/1) selected by the ERC and funded by UKRI EPSRC. NG acknowledges the support by the European Union (ERC consolidator, eLinoR, no 101085607). AMS acknowledges funding from the Helsinki Institute for Information Technology. MT acknowledges funding from the Research Council of Finland (grant number 347279). AS acknowledges funding from the Research Council of Finland (grant number 339730). We acknowledge CSC – IT Center for Science, Finland, for awarding this project access to the LUMI supercomputer, owned by the EuroHPC Joint Undertaking, hosted by CSC (Finland) and the LUMI consortium through CSC. We acknowledge the computational resources provided by the Aalto Science-IT project.

References

  • Ahmed et al. (2022) Kareem Ahmed, Stefano Teso, Kai-Wei Chang, Guy Van den Broeck, and Antonio Vergari. Semantic probabilistic layers for neuro-symbolic learning. In Advances in Neural Information Processing Systems 35 (NeurIPS), volume 35, pp.  29944–29959. Curran Associates, Inc., 2022.
  • Bailly (2011) Raphaël Bailly. Quadratic weighted automata: Spectral algorithm and likelihood maximization. In Asian Conference on Machine Learning, 2011.
  • Baldi et al. (2016) Pierre Baldi, Kyle Cranmer, Taylor Faucett, Peter Sadowski, and Daniel Whiteson. Parameterized neural networks for high-energy physics. European Physical Journal C, 76(5):235, 2016. doi: 10.1140/epjc/s10052-016-4099-4.
  • Blanchard et al. (2021) Pierre Blanchard, Desmond J. Higham, and Nicholas J. Higham. Accurately computing the log-sum-exp and softmax functions. Institute of Mathematics and its Applications Journal of Numerical Analysis (IMAJNA), 41(4):2311–2330, 2021.
  • Broadrick et al. (2024) Oliver Broadrick, Honghua Zhang, and Guy Van den Broeck. Polynomial semantics of tractable probabilistic circuits. arXiv preprint arXiv:2402.09085, 2024.
  • Cheng et al. (2019) Song Cheng, Lei Wang, Tao Xiang, and Pan Zhang. Tree tensor networks for generative modeling. Physical Review B, 99(15):155131, 2019.
  • Choi et al. (2015) Arthur Choi, Guy Van den Broeck, Adnan Darwiche, Qiang Yang, and Michael Wooldridge. Tractable learning for structured probability spaces: A case study in learning preference distributions. In 24th International Joint Conference on Artificial Intelligence (IJCAI), volume 2015, pp.  2861–2868. IJCAI, 2015.
  • Choi et al. (2011) Myung Jin Choi, Vincent Y. F. Tan, Animashree Anandkumar, and Alan S. Willsky. Learning latent tree graphical models. Journal of Machine Learning Research, 12:1771–1812, 2011.
  • Choi et al. (2020) YooJung Choi, Antonio Vergari, and Guy Van den Broeck. Probabilistic circuits: A unifying framework for tractable probabilistic modeling. 2020.
  • Correia & de Campos (2019) Alvaro H. C. Correia and Cassio P. de Campos. Towards scalable and robust sum-product networks. In Scalable Uncertainty Management, 2019.
  • Dang et al. (2021) Meihua Dang, Antonio Vergari, and Guy Van den Broeck. Strudel: A fast and accurate learner of structured-decomposable probabilistic circuits. The International Journal of Approximate Reasoning (IJAR), 140:92–115, 2021.
  • Darwiche (2001) Adnan Darwiche. Decomposable negation normal form. Journal of the ACM (JACM), 48:608–647, 2001.
  • Darwiche (2009) Adnan Darwiche. Modeling and Reasoning with Bayesian Networks. Cambridge University Press, 2009.
  • Darwiche & Marquis (2002) Adnan Darwiche and Pierre Marquis. A knowledge compilation map. Journal of Artificial Intelligence Research (JAIR), 17:229–264, 2002.
  • Daubechies (1992) Ingrid Daubechies. Ten lectures on wavelets. Computers in Physics, 6:697–697, 1992.
  • de Boor (1971) Carl de Boor. Subroutine package for calculating with B-splines. Technical report, Los Alamos National Lab. (LANL), 1971.
  • de Colnet & Mengel (2021) Alexis de Colnet and Stefan Mengel. A compilation of succinctness results for arithmetic circuits. In 18th International Conference on Principles of Knowledge Representation and Reasoning (KR), pp.  205–215, 2021.
  • De Wolf (2003) Ronald De Wolf. Nondeterministic quantum query and communication complexities. SIAM Journal on Computing, 32(3):681–699, 2003.
  • Dennis (2016) Aaron W. Dennis. Algorithms for Learning the Structure of Monotone and Nonmonotone Sum-Product Networks. PhD thesis, Brigham Young University, 2016.
  • Dennis & Ventura (2012) Aaron W. Dennis and Dan Ventura. Learning the architecture of sum-product networks using clustering on variables. In Advances in Neural Information Processing Systems 25 (NeurIPS), pp.  2033–2041. Curran Associates, Inc., 2012.
  • Di Mauro et al. (2017) Nicola Di Mauro, Antonio Vergari, Teresa M. A. Basile, and Floriana Esposito. Fast and accurate density estimation with extremely randomized cutset networks. In Machine Learning and Knowledge Discovery in Databases: ECML PKDD, pp.  203–219. Springer, 2017.
  • Di Mauro et al. (2021) Nicola Di Mauro, Gennaro Gala, Marco Iannotta, and Teresa Maria Altomare Basile. Random probabilistic circuits. In 37th Conference on Uncertainty in Artificial Intelligence (UAI), volume 161, pp.  1682–1691. PMLR, 2021.
  • Dinh et al. (2017) Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using Real NVP. In 5th International Conference on Learning Representations (ICLR), 2017.
  • Dirac (1930) Paul Adrien Maurice Dirac. The Principles of Quantum Mechanics. Clarendon Press, Oxford, 1930.
  • Dua & Graff (2017) Dheeru Dua and Casey Graff. UCI Machine Learning Repository, 2017.
  • Durkan et al. (2019) Conor Durkan, Artur Bekasov, Iain Murray, and George Papamakarios. Neural spline flows. In Advances in Neural Information Processing Systems 32 (NeurIPS), pp.  7511–7522. Curran Associates, Inc., 2019.
  • Fiorini et al. (2015) Samuel Fiorini, Serge Massar, Sebastian Pokutta, Hans Raj Tiwary, and Ronald De Wolf. Exponential lower bounds for polytopes in combinatorial optimization. Journal of the ACM (JACM), 62(2):1–23, 2015.
  • Fonollosa et al. (2015) Jordi Fonollosa, Sadique Sheik, Ramón Huerta, and Santiago Marco. Reservoir computing compensates slow response of chemosensor arrays exposed to fast varying gas concentrations in continuous monitoring. Sensors and Actuators B: Chemical, 215:618–629, 2015.
  • Germain et al. (2015) Mathieu Germain, Karol Gregor, Iain Murray, and Hugo Larochelle. MADE: Masked autoencoder for distribution estimation. In 32nd International Conference on Machine Learning (ICML), pp. 881–889, 2015.
  • Gillis (2020) Nicolas Gillis. Nonnegative Matrix Factorization. Society for Industrial and Applied Mathematics (SIAM), 2020.
  • Glasser et al. (2019) Ivan Glasser, Ryan Sweke, Nicola Pancotti, Jens Eisert, and Ignacio Cirac. Expressive power of tensor-network factorizations for probabilistic modeling. In Advances in Neural Information Processing Systems 32 (NeurIPS), pp.  1498–1510. Curran Associates, Inc., 2019.
  • Hackbusch (2012) Wolfgang Hackbusch. Tensor spaces and numerical tensor calculus, volume 42. Springer, 2012.
  • Han et al. (2018) Zhao-Yu Han, Jun Wang, Heng Fan, Lei Wang, and Pan Zhang. Unsupervised generative modeling using matrix product states. Physical Review X, 8:031012, 2018.
  • Hebrail & Berard (2012) Georges Hebrail and Alice Berard. Individual household electric power consumption. UCI Machine Learning Repository, 2012.
  • Hinton (2002) Geoffrey E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14:1771–1800, 2002.
  • Hong & Gao (2021) Xia Hong and Junbin Gao. Estimating the square root of probability density function on Riemannian manifold. Expert Systems - The Journal of Knowledge Engineering, 38(7), 2021.
  • Hoory et al. (2006) Shlomo Hoory, Nathan Linial, and Avi Wigderson. Expander graphs and their applications. Bulletin of the American Mathematical Society, 43(4):439–561, 2006.
  • Huggins et al. (2018) William J. Huggins, Piyush S. Patil, Bradley K. Mitchell, K. Birgitta Whaley, and Edwin Miles Stoudenmire. Towards quantum machine learning with tensor networks. Quantum Science and Technology, 4, 2018.
  • Jaini et al. (2018) Priyank Jaini, Pascal Poupart, and Yaoliang Yu. Deep homogeneous mixture models: Representation, separation, and approximation. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (eds.), Advances in Neural Information Processing Systems (NeurIPS), volume 31. Curran Associates, Inc., 2018.
  • Jiang et al. (1999) Renyan Jiang, Ming J. Zuo, and Han-Xiong Li. Weibull and inverse weibull mixture models allowing negative weights. Reliability Engineering & System Safety, 66(3):227–234, 1999.
  • Kingma & Ba (2015) Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations (ICLR), 2015.
  • Kolda & Bader (2009) Tamara G. Kolda and Brett W. Bader. Tensor decompositions and applications. Society of Industrial and Applied Mathematics (SIAM) Review, 51(3):455–500, 2009.
  • LeCun et al. (2006) Yann LeCun, Sumit Chopra, Raia Hadsell, Marc’Aurelio Ranzato, and Fujie Huang. A tutorial on energy-based learning. Predicting Structured Data, 2006.
  • Liu & Van den Broeck (2021) Anji Liu and Guy Van den Broeck. Tractable regularization of probabilistic circuits. In Advances in Neural Information Processing Systems 34 (NeurIPS), pp.  3558–3570. Curran Associates, Inc., 2021.
  • Liu et al. (2023) Anji Liu, Honghua Zhang, and Guy Van den Broeck. Scaling up probabilistic circuits by latent variable distillation. In 11th International Conference on Learning Representations (ICLR), 2023.
  • Liu & Wang (2018) Jin-Guo Liu and Lei Wang. Differentiable learning of quantum circuit Born machines. Physical Review A, 98(6):062324, 2018.
  • Loconte et al. (2023) Lorenzo Loconte, Nicola Di Mauro, Robert Peharz, and Antonio Vergari. How to turn your knowledge graph embeddings into generative models via probabilistic circuits. In Advances in Neural Information Processing Systems 36 (NeurIPS). Curran Associates, Inc., 2023.
  • Mari et al. (2023) Antonio Mari, Gennaro Vessio, and Antonio Vergari. Unifying and understanding overparameterized circuit representations via low-rank tensor decompositions. In 6th Workshop on Tractable Probabilistic Modeling, 2023.
  • Marteau-Ferey et al. (2020) Ulysse Marteau-Ferey, Francis Bach, and Alessandro Rudi. Non-parametric models for non-negative functions. In Advances in Neural Information Processing Systems 33 (NeurIPS), pp.  12816–12826, 2020.
  • Martens & Medabalimi (2014) James Martens and Venkatesh Medabalimi. On the expressive efficiency of sum product networks. arXiv preprint arXiv:1411.7717, 2014.
  • Martin et al. (2001) David Martin, Charless Fowlkes, Doron Tal, and Jitendra Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In 8th International Conference on Computer Vision (ICCV), volume 2, pp.  416–423. IEEE, 2001.
  • Mauá et al. (2018) D. Mauá, Diarmaid Conaty, Fabio Gagliardi Cozman, Katja Poppenhaeger, and Cassio Polpo de Campos. Robustifying sum-product networks. International Journal of Approximate Reasoning, 101:163–180, 2018.
  • McLachlan et al. (2019) Geoffrey J. McLachlan, Sharon X. Lee, and Suren I. Rathnayake. Finite mixture models. Annual Review of Statistics and its Application, 6:355–378, 2019.
  • Nguyen et al. (2019) TrungTin Nguyen, Hien Duy Nguyen, Faicel Chamroukhi, and Geoffrey J. McLachlan. Approximation by finite mixtures of continuous density functions that vanish at infinity. Cogent Mathematics & Statistics, 7, 2019.
  • Novikov et al. (2021) Georgii S. Novikov, Maxim E. Panov, and Ivan V. Oseledets. Tensor-train density estimation. In 37th Conference on Uncertainty in Artificial Intelligence (UAI), volume 161 of Proceedings of Machine Learning Research, pp. 1321–1331. PMLR, 2021.
  • Orús (2013) Román Orús. A practical introduction to tensor networks: Matrix product states and projected entangled pair states. Annals of Physics, 349:117–158, 2013.
  • Papamakarios et al. (2017) George Papamakarios, Theo Pavlakou, and Iain Murray. Masked autoregressive flow for density estimation. In Advances in Neural Information Processing Systems 30 (NeurIPS), pp.  2338–2347. Curran Associates, Inc., 2017.
  • Papamakarios et al. (2021) George Papamakarios, Eric Nalisnick, Danilo Jimenez Rezende, Shakir Mohamed, and Balaji Lakshminarayanan. Normalizing flows for probabilistic modeling and inference. The Journal of Machine Learning Research (JMLR), 22(1):2617–2680, 2021.
  • Peharz et al. (2017) Robert Peharz, Robert Gens, Franz Pernkopf, and Pedro M. Domingos. On the latent variable interpretation in sum-product networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(10):2030–2044, 2017.
  • Peharz et al. (2020a) Robert Peharz, Steven Lang, Antonio Vergari, Karl Stelzner, Alejandro Molina, Martin Trapp, Guy Van Den Broeck, Kristian Kersting, and Zoubin Ghahramani. Einsum networks: Fast and scalable learning of tractable probabilistic circuits. In 37th International Conference on Machine Learning (ICML), volume 119 of Proceedings of Machine Learning Research, pp. 7563–7574. PMLR, 2020a.
  • Peharz et al. (2020b) Robert Peharz, Antonio Vergari, Karl Stelzner, Alejandro Molina, Xiaoting Shao, Martin Trapp, Kristian Kersting, and Zoubin Ghahramani. Random sum-product networks: A simple and effective approach to probabilistic deep learning. In 35th Conference on Uncertainty in Artificial Intelligence (UAI), volume 115 of Proceedings of Machine Learning Research, pp. 334–344. PMLR, 2020b.
  • Pérez-García et al. (2007) David Pérez-García, F. Verstraete, Michael M. Wolf, and Juan Ignacio Cirac. Matrix product state representations. Quantum Information and Computing, 7(5):401–430, 2007. ISSN 1533-7146.
  • Piegl & Tiller (1995) Les A. Piegl and Wayne Tiller. The NURBS book. In Monographs in Visual Communication, 1995.
  • Pinheiro & Vidakovic (1997) Aluisio Pinheiro and Brani Vidakovic. Estimating the square root of a density via compactly supported wavelets. Computational Statistics and Data Analysis, 25(4):399–415, 1997.
  • Pipatsrisawat & Darwiche (2008) Knot Pipatsrisawat and Adnan Darwiche. New compilation languages based on structured decomposability. In 23rd Conference on Artificial Intelligence (AAAI), volume 8, pp.  517–522, 2008.
  • Poon & Domingos (2011) Hoifung Poon and Pedro Domingos. Sum-product networks: A new deep architecture. In IEEE International Conference on Computer Vision Workshops (ICCV Workshops), pp.  689–690. IEEE, 2011.
  • Rabusseau & Denis (2014) Guillaume Rabusseau and François Denis. Learning negative mixture models by tensor decompositions. arXiv preprint arXiv:1403.4224, 2014.
  • Rasmussen & Williams (2005) Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. Adaptive Computation and Machine Learning. MIT Press, 2005.
  • Roe et al. (2004) Byron P. Roe, Hai-Jun Yang, Ji Zhu, Yong Liu, Ion Stancu, and Gordon McGregor. Boosted decision trees as an alternative to artificial neural networks for particle identification. Nuclear Instruments & Methods in Physics Research Section A-accelerators Spectrometers Detectors and Associated Equipment, 543:577–584, 2004.
  • Roughgarden (2016) Tim Roughgarden. Communication complexity (for algorithm designers). Foundations and Trends® in Theoretical Computer Science, 11(3–4):217–404, 2016.
  • Rudi & Ciliberto (2021) Alessandro Rudi and Carlo Ciliberto. PSD representations for effective probability models. In Advances in Neural Information Processing Systems 34 (NeurIPS), pp.  19411–19422. Curran Associates, Inc., 2021.
  • Schölkopf & Smola (2001) Bernhard Schölkopf and Alex Smola. Learning with kernels: support vector machines, regularization, optimization, and beyond. In Adaptive Computation and Machine Learning Series. MIT Press, 2001.
  • Schollwoeck (2010) Ulrich Schollwoeck. The density-matrix renormalization group in the age of matrix product states. Annals of Physics, 326:96–192, 2010.
  • Seeger (2005) Matthias Seeger. Expectation propagation for exponential families. 2005.
  • Shen et al. (2016) Yujia Shen, Arthur Choi, and Adnan Darwiche. Tractable operations for arithmetic circuits of probabilistic models. In Advances in Neural Information Processing Systems 29 (NeurIPS). Curran Associates, Inc., 2016.
  • Shih & Ermon (2020) Andy Shih and Stefano Ermon. Probabilistic circuits for variational inference in discrete graphical models. In Advances in Neural Information Processing Systems 33 (NeurIPS). Curran Associates, Inc., 2020.
  • Shpilka & Yehudayoff (2010) Amir Shpilka and Amir Yehudayoff. Arithmetic circuits: A survey of recent results and open questions. Foundations and Trends in Theoretical Computer Science, 5:207–388, 2010.
  • Sidheekh et al. (2023) Sahil Sidheekh, Kristian Kersting, and Sriraam Natarajan. Probabilistic flow circuits: Towards unified deep models for tractable probabilistic inference. In 39th Conference on Uncertainty in Artificial Intelligence (UAI), volume 216 of Proceedings of Machine Learning Research, pp. 1964–1973. PMLR, 2023.
  • Sladek et al. (2023) Aleksanteri M Sladek, Martin Trapp, and Arno Solin. Encoding negative dependencies in probabilistic circuits. In 6th Workshop on Tractable Probabilistic Modeling, 2023.
  • Stoudenmire & Schwab (2016) Edwin Stoudenmire and David J Schwab. Supervised learning with tensor networks. In Advances in Neural Information Processing Systems 29 (NeurIPS), pp.  4799–4807. Curran Associates, Inc., 2016.
  • Tsuchida et al. (2023) Russell Tsuchida, Cheng Soon Ong, and Dino Sejdinovic. Squared neural families: A new class of tractable density models. arXiv preprint arXiv:2305.13552, 2023.
  • Valiant (1979) Leslie G. Valiant. Negation can be exponentially powerful. In 11th Annual ACM Symposium on Theory of Computing, pp. 189–196, 1979.
  • Vergari et al. (2019a) Antonio Vergari, Nicola Di Mauro, and Floriana Esposito. Visualizing and understanding sum-product networks. Machine Learning, 108(4):551–573, 2019a.
  • Vergari et al. (2019b) Antonio Vergari, Nicola Di Mauro, and Guy Van den Broeck. Tractable probabilistic models: Representations, algorithms, learning, and applications. Tutorial at the 35th Conference on Uncertainty in Artificial Intelligence (UAI), 2019b.
  • Vergari et al. (2021) Antonio Vergari, YooJung Choi, Anji Liu, Stefano Teso, and Guy Van den Broeck. A compositional atlas of tractable circuit operations for probabilistic inference. In Advances in Neural Information Processing Systems 34 (NeurIPS), pp.  13189–13201. Curran Associates, Inc., 2021.
  • Vermeulen et al. (1992) Allan H. Vermeulen, Richard H. Bartels, and Glenn R. Heppler. Integrating products of B-splines. SIAM Journal on Scientific and Statistical Computing, 13:1025–1038, 1992.
  • Wang et al. (2022) Benjie Wang, Matthew R. Wicker, and Marta Kwiatkowska. Tractable uncertainty for structure learning. In 39th International Conference on Machine Learning (ICML), pp.  23131–23150. PMLR, 2022.
  • Wenliang et al. (2019) Li Wenliang, Danica J. Sutherland, Heiko Strathmann, and Arthur Gretton. Learning deep kernels for exponential family densities. In 36th International Conference on Machine Learning (ICML), volume 97 of Proceedings of Machine Learning Research, pp. 6737–6746. PMLR, 2019.
  • Zhang & Zhang (2005) Baibo Zhang and Changshui Zhang. Finite mixture models with negative components. In 4th International Conference on Machine Learning and Data Mining in Pattern Recognition (MLDM), pp.  31–41. Springer, 2005.
  • Zhang et al. (2021) Honghua Zhang, Brendan Juba, and Guy Van den Broeck. Probabilistic generating circuits. In International Conference on Machine Learning, pp. 12447–12457. PMLR, 2021.
  • Zhang et al. (2023) Honghua Zhang, Meihua Dang, Nanyun Peng, and Guy Van den Broeck. Tractable control for autoregressive language generation. In 40th International Conference on Machine Learning (ICML), volume 202 of Proceedings of Machine Learning Research, pp. 40932–40945. PMLR, 2023.

Appendix A Circuits

Figure A.1: Computational units can be grouped into layers so as to build a tensorized circuit. Sum units, each parameterized by a row of $\mathbf{W}\in\mathbb{R}^{3\times 3}$ (left, in purple), form a sum layer parameterized by $\mathbf{W}$ (right). Product units (left, in red) form a Hadamard product layer (right). Input units (left, in yellow) form an input layer computing the same functions (right).

In Sec. 3 we introduced circuits in a tensorized formalism. Here we instead present the definitions and properties of circuits as they are usually defined in the literature, which will be used in App. B.

Definition A.1 (Circuit (Choi et al., 2020; Vergari et al., 2021)).

A circuit $c$ is a parameterized computational graph over variables $\mathbf{X}$ encoding a function $c(\mathbf{X})$ and comprising three kinds of computational units: input, product, and sum. Each product or sum unit $n$ receives as inputs the outputs of other units, denoted with the set $\mathsf{in}(n)$. Each unit $n$ encodes a function $c_n$ defined as: (i) $f_n(\mathsf{sc}(n))$ if $n$ is an input unit, where $f_n$ is a function over variables $\mathsf{sc}(n)\subseteq\mathbf{X}$, called its scope, (ii) $\prod_{i\in\mathsf{in}(n)} c_i(\mathsf{sc}(n_i))$ if $n$ is a product unit, and (iii) $\sum_{i\in\mathsf{in}(n)} w_i c_i(\mathsf{sc}(n_i))$ if $n$ is a sum unit, with $w_i\in\mathbb{R}$ denoting the weighted sum parameters.
The scope of a product or sum unit $n$ is the union of the scopes of its inputs, i.e., $\mathsf{sc}(n)=\bigcup_{i\in\mathsf{in}(n)}\mathsf{sc}(i)$.
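As a minimal sketch of Def. A.1, a circuit can be evaluated by recursing over its units. The tuple encoding and the helper names `evaluate` and `scope` are assumptions made for illustration and do not come from the paper; for simplicity, each input unit here is defined over a single binary variable.

```python
from math import prod

# Hypothetical encoding of units (for illustration only):
#   ("input", variable_name, function)
#   ("product", [child units])
#   ("sum", [(weight, child unit), ...])   # weights may be negative

def evaluate(unit, x):
    """Evaluate the function c_n encoded by `unit` on the assignment `x` (a dict)."""
    kind = unit[0]
    if kind == "input":
        _, var, f = unit
        return f(x[var])
    if kind == "product":
        return prod(evaluate(child, x) for child in unit[1])
    if kind == "sum":
        return sum(w * evaluate(child, x) for w, child in unit[1])
    raise ValueError(f"unknown unit kind: {kind}")

def scope(unit):
    """Scope of a unit: its variable for inputs, else the union of its inputs' scopes."""
    kind = unit[0]
    if kind == "input":
        return frozenset([unit[1]])
    children = unit[1] if kind == "product" else [child for _, child in unit[1]]
    return frozenset().union(*(scope(child) for child in children))

# A two-component mixture over binary X1, X2 with one negative weight.
a1 = ("input", "X1", lambda v: [0.5, 0.5][v])
a2 = ("input", "X2", lambda v: [0.5, 0.5][v])
b1 = ("input", "X1", lambda v: [1.0, 0.0][v])
b2 = ("input", "X2", lambda v: [1.0, 0.0][v])
circuit = ("sum", [(2.0, ("product", [a1, a2])), (-0.5, ("product", [b1, b2]))])

value = evaluate(circuit, {"X1": 0, "X2": 0})   # 2.0 * 0.25 - 0.5 * 1.0
circuit_scope = scope(circuit)
```

With these weights the circuit is non-negative on all four assignments, yet the second component removes all mass from the assignment (0, 0).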

Note that tensorized circuits (Def. 1) are circuits where each input (resp. product and sum) layer consists of scalar input (resp. product and sum) units. For example, Fig. A.1 shows how computational units are grouped into layers. A probabilistic circuit (PC) is defined as a circuit encoding a non-negative function. PCs that are smooth and decomposable (Def. A.2) enable computing the partition function and, more generally, performing variable marginalization efficiently (Prop. A.1).

Definition A.2 (Smoothness and decomposability (Darwiche & Marquis, 2002)).

A circuit is smooth if, for every sum unit $n$, its inputs all depend on the same variables, i.e., $\forall i,j\in\mathsf{in}(n)\colon \mathsf{sc}(i)=\mathsf{sc}(j)$. A circuit is decomposable if the inputs of every product unit $n$ depend on disjoint sets of variables, i.e., $\forall i,j\in\mathsf{in}(n),\ i\neq j\colon \mathsf{sc}(i)\cap\mathsf{sc}(j)=\varnothing$.
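The two conditions of Def. A.2 only involve scopes, so they can be checked without evaluating the circuit. The following is a sketch under an assumed encoding in which a unit is a pair `(kind, payload)`, sum weights are omitted, and input units list their variables; the function names are hypothetical.

```python
def scope(unit):
    """Scope of a unit; input units carry a list of variable names as payload."""
    kind, payload = unit
    if kind == "input":
        return frozenset(payload)
    return frozenset().union(*(scope(child) for child in payload))

def units(unit):
    """Yield every unit reachable from `unit` (shared units may be visited twice)."""
    yield unit
    if unit[0] != "input":
        for child in unit[1]:
            yield from units(child)

def is_smooth(circuit):
    """Every sum unit must have inputs that share exactly one scope."""
    return all(
        len({scope(child) for child in u[1]}) == 1
        for u in units(circuit) if u[0] == "sum"
    )

def is_decomposable(circuit):
    """Every product unit must have inputs with pairwise disjoint scopes."""
    for u in units(circuit):
        if u[0] != "product":
            continue
        seen = set()
        for child in u[1]:
            s = scope(child)
            if seen & s:          # a variable appears in two inputs
                return False
            seen |= s
    return True

x1 = ("input", ["X1"])
x2 = ("input", ["X2"])
good = ("sum", [("product", [x1, x2]), ("product", [x1, x2])])
not_smooth = ("sum", [x1, x2])              # inputs have different scopes
not_decomposable = ("product", [x1, x1])    # inputs share the variable X1
```

Here `good` has the shape of a shallow mixture of factorized components, which satisfies both properties.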

Proposition A.1 (Tractability (Choi et al., 2020)).

Let $c$ be a smooth and decomposable circuit over variables $\mathbf{X}$ whose input units can be integrated efficiently. Then for any $\mathbf{Z}\subseteq\mathbf{X}$ and $\mathbf{y}$ an assignment to variables in $\mathbf{X}\setminus\mathbf{Z}$, the quantity $\int c(\mathbf{y},\mathbf{z})\,\mathrm{d}\mathbf{z}$ can be computed exactly in time and space $\Theta(|c|)$, where $|c|$ denotes the size of the circuit, i.e., the number of connections in the computational graph.
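To illustrate Prop. A.1, the sketch below marginalizes variables with a single bottom-up pass by summing out the input units whose variable is marginalized, and compares the result against explicit enumeration. It assumes binary variables (so integrals become sums) and a hypothetical tuple encoding of units; none of the names come from the paper.

```python
from itertools import product as cartesian
from math import prod

# Hypothetical encoding: ("input", var, table), ("product", [children]),
# ("sum", [(weight, child), ...]); every variable is binary.

def evaluate(unit, x, marginalized=frozenset()):
    """One bottom-up pass; input units over marginalized variables are summed out."""
    kind = unit[0]
    if kind == "input":
        _, var, table = unit
        return sum(table) if var in marginalized else table[x[var]]
    if kind == "product":
        return prod(evaluate(ch, x, marginalized) for ch in unit[1])
    return sum(w * evaluate(ch, x, marginalized) for w, ch in unit[1])

def brute_force(circuit, y, z_vars):
    """Reference value: enumerate all assignments to the variables in z_vars."""
    total = 0.0
    for values in cartesian([0, 1], repeat=len(z_vars)):
        total += evaluate(circuit, {**y, **dict(zip(z_vars, values))})
    return total

# Smooth and decomposable circuit with a negative weight.
circuit = ("sum", [
    (0.7, ("product", [("input", "X1", [0.2, 0.8]), ("input", "X2", [0.6, 0.4])])),
    (-0.3, ("product", [("input", "X1", [0.9, 0.1]), ("input", "X2", [0.5, 0.5])])),
])

fast = evaluate(circuit, {"X1": 1}, marginalized=frozenset({"X2"}))
slow = brute_force(circuit, {"X1": 1}, ["X2"])
partition = evaluate(circuit, {}, marginalized=frozenset({"X1", "X2"}))
```

The single pass is valid because smoothness lets the sum over $\mathbf{z}$ be pushed through sum units, and decomposability lets it be pushed through product units down to the inputs.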

The size of circuits in tensorized form is obtained by counting the number of connections between the scalar computational units (see Sec. A.1.1). Squaring circuits or their tensorized representation efficiently such that the resulting PC is smooth and decomposable (Def. A.2) requires the satisfaction of structured-decomposability, as shown by Pipatsrisawat & Darwiche (2008) and Vergari et al. (2021).

Definition A.3 (Structured-decomposability (Pipatsrisawat & Darwiche, 2008; Darwiche, 2009)).

A circuit is structured-decomposable if (1) it is smooth and decomposable, and (2) any pair of product units $n,m$ having the same scope decompose their scope at their input units in the same way.

Note that shallow MMs are both decomposable and structured-decomposable. As anticipated in Sec. 3, the expressiveness of squared non-monotonic PCs that are also deterministic is the same as monotonic deterministic PCs, which are used for tractable maximum-a-posteriori (MAP) inference. We prove it formally in Sec. B.6 by leveraging the definition of determinism that we show in Def. A.5. Before that, we introduce the definition of support of a computational unit.

Definition A.4 (Support (Choi et al., 2020)).

The support of a computational unit $n$ over variables $\mathbf{X}$ is defined as the set of value assignments to variables in $\mathbf{X}$ such that the output of $n$ is non-zero, i.e., $\mathsf{supp}(n)=\{\mathbf{x}\in\mathsf{val}(\mathbf{X})\mid c_n(\mathbf{x})\neq 0\}$, where $\mathsf{val}(\mathbf{X})$ denotes the domain of variables $\mathbf{X}$.

Definition A.5 (Determinism (Darwiche, 2001)).

A circuit $c$ is deterministic if for any sum unit $n\in c$ its inputs have disjoint support (Def. A.4), i.e., $\forall i,j\in\mathsf{in}(n),\ i\neq j\colon \mathsf{supp}(i)\cap\mathsf{supp}(j)=\varnothing$.

A.1 Tensorized Circuits

Def. 1 can be further generalized by introducing Kronecker product layers, which are the building blocks of other tensorized circuit architectures, such as randomized and tensorized sum-product networks (RAT-SPNs) (Peharz et al., 2020b) and einsum networks (EiNets) (Peharz et al., 2020a).

Definition A.6 (Tensorized circuit).

A tensorized circuit $c$ is a parameterized computational graph encoding a function $c(\mathbf{X})$ and comprising three kinds of layers: input, product, and sum. Each layer comprises computational units defined over the same set of variables, also called its scope, and every non-input layer receives input from one or more layers. The scope of each non-input layer is the union of the scopes of its inputs, and the scope of the output layer computing $c(\mathbf{X})$ is $\mathbf{X}$. Each input layer $\bm{\ell}$ has scope $\mathbf{Y}\subseteq\mathbf{X}$ and computes a collection of $K$ functions $f_i(\mathbf{Y})\in\mathbb{R}$, i.e., $\bm{\ell}$ outputs a $K$-dimensional vector. Each product layer $\bm{\ell}$ computes either a Hadamard (or element-wise) or a Kronecker product over the $N$ layers it receives as input, i.e., $\bm{\ell}=\odot_{i=1}^{N}\bm{\ell}_i$ or $\otimes_{i=1}^{N}\bm{\ell}_i$, respectively.
A sum layer with $S$ sum units that receives input from a previous layer $\bm{\ell}\in\mathbb{R}^{K}$ is parameterized by $\mathbf{W}\in\mathbb{R}^{S\times K}$ and computes $\mathbf{W}\bm{\ell}$.
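The three layer operations of Def. A.6 can be sketched on plain Python lists as follows. The function names `hadamard`, `kronecker`, and `sum_layer` are hypothetical, and the vectors stand for the outputs of layers already evaluated on some input.

```python
def hadamard(*layers):
    """Element-wise product of N vectors of equal length K."""
    out = [1.0] * len(layers[0])
    for layer in layers:
        out = [a * b for a, b in zip(out, layer)]
    return out

def kronecker(*layers):
    """Kronecker product of N vectors; the output has length K^N if all have length K."""
    out = [1.0]
    for layer in layers:
        out = [a * b for a in out for b in layer]
    return out

def sum_layer(W, layer):
    """Matrix-vector product W @ layer, with W given as S rows of length K."""
    return [sum(w * v for w, v in zip(row, layer)) for row in W]

l1 = [1.0, 2.0]
l2 = [3.0, -1.0]
W = [[0.5, -0.5], [1.0, 1.0], [0.0, 2.0]]   # S = 3, K = 2, negative entries allowed

h = hadamard(l1, l2)        # [3.0, -2.0]
k = kronecker(l1, l2)       # [3.0, -1.0, 6.0, -2.0]
s = sum_layer(W, h)         # [2.5, 1.0, -4.0]
```

A practical implementation would batch these operations over data points and use array libraries; lists are used here only to keep the sketch dependency-free.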

A.1.1 Size of Tensorized Circuits

The time and space complexity of evaluating a circuit is linear in its size. The size |c|𝑐|c|| italic_c | of a circuit c𝑐citalic_c (Def. A.1) is obtained by counting the number of input connections of each scalar product or sum unit. In other words, it is the number of edges in the computational graph.

If $c$ is a tensorized circuit, then its size is obtained by counting the number of connections in its non-tensorized form. Fig. A.1 shows part of a tensorized circuit and its non-tensorized form. For a sum layer, the number of scalar input connections is the size of its parameterization matrix, i.e., $S\cdot K$ if it is parameterized by $\mathbf{W}\in\mathbb{R}^{S\times K}$. If $\bm{\ell}$ is a Hadamard product layer computing $\bm{\ell}=\odot_{i=1}^{N}\bm{\ell}_i$, where each $\bm{\ell}_i$ outputs a $K$-dimensional vector, then the number of its scalar input connections is $N\cdot K$. For a Kronecker product layer as in the more general Def. A.6, i.e., $\bm{\ell}=\otimes_{i=1}^{N}\bm{\ell}_i$ where each $\bm{\ell}_i$ outputs a $K$-dimensional vector, the number of its scalar input connections is $K^{N+1}$.
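The counts for sum and Hadamard layers can be turned into a small size calculator. This is only a sketch: the list-of-layers description and the function names are assumptions, input layers contribute no connections because only the inputs of product and sum units are counted, and Kronecker layers are left out.

```python
def sum_layer_size(S, K):
    """A sum layer parameterized by W in R^{S x K} has S * K scalar connections."""
    return S * K

def hadamard_layer_size(N, K):
    """Each of the K product units receives one input from each of the N layers."""
    return N * K

def circuit_size(layers):
    """Total size of a tensorized circuit given as a list of (kind, params) pairs."""
    sizes = {
        "input": lambda *params: 0,
        "hadamard": hadamard_layer_size,
        "sum": sum_layer_size,
    }
    return sum(sizes[kind](*params) for kind, params in layers)

# Two input layers, a binary Hadamard layer with K = 3, and a 3 x 3 sum layer.
layers = [("input", ()), ("input", ()), ("hadamard", (2, 3)), ("sum", (3, 3))]
total = circuit_size(layers)   # 2 * 3 + 3 * 3 = 15
```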

A.2 Tractable Exact Sampling

Each sum unit in a monotonic PC can be interpreted as a finitely discrete latent variable that can assume as many values as the number of input connections (Peharz et al., 2017). As such, a monotonic PC can be seen as a hierarchical MM. This allows us to sample exactly from the modeled distribution by (1) recursively sampling latent variables until input units are reached, and (2) sampling observed variables from the distributions modeled by input units (Vergari et al., 2019a).

Such a probabilistic interpretation of inner sum units is not possible for NPC$^2$s, as they can output negative values. However, since NPC$^2$s are smooth and decomposable (Def. A.2), they support efficient marginalization and hence conditioning (1). This allows us to still sample exactly from the modeled distribution via inverse transform sampling. That is, we choose a variable ordering $X_1,X_2,\ldots,X_D$ and sample the variables in an autoregressive fashion, i.e., $x_1\sim p(X_1)$, $x_2\sim p(X_2\mid x_1)$, $\ldots$, $x_D\sim p(X_D\mid x_1,\ldots,x_{D-1})$, which is still linear in the number of variables.
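The autoregressive procedure can be sketched for a shallow squared mixture $c(\mathbf{x})^2$ with $c(\mathbf{x})=\sum_k w_k\prod_d f_{k,d}(x_d)$ over discrete variables. Everything below is an illustrative assumption rather than the paper's implementation: the function names are hypothetical, the model is a flat mixture instead of a deep circuit, and marginals are recomputed from scratch at every step instead of reusing partial results.

```python
import random

def squared_marginal(weights, tables, prefix):
    """Sum of c(x)^2 over all completions of `prefix`.

    tables[k][d] is the list of values f_{k,d}(v) for each value v of variable d.
    """
    D = len(tables[0])
    total = 0.0
    for wi, ti in zip(weights, tables):
        for wj, tj in zip(weights, tables):
            term = wi * wj
            for d in range(D):
                if d < len(prefix):
                    # observed variable: evaluate both components
                    term *= ti[d][prefix[d]] * tj[d][prefix[d]]
                else:
                    # marginalized variable: sum the product of the two inputs
                    term *= sum(a * b for a, b in zip(ti[d], tj[d]))
            total += term
    return total

def sample(weights, tables, rng):
    """Draw x_1, ..., x_D one variable at a time via inverse transform sampling."""
    D = len(tables[0])
    prefix = []
    for d in range(D):
        domain = range(len(tables[0][d]))
        masses = [squared_marginal(weights, tables, prefix + [v]) for v in domain]
        u = rng.random() * sum(masses)
        cumulative = 0.0
        for v, mass in zip(domain, masses):
            cumulative += mass
            if u < cumulative:
                break
        prefix.append(v)
    return tuple(prefix)

# Two components over two binary variables; the negative weight cancels c(0, 0).
weights = [1.0, -0.25]
tables = [
    [[0.5, 0.5], [0.5, 0.5]],
    [[1.0, 0.0], [1.0, 0.0]],
]
Z = squared_marginal(weights, tables, [])
rng = random.Random(0)
draws = [sample(weights, tables, rng) for _ in range(200)]
```

In this example the subtraction makes $c(0,0)=0$, so the assignment (0, 0) has zero probability and the remaining three assignments are equally likely.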

Appendix B Proofs

B.1 Squaring Tensorized Circuits

Proposition B.1 (Correctness of Alg. 1).

Let $c$ be a tensorized structured-decomposable circuit (Def. 1 or its generalization in Def. A.3). Then Alg. 1 recursively constructs the layers of the squared tensorized PC $c^2$ such that $c^2$ is also structured-decomposable.

Proof.

The proof is by induction on the structure of $c$. Let $\bm{\ell}$ be a sum layer having as input $\bm{\ell}_{\mathsf{i}}$ and computing $\mathbf{W}\bm{\ell}_{\mathsf{i}}$, with $\mathbf{W}\in\mathbb{R}^{S\times K}$ and $\bm{\ell}_{\mathsf{i}}$ computing an output in $\mathbb{R}^{K}$. If $\bm{\ell}$ is the last layer of $c$ (i.e., the output layer), then $S=1$ since $c$ outputs a scalar, and the squared layer $\bm{\ell}^2$ must compute

$$\bm{\ell}^2=(\mathbf{W}\bm{\ell}_{\mathsf{i}})\cdot(\mathbf{W}\bm{\ell}_{\mathsf{i}})=(\mathbf{W}\otimes\mathbf{W})(\bm{\ell}_{\mathsf{i}}\otimes\bm{\ell}_{\mathsf{i}})=(\mathbf{W}\otimes\mathbf{W})\bm{\ell}_{\mathsf{i}}^2,$$

which requires squaring the input layer $\bm{\ell}_{\mathsf{i}}$. By the inductive hypothesis, the squared circuit having $\bm{\ell}_{\mathsf{i}}^2$ as output layer is structured-decomposable, hence so is the squared circuit having $\bm{\ell}^2$ as output layer. If $\bm{\ell}$ is a non-output sum layer, we still need to compute the Kronecker product of its input layer with itself. The squared layer $\bm{\ell}^2$ is again a sum layer that outputs an $S^2$-dimensional vector, i.e.,

2==(𝐖𝗂)(𝐖𝗂)=(𝐖𝐖)(𝗂𝗂)=(𝐖𝐖)𝗂2superscriptbold-ℓ2tensor-productbold-ℓbold-ℓtensor-product𝐖subscriptbold-ℓ𝗂𝐖subscriptbold-ℓ𝗂tensor-product𝐖𝐖tensor-productsubscriptbold-ℓ𝗂subscriptbold-ℓ𝗂tensor-product𝐖𝐖superscriptsubscriptbold-ℓ𝗂2\displaystyle\bm{\mathrm{\ell}}^{2}=\bm{\mathrm{\ell}}\otimes\bm{\mathrm{\ell}% }=\left(\bm{\mathrm{W}}\bm{\mathrm{\ell}}_{\mathsf{i}}\right)\otimes\left(\bm{% \mathrm{W}}\bm{\mathrm{\ell}}_{\mathsf{i}}\right)=(\bm{\mathrm{W}}\otimes\bm{% \mathrm{W}})(\bm{\mathrm{\ell}}_{\mathsf{i}}\otimes\bm{\mathrm{\ell}}_{\mathsf% {i}})=(\bm{\mathrm{W}}\otimes\bm{\mathrm{W}})\bm{\mathrm{\ell}}_{\mathsf{i}}^{2}bold_ℓ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT = bold_ℓ ⊗ bold_ℓ = ( bold_W bold_ℓ start_POSTSUBSCRIPT sansserif_i end_POSTSUBSCRIPT ) ⊗ ( bold_W bold_ℓ start_POSTSUBSCRIPT sansserif_i end_POSTSUBSCRIPT ) = ( bold_W ⊗ bold_W ) ( bold_ℓ start_POSTSUBSCRIPT sansserif_i end_POSTSUBSCRIPT ⊗ bold_ℓ start_POSTSUBSCRIPT sansserif_i end_POSTSUBSCRIPT ) = ( bold_W ⊗ bold_W ) bold_ℓ start_POSTSUBSCRIPT sansserif_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT

via the mixed-product property (L11-15 in Alg. 1). Let $\bm{\ell}$ be a binary Hadamard product layer (without loss of generality, we assume product layers have exactly two layers as inputs) computing $\bm{\ell}_{\mathsf{i}} \odot \bm{\ell}_{\mathsf{ii}}$ for input layers $\bm{\ell}_{\mathsf{i}}, \bm{\ell}_{\mathsf{ii}}$, each computing a $K$-dimensional vector. Then, the squared layer $\bm{\ell}^2$ computes the Hadamard product between $K^2$-dimensional vectors, i.e.,

$$\bm{\ell}^2 = (\bm{\ell}_{\mathsf{i}} \odot \bm{\ell}_{\mathsf{ii}}) \otimes (\bm{\ell}_{\mathsf{i}} \odot \bm{\ell}_{\mathsf{ii}}) = (\bm{\ell}_{\mathsf{i}} \otimes \bm{\ell}_{\mathsf{i}}) \odot (\bm{\ell}_{\mathsf{ii}} \otimes \bm{\ell}_{\mathsf{ii}}) = \bm{\ell}_{\mathsf{i}}^2 \odot \bm{\ell}_{\mathsf{ii}}^2$$

via the mixed-product property with respect to the Hadamard product. By inductive hypothesis, $\bm{\ell}_{\mathsf{i}}^2$ and $\bm{\ell}_{\mathsf{ii}}^2$ are the output layers of structured-decomposable circuits depending on disjoint sets of variables. As such, the circuit having $\bm{\ell}^2$ as output layer maintains structured-decomposability (L6-9 in Alg. 1). For the base case, we consider the squaring of an input layer $\bm{\ell}$ that computes $K$ functions $f_i$ over some variables $\mathbf{Y} \subseteq \mathbf{X}$. We replace $\bm{\ell}$ with its squaring $\bm{\ell}^2$, which encodes the products $f_i(\mathbf{Y}) f_j(\mathbf{Y})$, $1 \leq i, j \leq K$, by introducing $K^2$ functions $g_{ij}$ such that $g_{ij}(\mathbf{Y}) = f_i(\mathbf{Y}) f_j(\mathbf{Y})$ (L2-4 in Alg. 1).
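As an illustrative sanity check (not part of the original proof), the two identities used above for sum layers and Hadamard product layers can be verified numerically. The sketch below assumes NumPy, random real-valued parameters, and arbitrary sizes `S` and `K`; all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
S, K = 3, 4  # illustrative output and input sizes of the sum layer

# Sum layer: l = W l_i, with W in R^{S x K} and l_i in R^K.
W = rng.normal(size=(S, K))
l_in = rng.normal(size=K)
lhs_sum = np.kron(W @ l_in, W @ l_in)          # l (x) l
rhs_sum = np.kron(W, W) @ np.kron(l_in, l_in)  # (W (x) W)(l_i (x) l_i)

# Hadamard layer: l = l_i * l_ii, with both inputs in R^K.
l_i, l_ii = rng.normal(size=K), rng.normal(size=K)
lhs_had = np.kron(l_i * l_ii, l_i * l_ii)          # (l_i * l_ii) (x) (l_i * l_ii)
rhs_had = np.kron(l_i, l_i) * np.kron(l_ii, l_ii)  # l_i^2 * l_ii^2

sum_ok = np.allclose(lhs_sum, rhs_sum)
had_ok = np.allclose(lhs_had, rhs_had)
print(sum_ok, had_ok)
```

Both checks rely only on the mixed-product property of the Kronecker product, so they hold for weights of any sign, which is what allows subtractions in the circuit being squared.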

Squaring Kronecker product layers. If $\bm{\ell}$ is instead a binary Kronecker product layer, as in the more general Def. A.6, then the squared layer $\bm{\ell}^2$ computes the Kronecker product between $K^2$-dimensional vectors up to a permutation of the entries, i.e.,

$$\bm{\ell}^2 = (\bm{\ell}_{\mathsf{i}} \otimes \bm{\ell}_{\mathsf{ii}}) \otimes (\bm{\ell}_{\mathsf{i}} \otimes \bm{\ell}_{\mathsf{ii}}) = \mathbf{R}\left((\bm{\ell}_{\mathsf{i}} \otimes \bm{\ell}_{\mathsf{i}}) \otimes (\bm{\ell}_{\mathsf{ii}} \otimes \bm{\ell}_{\mathsf{ii}})\right) = \mathbf{R}\left(\bm{\ell}_{\mathsf{i}}^2 \otimes \bm{\ell}_{\mathsf{ii}}^2\right), \tag{6}$$

by introducing a $K^4 \times K^4$ permutation matrix $\mathbf{R}$ whose rows are all zeros except for one entry set to 1, which reorders the entries of $\bm{\ell}_{\mathsf{i}}^2 \otimes \bm{\ell}_{\mathsf{ii}}^2$ so as to recover the equality in Eq. 6. Note that such a permutation maintains decomposability (Def. A.2), and its application can be computed by a sum layer having $\mathbf{R}$ as fixed parameters. Moreover, by inductive hypothesis, the squared circuit having $\bm{\ell}^2$ as output layer is still structured-decomposable. Finally, Alg. B.1 generalizes Alg. 1 so as to support the squaring of Kronecker product layers as shown above (L10-11 in Alg. B.1). ∎
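To make the role of the permutation matrix in Eq. 6 concrete, the following sketch builds one valid $\mathbf{R}$ explicitly and checks the equality numerically. It assumes NumPy and its row-major flattening convention for `np.kron`; the construction via an index transpose is one possible way to obtain $\mathbf{R}$, not necessarily the one used by the authors.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 3  # illustrative layer size
a, b = rng.normal(size=K), rng.normal(size=K)  # outputs of l_i and l_ii

lhs = np.kron(np.kron(a, b), np.kron(a, b))  # (l_i (x) l_ii) (x) (l_i (x) l_ii)
rhs = np.kron(np.kron(a, a), np.kron(b, b))  # l_i^2 (x) l_ii^2

# rhs is indexed by (i, i', j, j'); lhs is indexed by (i, j, i', j').
# perm[n] is the position in rhs of the entry that lhs stores at position n.
n = K ** 4
perm = np.arange(n).reshape(K, K, K, K).transpose(0, 2, 1, 3).ravel()
R = np.eye(n)[perm]  # each row has a single entry set to 1

perm_ok = np.allclose(lhs, R @ rhs)
print(perm_ok)
```

Since `R` has fixed 0/1 entries, applying it is just a sum layer with frozen parameters, which matches the remark in the proof.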

Algorithm B.1 $\textsf{squareTensorizedCircuit}(\bm{\ell}, \mathcal{R})$

Input: A tensorized circuit (Def. A.6) having output layer $\bm{\ell}$ and defined on a tree RG rooted by $\mathcal{R}$.
Output: The tensorized squared circuit defined on the same tree RG having $\bm{\ell}^2$ as output layer computing $\bm{\ell} \otimes \bm{\ell}$.

1: if $\bm{\ell}$ is an input layer then
2:     $\bm{\ell}$ computes $K$ functions $f_i(\mathcal{R})$
3:     return an input layer $\bm{\ell}^2$ computing all $K^2$
4:         product combinations $f_i(\mathcal{R}) f_j(\mathcal{R})$
5: else if $\bm{\ell}$ is a product layer then
6:     $\{(\bm{\ell}_{\mathsf{i}}, \mathcal{R}_{\mathsf{i}}),\ (\bm{\ell}_{\mathsf{ii}}, \mathcal{R}_{\mathsf{ii}})\} \leftarrow \textsf{getInputs}(\bm{\ell}, \mathcal{R})$
7:     $\bm{\ell}_{\mathsf{i}}^2 \leftarrow \textsf{squareTensorizedCircuit}(\bm{\ell}_{\mathsf{i}}, \mathcal{R}_{\mathsf{i}})$
8:     $\bm{\ell}_{\mathsf{ii}}^2 \leftarrow \textsf{squareTensorizedCircuit}(\bm{\ell}_{\mathsf{ii}}, \mathcal{R}_{\mathsf{ii}})$
9:     if $\bm{\ell} = \bm{\ell}_{\mathsf{i}} \odot \bm{\ell}_{\mathsf{ii}}$ then return $\bm{\ell}_{\mathsf{i}}^2 \odot \bm{\ell}_{\mathsf{ii}}^2$
10:     else return $\mathbf{R}\left(\bm{\ell}_{\mathsf{i}}^2 \otimes \bm{\ell}_{\mathsf{ii}}^2\right)$, where $\mathbf{R}$ is
11:         a permutation matrix (see proof of Prop. B.1)
12: else  $\triangleright$ $\bm{\ell}$ is a sum layer
13:     $\{(\bm{\ell}_{\mathsf{i}}, \mathcal{R})\} \leftarrow \textsf{getInputs}(\bm{\ell}, \mathcal{R})$
14:     $\bm{\ell}_{\mathsf{i}}^2 \leftarrow \textsf{squareTensorizedCircuit}(\bm{\ell}_{\mathsf{i}}, \mathcal{R})$
15:     $\mathbf{W} \in \mathbb{R}^{S \times K} \leftarrow \textsf{getParameters}(\bm{\ell})$
16:     $\mathbf{W}' \in \mathbb{R}^{S^2 \times K^2} \leftarrow \mathbf{W} \otimes \mathbf{W}$
17:     return $\mathbf{W}' \bm{\ell}_{\mathsf{i}}^2$
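The recursion of Alg. B.1 can be sketched in a few lines of NumPy. This is a minimal illustration under simplifying assumptions, not the authors' implementation: layers are encoded as hypothetical nested tuples, every variable is discrete, each input layer stores its $K$ functions as a table, and the region graph argument is dropped because each input layer already records its variable.

```python
import itertools
import numpy as np

# Hypothetical layer encoding:
#   ("input", var, table)      table[x, i] = f_i(x) for discrete variable `var`
#   ("sum", W, child)          computes W @ child
#   ("hadamard", left, right)  computes left * right
#   ("kronecker", left, right) computes np.kron(left, right)

def out_dim(layer):
    kind = layer[0]
    if kind == "input":
        return layer[2].shape[1]
    if kind == "sum":
        return layer[1].shape[0]
    left, right = out_dim(layer[1]), out_dim(layer[2])
    return left if kind == "hadamard" else left * right

def evaluate(layer, x):
    kind = layer[0]
    if kind == "input":
        return layer[2][x[layer[1]]]
    if kind == "sum":
        return layer[1] @ evaluate(layer[2], x)
    a, b = evaluate(layer[1], x), evaluate(layer[2], x)
    return a * b if kind == "hadamard" else np.kron(a, b)

def square(layer):
    kind = layer[0]
    if kind == "input":  # all K^2 products f_i * f_j
        _, var, table = layer
        m, k = table.shape
        table2 = np.einsum("xi,xj->xij", table, table).reshape(m, k * k)
        return ("input", var, table2)
    if kind == "sum":  # W' = W (x) W applied to the squared input
        return ("sum", np.kron(layer[1], layer[1]), square(layer[2]))
    sq_l, sq_r = square(layer[1]), square(layer[2])
    if kind == "hadamard":
        return ("hadamard", sq_l, sq_r)
    # Kronecker layer: reorder entries with a fixed permutation matrix R.
    ka, kb = out_dim(layer[1]), out_dim(layer[2])
    n = (ka * kb) ** 2
    perm = np.arange(n).reshape(ka, ka, kb, kb).transpose(0, 2, 1, 3).ravel()
    return ("sum", np.eye(n)[perm], ("kronecker", sq_l, sq_r))

rng = np.random.default_rng(0)
m = 2  # number of states of each variable

def random_input(var, k):
    return ("input", var, rng.normal(size=(m, k)))

inner = ("sum", rng.normal(size=(3, 4)),
         ("kronecker", random_input(0, 2), random_input(1, 2)))
c = ("sum", rng.normal(size=(1, 3)), ("hadamard", inner, random_input(2, 3)))
c2 = square(c)

max_err = max(abs(evaluate(c2, x)[0] - evaluate(c, x)[0] ** 2)
              for x in itertools.product(range(m), repeat=3))
print(max_err)
```

The final check compares the output of the squared circuit with the square of the original output on every assignment, which is feasible here only because the example has three binary variables.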

B.2 Tractable Marginalization of NPC$^2$s

Proposition 1.

Let $c$ be a tensorized structured-decomposable circuit where the products of functions computed by each input layer can be tractably integrated. Any marginalization of $c^2$ obtained via Alg. 1 requires time and space $\mathcal{O}(L \cdot M^2)$, where $L$ is the number of layers in $c$ and $M$ is the maximum time required to evaluate one layer in $c$ (as detailed in Sec. A.1.1).

Proof.

Given $c$ by hypothesis, Prop. B.1 ensures that the PC built via Alg. 1 computes $c^2$ and is defined on the same tree RG (Def. 2) as $c$. As such, $c^2$ is structured-decomposable and hence also smooth and decomposable (see Def. A.3). Now, we make an argument about $c$ and $c^2$ in their non-tensorized form (Def. A.1) so as to leverage Prop. A.1 for tractable marginalization later. The size of $c$ is $|c| \in \mathcal{O}(L \cdot M)$, where $L$ is the number of layers and $M$ the maximum number of scalar input connections of each layer in $c$ (see Sec. A.1.1 for details). The size of $c^2$ is therefore $|c^2| \in \mathcal{O}(L \cdot M^2)$, since Alg. 1 squares the output dimension of each layer as well as the size of the parameterization matrix of each sum layer. Since $c^2$ is smooth and decomposable and the functions computed by its input layers can be tractably integrated, Prop. A.1 ensures we can marginalize any subset of variables in time and space $|c^2| \in \mathcal{O}(L \cdot M^2)$. ∎
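A small numerical illustration of this result: for a one-sum-layer circuit over two discrete variables, $c(x_1, x_2) = \mathbf{w}^\top(\mathbf{f}(x_1) \odot \mathbf{g}(x_2))$, the partition function and a marginal of $c^2$ follow from summing the squared input layers once and then doing a single feed-forward pass. The sketch assumes NumPy, discrete variables (so integration is summation), and hypothetical names `F`, `G`, `w`; it is compared against brute-force enumeration.

```python
import numpy as np

rng = np.random.default_rng(1)
m, K = 5, 3                  # states per variable, layer size
F = rng.normal(size=(m, K))  # F[x1, i] = f_i(x1)
G = rng.normal(size=(m, K))  # G[x2, i] = g_i(x2)
w = rng.normal(size=K)       # real weights, possibly negative

# Squared input layers and squared sum-layer parameters.
F2 = np.einsum("xi,xj->xij", F, F).reshape(m, K * K)
G2 = np.einsum("xi,xj->xij", G, G).reshape(m, K * K)
w2 = np.kron(w, w)

# Marginalize by summing each squared input layer over its own variable.
int_F2, int_G2 = F2.sum(axis=0), G2.sum(axis=0)
Z = w2 @ (int_F2 * int_G2)        # normalization constant of c^2
marg_x1 = (F2 * int_G2) @ w2 / Z  # p(x1), with x2 summed out

# Brute-force reference: enumerate all (x1, x2) pairs.
c = (F[:, None, :] * G[None, :, :]) @ w  # c[x1, x2]
Z_brute = (c ** 2).sum()
marg_brute = (c ** 2).sum(axis=1) / Z_brute
print(np.isclose(Z, Z_brute), np.allclose(marg_x1, marg_brute))
```

The circuit-based computation touches $K^2$-sized vectors once per layer, whereas the brute-force reference enumerates all $m^2$ joint states, which would scale exponentially with more variables.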

B.3 Representing PSD models within the language of NPC$^2$s

Proposition 2.

A PSD model with kernel function $\kappa$, defined over $d$ data points, and parameterized by a PSD matrix $\mathbf{A}$, can be represented as a mixture of squared NMMs (hence NPC$^2$s) in time $\mathcal{O}(d^3)$.

Proof.

The PSD model computes a non-negative function $f(\mathbf{x}; \mathbf{A}, \bm{\kappa}) = \bm{\kappa}(\mathbf{x})^\top \mathbf{A}\, \bm{\kappa}(\mathbf{x})$, where $\bm{\kappa}(\mathbf{x}) = [\kappa(\mathbf{x}, \mathbf{x}^{(1)}), \ldots, \kappa(\mathbf{x}, \mathbf{x}^{(d)})] \in \mathbb{R}^d$, with data points $\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(d)}$, and $\mathbf{A} \in \mathbb{R}^{d \times d}$ is PSD. Let $\mathbf{A} = \sum_{i=1}^{r} \lambda_i \mathbf{u}_i \mathbf{u}_i^\top$ be the eigendecomposition of $\mathbf{A}$ with rank $r$. Then we can rewrite $f(\mathbf{x}; \mathbf{A}, \bm{\kappa})$ as

$$f(\mathbf{x}; \mathbf{A}, \bm{\kappa}) = \bm{\kappa}(\mathbf{x})^\top \left(\sum\nolimits_{i=1}^{r} \lambda_i \mathbf{u}_i \mathbf{u}_i^\top\right) \bm{\kappa}(\mathbf{x}) = \sum\nolimits_{i=1}^{r} \lambda_i \left(\mathbf{u}_i^\top \bm{\kappa}(\mathbf{x})\right)^2,$$

where $\lambda_i > 0$ are the positive eigenvalues. Therefore, such a PSD model can be represented as a monotonic mixture of $r \leq d$ squared NMMs (Eq. 2), whose $d$ components computing $\bm{\kappa}(\mathbf{x})$ are shared. The eigendecomposition of $\mathbf{A}$ can be done in time $\mathcal{O}(d^3)$, and materializing each squared NMM (e.g., as in Fig. 1) requires time and space $\mathcal{O}(d^2)$. Note that if $\mathbf{A} = \mathbf{u}\mathbf{u}^\top$ is a rank-1 matrix, then $f(\mathbf{x}; \mathbf{A}, \bm{\kappa}) = \left(\mathbf{u}^\top \bm{\kappa}(\mathbf{x})\right)^2$ is exactly a squared NMM whose $d$ components compute $\bm{\kappa}(\mathbf{x})$. ∎
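The rewriting used in this proof can be checked numerically. The sketch below is an illustration under stated assumptions: it uses NumPy's `np.linalg.eigh`, a Gaussian (RBF) kernel chosen only for concreteness, a randomly generated rank-3 PSD matrix, and a hypothetical relative tolerance to discard numerically zero eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(2)
d, dim = 6, 2                      # number of data points, input dimension
data = rng.normal(size=(d, dim))   # data points x^(1), ..., x^(d)
M = rng.normal(size=(d, 3))
A = M @ M.T                        # PSD matrix of rank 3 by construction

def kappa(x, gamma=0.5):
    # Assumed kernel: Gaussian RBF evaluated against every data point.
    return np.exp(-gamma * np.sum((data - x) ** 2, axis=1))

# Eigendecomposition of the symmetric matrix A; keep positive eigenvalues.
eigvals, eigvecs = np.linalg.eigh(A)
keep = eigvals > 1e-10 * eigvals.max()
lam, U = eigvals[keep], eigvecs[:, keep]

def f_psd(x):
    k = kappa(x)
    return k @ A @ k

def f_mixture(x):
    # Monotonic mixture (weights lam > 0) of squared NMMs sharing kappa(x).
    return np.sum(lam * (U.T @ kappa(x)) ** 2)

xs = rng.normal(size=(20, dim))
max_err = max(abs(f_psd(x) - f_mixture(x)) for x in xs)
print(int(keep.sum()), max_err)
```

Each column of `U` plays the role of the (possibly negative) mixture weights of one squared NMM, while the eigenvalues act as the non-negative weights of the outer mixture.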

B.4 Relationship with Tensor Networks

In this section, we detail the construction of a tensorized structured-decomposable circuit (Def. 1) that is equivalent to a matrix product state (MPS) tensor network (Pérez-García et al., 2007), as we mention in Sec. 4. As such, applying the Born rule to retrieve a probabilistic model called a Born machine (BM) (Glasser et al., 2019) is equivalent to squaring the equivalent circuit (Sec. 3).

Proposition 3.

A BM encoding a $D$-dimensional tensor with $m$ states by squaring a rank $r$ MPS can be exactly represented as a structured-decomposable NPC$^2$ in $\mathcal{O}(D \cdot k^4)$ time and space, with $k \leq \min\{r^2, mr\}$.

Proof.

We prove it constructively, by using a transformation similar to the one used by Glasser et al. (2019) to represent a non-negative MPS factorization as a hidden Markov model (HMM). Let $\mathbf{X} = \{X_1, \ldots, X_D\}$ be a set of discrete variables each taking values in $\{1, \ldots, m\}$. Let $\mathcal{T}$ be a tensor with $D$ $m$-dimensional indices. Given an assignment $\mathbf{x} = \langle x_1, \ldots, x_D \rangle$ to $\mathbf{X}$, we factorize $\mathcal{T}$ via a rank $r$ MPS factorization, i.e.,

$$\mathcal{T}[x_1, \ldots, x_D] = \sum_{i_1=1}^{r} \sum_{i_2=1}^{r} \cdots \sum_{i_{D-1}=1}^{r} \mathbf{A}_1[x_1, i_1]\, \mathbf{A}_2[x_2, i_1, i_2] \cdots \mathbf{A}_D[x_D, i_{D-1}] \tag{7}$$

where $\mathbf{A}_1, \mathbf{A}_D \in \mathbb{R}^{m \times r}$ and $\mathbf{A}_j \in \mathbb{R}^{m \times r \times r}$ with $1 < j < D$, for indices $\{i_1, \ldots, i_{D-1}\}$ and denoting indexing with square brackets. To reduce $\mathcal{T}$ to being computed by a tensorized structured-decomposable circuit $c$, i.e., such that $c(\mathbf{x}) = \mathcal{T}[x_1, \ldots, x_D]$ for any $\mathbf{x}$, we perform the following construction. First, we perform a canonical polyadic (CP) decomposition (Kolda & Bader, 2009) of each $\mathbf{A}_j$ with $1 < j < D$, i.e.,

$$\mathbf{A}_j[x_j, i_{j-1}, i_j] = \sum_{s_j=1}^{k} \mathbf{B}_j[i_{j-1}, s_j]\, \mathbf{V}_j[x_j, s_j]\, \mathbf{C}_j[i_j, s_j]$$

where $k \leq \min\{r^2, mr\}$ is the maximum rank of the CP decomposition (Kolda & Bader, 2009), and $\mathbf{V}_j \in \mathbb{R}^{m \times k}$, $\mathbf{B}_j \in \mathbb{R}^{r \times k}$, $\mathbf{C}_j \in \mathbb{R}^{r \times k}$. Then, we “contract” each $\mathbf{C}_j$ with $\mathbf{B}_{j+1}$ by computing

$$\mathbf{W}_j[s_j, s_{j+1}] = \sum_{i_j=1}^{r} \mathbf{C}_j[i_j, s_j]\, \mathbf{B}_{j+1}[i_j, s_{j+1}]$$

with $\mathbf{W}_j \in \mathbb{R}^{k \times k}$ for $1 < j < D - 1$. In addition, we “contract” $\mathbf{C}_{D-1}$ with $\mathbf{A}_D$ by computing

$$\mathbf{V}_D[x_D, s_{D-1}] = \sum_{i_{D-1}=1}^{r} \mathbf{C}_{D-1}[i_{D-1}, s_{D-1}]\, \mathbf{A}_D[x_D, i_{D-1}].$$

In addition, for notation clarity we rename 𝐁2subscript𝐁2\bm{\mathrm{B}}_{2}bold_B start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT with 𝐖1subscript𝐖1\bm{\mathrm{W}}_{1}bold_W start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT and 𝐀1subscript𝐀1\bm{\mathrm{A}}_{1}bold_A start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT with 𝐕1subscript𝐕1\bm{\mathrm{V}}_{1}bold_V start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT. By doing so, we can rewrite Eq. 7 as a sum with indices {i1,s2,,sD1}subscript𝑖1subscript𝑠2subscript𝑠𝐷1\{i_{1},s_{2},\ldots,s_{D-1}\}{ italic_i start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_s start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , … , italic_s start_POSTSUBSCRIPT italic_D - 1 end_POSTSUBSCRIPT } over products, i.e.,

$$\begin{aligned}
\mathcal{T}[x_1, \ldots, x_D] = {} & \sum_{i_1=1}^{r} \bm{\mathrm{V}}_1[x_1, i_1] \sum_{s_2=1}^{k} \bm{\mathrm{W}}_1[i_1, s_2] \, \bm{\mathrm{V}}_2[x_2, s_2] \cdots \\
& \cdots \sum_{s_{D-2}=1}^{k} \bm{\mathrm{W}}_{D-3}[s_{D-3}, s_{D-2}] \, \bm{\mathrm{V}}_{D-2}[x_{D-2}, s_{D-2}] \cdot \\
& \cdot \sum_{s_{D-1}=1}^{k} \bm{\mathrm{W}}_{D-2}[s_{D-2}, s_{D-1}] \, \bm{\mathrm{V}}_{D-1}[x_{D-1}, s_{D-1}] \, \bm{\mathrm{V}}_D[x_D, s_{D-1}].
\end{aligned}$$

Fig. B.1 shows an example of such MPS factorization via CP decompositions. We see that we can encode the products over the same indices using Hadamard product layers, and summations over indices $\{i_1, s_2, \ldots, s_{D-1}\}$ with sum layers parameterized by the $\bm{\mathrm{W}}_j$, with $1 \leq j < D-1$. More precisely, the sum layers that sum over $s_2$ and $s_{D-1}$ are parameterized by matrices of ones. Each $\bm{\mathrm{V}}_j$ with $1 \leq j \leq D$ is instead encoded by an input layer depending on the variable $X_j$ and computing $k$ functions $f_l(X_j)$ such that $f_l(x_j) = \bm{\mathrm{V}}_j[x_j, l]$, with $1 \leq l \leq r$ if $j = 1$ and $1 \leq l \leq k$ if $j > 1$. The tensorized circuit constructed in this way is structured-decomposable, as it is defined on a linear tree RG (e.g., Fig. 2) induced by the same variable ordering implicitly stated by the MPS factorization (from left to right in Eq. 7, see Sec. B.4 for details). Fig. B.2 shows the circuit representation corresponding to the MPS reported in Fig. 1(c).
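The layer-wise evaluation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `evaluate_mps_circuit` and `reference_value`, the random parameters, and the sizes `m`, `r`, `k` are hypothetical, and the reference check is written only for $D = 4$.

```python
import itertools
import numpy as np


def evaluate_mps_circuit(V, W, x):
    """Evaluate the factorized tensor entry T[x] layer by layer.

    V: list of D input matrices (V[0] is m x r, the others are m x k).
    W: list of D-2 sum-layer matrices (W[0] is r x k, the others are k x k).
    x: tuple of D integer assignments, one per variable.
    """
    D = len(V)
    # Innermost Hadamard product layer: V_{D-1}[x_{D-1}, :] * V_D[x_D, :].
    h = V[D - 2][x[D - 2]] * V[D - 1][x[D - 1]]
    for j in range(D - 3, -1, -1):
        h = W[j] @ h          # sum layer parameterized by W_{j+1}
        h = V[j][x[j]] * h    # Hadamard product with the input layer of X_{j+1}
    return h.sum()            # output sum, i.e. a row vector of ones


def reference_value(V, W, x):
    """Direct evaluation of the nested sums, written for D = 4 only."""
    return np.einsum(
        "a,ab,b,bc,c,c->",
        V[0][x[0]], W[0], V[1][x[1]], W[1], V[2][x[2]], V[3][x[3]],
    )


# Hypothetical sizes: m values per variable, ranks r and k, D = 4 variables.
rng = np.random.default_rng(0)
m, r, k, D = 2, 3, 4, 4
V = [rng.normal(size=(m, r))] + [rng.normal(size=(m, k)) for _ in range(D - 1)]
W = [rng.normal(size=(r, k))] + [rng.normal(size=(k, k)) for _ in range(D - 3)]

for x in itertools.product(range(m), repeat=D):
    assert np.isclose(evaluate_mps_circuit(V, W, x), reference_value(V, W, x))
```

Each evaluation costs a handful of matrix-vector products, which reflects the linear-tree structure of the circuit; the parameters here may be negative, as in the non-monotonic case.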

Finally, note that the number of parameters of such a tensorized circuit corresponds to the size of all $\{\bm{\mathrm{W}}_j\}_{j=1}^{D-2}$ and $\{\bm{\mathrm{V}}_j\}_{j=1}^{D}$ introduced above, i.e., overall $\mathcal{O}(D \cdot k^2)$ with $k \leq \min\{r^2, mr\}$. Moreover, the CP decompositions at the beginning can be computed using iterative methods whose iterations require polynomial time (Kolda & Bader, 2009). To retrieve an equivalent BM, we can square the circuit constructed in this way using Alg. 1, which results in a circuit having size $\mathcal{O}(D \cdot k^4)$ (see Prop. B.1). A similar proof can be carried out to show a reduction of other tensor network structures that can be squared efficiently, such as tree-shaped networks (Cheng et al., 2019).

Figure B.1: Further decomposing a matrix product state (MPS) via CP decompositions. Tensor networks are represented here using the Penrose graphical notation, where circles denote tensors and their connections denote summations over shared indices, and with variables $X_1, X_2, X_3, X_4$ denoting input indices. Given a MPS (a), we perform a CP decomposition of $\bm{\mathrm{A}}_2$ and $\bm{\mathrm{A}}_3$ (b). Red edges denote additional indices given by the CP decompositions. Then, we rename $\bm{\mathrm{A}}_1$ as $\bm{\mathrm{V}}_1$ and $\bm{\mathrm{B}}_2$ as $\bm{\mathrm{W}}_1$. Finally, we contract $\bm{\mathrm{C}}_2$ with $\bm{\mathrm{B}}_3$, and $\bm{\mathrm{C}}_3$ with $\bm{\mathrm{A}}_4$, resulting in tensors $\bm{\mathrm{W}}_2$ and $\bm{\mathrm{V}}_4$, respectively (c). Fig. B.2 shows the tensorized circuit corresponding to this tensor network, where $\bm{\mathrm{V}}_1, \bm{\mathrm{V}}_2, \bm{\mathrm{V}}_3, \bm{\mathrm{V}}_4$ and $\bm{\mathrm{W}}_1, \bm{\mathrm{W}}_2$ parameterize input layers and sum layers, respectively.
Figure B.2: Matrix product states (MPS) as structured-decomposable circuits. The decomposed MPS over three variables shown in Fig. 1(c) can be immediately represented as a tensorized structured-decomposable circuit (a) defined on a linear tree RG (b, matching the colors of layers) having Hadamard product layers and sum layers parameterized by $\bm{\mathrm{W}}_1, \bm{\mathrm{W}}_2$ and a row vector of ones $\mathbf{1}$ (for the output sum). Each input layer maps variable assignments $x_1, x_2, x_3, x_4$ to rows in $\bm{\mathrm{V}}_1, \bm{\mathrm{V}}_2, \bm{\mathrm{V}}_3, \bm{\mathrm{V}}_4$, respectively.

B.4.1 Relationship with Hidden Markov Models

MPS tensor networks where each tensor $\bm{\mathrm{A}}_i$ is non-negative can be seen as inhomogeneous hidden Markov models (HMMs), as shown by Glasser et al. (2019), i.e., where latent state and emitting transitions do not necessarily share parameters. As such, the tensorized structured-decomposable circuit $c$ that is equivalent to a MPS (see Sec. B.4) is also an inhomogeneous HMM if $c$ is monotonic.

In Sec. 5 we experiment with a tensorized monotonic PC that is an inhomogeneous HMM to distill a large language model, so as to leverage the sequential structure of the sentences. We compare it against a NPC$^2$ that is the squaring of a MPS (also called Born machine (Glasser et al., 2019)) or, equivalently, the squaring of an inhomogeneous HMM-like model whose parameters can be negative.

B.5 Exponential Separation

Theorem 1.

There is a class of non-negative functions $\mathcal{F}$ over variables $\bm{\mathrm{X}}$ that can be compactly represented as shallow squared NMMs (and hence squared non-monotonic PCs) but for which the smallest structured-decomposable monotonic PC computing any $F \in \mathcal{F}$ has size $2^{\Omega(|\bm{\mathrm{X}}|)}$.

Proof.

For the proof of Thm. 1, we start by constructing $\mathcal{F}$ from a variant of the unique disjointness (UDISJ) problem, which seems to have first been introduced by De Wolf (2003). The variant we consider here is defined over graphs, as detailed in the following definition.

Definition B.1 (Unique disjointness function).

Consider an undirected graph $G = (V, E)$, where $V$ denotes its vertices and $E$ its edges. To every vertex $v \in V$ we associate a Boolean variable $X_v$ and let $\bm{\mathrm{X}}_V = \{X_v \mid v \in V\}$ be the set of all these variables. The unique disjointness function of $G$ is defined as

$$\mathsf{UDISJ}_G(\bm{\mathrm{X}}_V) := \left(1 - \sum_{uv \in E} X_u X_v\right)^2. \qquad (8)$$

The UDISJ function as a non-monotonic circuit. We will construct $\mathcal{F}$ as the class of functions $\mathsf{UDISJ}_G$ for graphs $G \in \mathcal{G}$, where $\mathcal{G}$ is a family of graphs that we will choose later. Regardless of the way the class $\mathcal{G}$ is picked, we can compactly represent $\mathsf{UDISJ}_G$ as a squared structured-decomposable (Def. A.3) and non-monotonic circuit as follows. First, we represent the function $c(\bm{\mathrm{X}}_V) = 1 - \sum_{uv \in E} X_u X_v$ as a sum unit computing $1 \cdot a(\bm{\mathrm{X}}_V) + (-1) \cdot b(\bm{\mathrm{X}}_V)$ where

  • $a$ is a circuit gadget that realizes an unnormalized uniform distribution over the domain of variables in $\bm{\mathrm{X}}_V$, i.e., $a(\bm{\mathrm{X}}_V) = \prod_{v \in V} (\mathbb{1}\{X_v = 0\} + \mathbb{1}\{X_v = 1\})$, where $\mathbb{1}\{X_v = 0\}$ (resp. $\mathbb{1}\{X_v = 1\}$) is an indicator function that outputs 1 when $X_v$ is set to 0 (resp. 1);

  • $b$ is another sum unit whose inputs are product units over the input units $\mathbb{1}\{X_u = 1\}, \mathbb{1}\{X_v = 1\}$ if there is an edge $uv$ in $G$, i.e., $b(\bm{\mathrm{X}}_V) = \sum_{uv \in E} \mathbb{1}\{X_u = 1\} \cdot \mathbb{1}\{X_v = 1\}$.

Note that $b$ may not be smooth, but we can easily smooth it by adding to every product an additional input that is a circuit similar to $a$ that outputs 1 for any input $\bm{\mathrm{X}}_{\overline{uv}}$, where $\bm{\mathrm{X}}_{\overline{uv}} = \bm{\mathrm{X}}_V \setminus \{X_u, X_v\}$. Since $c$ is structured-decomposable (Def. A.3), we can easily multiply it with itself to realize $c^2$, which is still a structured-decomposable circuit whose size is polynomially bounded as $|c^2| \in \mathcal{O}(|c|^2)$ (Vergari et al., 2021). Note that in this case $|c|$ is polynomial in the number of variables (or vertices) $|\bm{\mathrm{X}}_V|$ by the construction above. Furthermore, note that $c^2$ is non-monotonic, as one of its sum units has negative parameters (i.e., $-1$) to encode the subtraction in Eq. 8.
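The construction above can be checked numerically on a small graph. The following is a minimal sketch, assuming a hypothetical path graph on four vertices and plain Python functions in place of circuit units; the names `gadget_a`, `gadget_b` and `circuit_c` are illustrative only.

```python
import itertools


def udisj(edges, x):
    """Eq. 8 evaluated directly on a Boolean assignment x (tuple of 0/1)."""
    return (1 - sum(x[u] * x[v] for u, v in edges)) ** 2


def indicator(value, target):
    return 1 if value == target else 0


def gadget_a(vertices, x):
    """Unnormalized uniform gadget: a product of (1{X_v=0} + 1{X_v=1}) units."""
    out = 1
    for v in vertices:
        out *= indicator(x[v], 0) + indicator(x[v], 1)
    return out


def gadget_b(vertices, edges, x):
    """Sum over edges of 1{X_u=1} * 1{X_v=1}, smoothed over the other variables."""
    total = 0
    for u, v in edges:
        term = indicator(x[u], 1) * indicator(x[v], 1)
        # Smoothing: multiply by a gadget that outputs 1 on all remaining variables.
        term *= gadget_a([w for w in vertices if w not in (u, v)], x)
        total += term
    return total


def circuit_c(vertices, edges, x):
    """Non-monotonic sum unit with weights +1 and -1."""
    return 1 * gadget_a(vertices, x) + (-1) * gadget_b(vertices, edges, x)


# Hypothetical example graph: a path on four vertices.
vertices = [0, 1, 2, 3]
edges = [(0, 1), (1, 2), (2, 3)]

for x in itertools.product([0, 1], repeat=len(vertices)):
    assert circuit_c(vertices, edges, x) ** 2 == udisj(edges, x)
```

On this graph, $c$ itself takes a negative value at the all-ones assignment, while its square is non-negative everywhere, which is the point of squaring.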

The lower bound for monotonic circuits. To prove the exponential lower bound for monotonic circuits in Thm. 1, we will use an approach that has been used in several other works (Martens & Medabalimi, 2014; de Colnet & Mengel, 2021). This approach is based on representing a decomposable circuit (and hence a structured-decomposable one) as a shallow mixture whose components are balanced products, as formalized next.

Definition B.2 (Balanced decomposable product).

Let $\bm{\mathrm{X}}$ be a set of variables. A balanced decomposable product over $\bm{\mathrm{X}}$ is a function from $\bm{\mathrm{X}}$ to $\mathbb{R}$ that can be written as $f(\bm{\mathrm{Y}}) \times h(\bm{\mathrm{Z}})$, where $f$ and $h$ are functions to $\mathbb{R}$, $\times$ is an alias for the product (when used to multiply functions from now on), and $(\bm{\mathrm{Y}}, \bm{\mathrm{Z}})$ is a balanced partitioning of $\bm{\mathrm{X}}$, i.e., $\bm{\mathrm{Y}} \cup \bm{\mathrm{Z}} = \bm{\mathrm{X}}$, $\bm{\mathrm{Y}} \cap \bm{\mathrm{Z}} = \varnothing$ with $|\bm{\mathrm{X}}|/3 \leq |\bm{\mathrm{Y}}|, |\bm{\mathrm{Z}}| \leq 2|\bm{\mathrm{X}}|/3$.
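The balance condition in Def. B.2 is easy to state as a predicate. The following is a small sketch; the helper name `is_balanced_partition` is hypothetical and variables are represented simply as hashable labels.

```python
def is_balanced_partition(X, Y, Z):
    """Check that (Y, Z) partitions X with |X|/3 <= |Y|, |Z| <= 2|X|/3."""
    X, Y, Z = set(X), set(Y), set(Z)
    covers = (Y | Z) == X
    disjoint = not (Y & Z)
    # Integer arithmetic avoids floating-point comparisons on |X|/3.
    balanced = all(
        len(X) <= 3 * len(S) and 3 * len(S) <= 2 * len(X) for S in (Y, Z)
    )
    return covers and disjoint and balanced


# A matching on 6 vertices split into its two sides is balanced.
print(is_balanced_partition(range(6), {0, 1, 2}, {3, 4, 5}))
```

For a matching of size $n$ whose edges each have one endpoint in $\bm{\mathrm{Y}}$ and one in $\bm{\mathrm{Z}}$, both parts have size $n = |\bm{\mathrm{X}}|/2$, so the partition satisfies the condition.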

Theorem B.1 (Martens & Medabalimi (2014)).

Let $F$ be a non-negative function over Boolean variables $\bm{\mathrm{X}}$ computed by a smooth and decomposable circuit $c$. Then, $F$ can be written as a sum of $N$ balanced decomposable products (Def. B.2) over $\bm{\mathrm{X}}$, with $N \leq |c|$, in the form$^4$

$$F(\bm{\mathrm{X}}) = \sum_{k=1}^{N} f_k(\bm{\mathrm{Y}}_k) \times h_k(\bm{\mathrm{Z}}_k),$$

where $(\bm{\mathrm{Y}}_k, \bm{\mathrm{Z}}_k)$ is a balanced partitioning of $\bm{\mathrm{X}}$ for $1 \leq k \leq N$. If $c$ is structured-decomposable, the $N$ partitions $\{(\bm{\mathrm{Y}}_k, \bm{\mathrm{Z}}_k)\}_{k=1}^{N}$ are all identical. Moreover, if $c$ is monotonic, then all $f_k, h_k$ only compute non-negative values.

$^4$ In Martens & Medabalimi (2014), Theorem 38, this result is stated with $N \leq |c|^2$. The square arises because they first reduce their circuits so that every inner unit has exactly two inputs, which we already assume, following de Colnet & Mengel (2021).

Intuitively, Thm. B.1 tells us that to lower bound the size of $c$ we can lower bound $N$. To this end, we first encode the UDISJ function (Eq. 8) as a sum of $N$ balanced products and show the exponential growth of $N$ for a particular family of graphs. We start with a special case of such a representation in the following proposition.

Proposition B.2.

Let $G_n$ be a matching of size $n$, i.e., a graph consisting of $n$ edges none of which share any vertices. Assume that the UDISJ function (Eq. 8) for $G_n$ is written as a sum of products of balanced partitions

$$\mathsf{UDISJ}_{G_n}(\bm{\mathrm{Y}}, \bm{\mathrm{Z}}) = \sum_{k=1}^{N} f_k(\bm{\mathrm{Y}}) \times h_k(\bm{\mathrm{Z}}),$$

where for every edge $uv$ in $G_n$ we have that $X_u \in \bm{\mathrm{Y}}$ and $X_v \in \bm{\mathrm{Z}}$, and $f_k, h_k$ are non-negative functions. Then $N = 2^{\Omega(n)}$.

To prove the above result, we will make an argument on the rank of the so-called communication matrix, also known as the value matrix, for a function $F$ and a fixed partition $(\bm{\mathrm{Y}}, \bm{\mathrm{Z}})$.

Definition B.3 (Communication matrix, or value matrix (de Colnet & Mengel, 2021)).

Let $F$ be a function over $(\bm{\mathrm{Y}}, \bm{\mathrm{Z}})$. Its communication matrix $M_F$ is a $2^{|\bm{\mathrm{Y}}|} \times 2^{|\bm{\mathrm{Z}}|}$ matrix whose rows (resp. columns) are uniquely indexed by assignments to $\bm{\mathrm{Y}}$ (resp. $\bm{\mathrm{Z}}$) such that, for a pair of indices$^5$ $(i_{\bm{\mathrm{Y}}}, j_{\bm{\mathrm{Z}}})$, the entry at row $i_{\bm{\mathrm{Y}}}$ and column $j_{\bm{\mathrm{Z}}}$ in $M_F$ is $F(i_{\bm{\mathrm{Y}}}, j_{\bm{\mathrm{Z}}})$.

$^5$ An index $i_{\bm{\mathrm{Y}}}$ (resp. $j_{\bm{\mathrm{Z}}}$) is a complete assignment to the Boolean variables in $\bm{\mathrm{Y}}$ (resp. $\bm{\mathrm{Z}}$). See Example 1.

Example 1.

Let us consider a simple matching on 6 vertices, where $\bm{\mathrm{Y}}$ corresponds to the first 3 vertices and $\bm{\mathrm{Z}}$ to the last 3, and where the first, second and third vertices of $\bm{\mathrm{Y}}$ are joined by an edge to the first, second and third vertices of $\bm{\mathrm{Z}}$, respectively. The matrix $M_F$ is an 8-by-8 matrix, with a row and a column for each assignment of the 3 binary variables on each side; it is given by

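$$\begin{array}{c|cccccccc}
\bm{\mathrm{Y}} \backslash \bm{\mathrm{Z}} & 000 & 100 & 010 & 001 & 110 & 101 & 011 & 111 \\
\hline
000 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\
100 & 1 & 0 & 1 & 1 & 0 & 0 & 1 & 0 \\
010 & 1 & 1 & 0 & 1 & 0 & 1 & 0 & 0 \\
001 & 1 & 1 & 1 & 0 & 1 & 0 & 0 & 0 \\
110 & 1 & 0 & 0 & 1 & 1 & 0 & 0 & 1 \\
101 & 1 & 0 & 1 & 0 & 0 & 1 & 0 & 1 \\
011 & 1 & 1 & 0 & 0 & 0 & 0 & 1 & 1 \\
111 & 1 & 0 & 0 & 0 & 1 & 1 & 1 & 4
\end{array}$$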
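The matrix of Example 1 can be reproduced directly from Eq. 8. The sketch below assumes the row and column ordering used in the example; the function name `communication_matrix` and the constant `ORDER` are illustrative only.

```python
import numpy as np

# Row/column ordering of the assignments, as in Example 1.
ORDER = ["000", "100", "010", "001", "110", "101", "011", "111"]


def communication_matrix(order):
    """Entry (i, j) is (1 - <y, z>)^2, with y and z the i-th and j-th assignments."""
    assignments = [tuple(int(bit) for bit in s) for s in order]
    size = len(assignments)
    M = np.zeros((size, size), dtype=int)
    for i, y in enumerate(assignments):
        for j, z in enumerate(assignments):
            # In the matching, the l-th vertex of Y is paired with the l-th vertex of Z.
            M[i, j] = (1 - sum(yl * zl for yl, zl in zip(y, z))) ** 2
    return M


M_F = communication_matrix(ORDER)
print(M_F)
```

The zeros of the printed matrix are exactly the pairs of assignments for which a single matched pair of vertices is set to 1 on both sides.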

Note that the name UDISJ comes from the fact that $M_F(i, j) = 0$ if and only if the assignments to $\bm{\mathrm{Y}}$ and $\bm{\mathrm{Z}}$ indexing row $i$ and column $j$ share exactly one position that is set to 1 in both. In the following, we will rely on the following quantity.

Definition B.4 (Non-negative rank).

The non-negative rank of a non-negative matrix $A \in \mathbb{R}_+^{m \times n}$, denoted $\mathsf{rank}_+(A)$, is the smallest $k$ such that there exist $k$ non-negative rank-one matrices $\{A_i\}_{i=1}^{k}$ such that $A = \sum_{i=1}^{k} A_i$. Equivalently, it is the smallest $k$ such that there exist two non-negative matrices $B \in \mathbb{R}_+^{m \times k}$ and $C \in \mathbb{R}_+^{k \times n}$ such that $A = BC$.

Given a function $F$ written as a sum over $N$ decomposable products (see Thm. B.1) over a fixed partition $(\bm{\mathrm{Y}}, \bm{\mathrm{Z}})$, we now show that the non-negative rank of its communication matrix $M_F$ (Def. B.3) is a lower bound on $N$.

Lemma B.1.

Let $F(\bm{\mathrm{X}}) = \sum_{k=1}^{N} f_k(\bm{\mathrm{Y}}) \times h_k(\bm{\mathrm{Z}})$, where $f_k$ and $h_k$ are non-negative functions, and let $M_F$ be the communication matrix (Def. B.3) of $F$ for the partition $(\bm{\mathrm{Y}}, \bm{\mathrm{Z}})$. Then it holds that

$$\mathsf{rank}_+(M_F) \leq N.$$
Proof.

This proof is an easy extension of the proof of Lemma 13 from de Colnet & Mengel (2021). Assume w.l.o.g. that $f_k(\mathbf{Y}) \times h_k(\mathbf{Z}) \neq 0$ for any complete assignment to $\mathbf{Y}$ and $\mathbf{Z}$. (Footnote 6: If this were not the case we could simply drop the term from the summation, which would clearly reduce the number of summands.) Let $M_k$ denote the communication matrix of the function $f_k(\mathbf{Y}) \times h_k(\mathbf{Z})$. By construction, we have that $M_F = \sum_{k=1}^{N} M_k$. Furthermore, since all values in $M_F$ are non-negative by definition, $\mathsf{rank}_+(M_k)$ is defined for all $k$ and by sub-additivity of the non-negative rank we have that $\mathsf{rank}_+(M_F) \leq \sum_{k=1}^{N} \mathsf{rank}_+(M_k)$. To conclude the proof, it is sufficient to show that the $M_k$ are rank-1 matrices, i.e., $\mathsf{rank}_+(M_k) = 1$. To this end, consider an arbitrary $k$. Since $f_k(\mathbf{Y}) \times h_k(\mathbf{Z}) \neq 0$, there is a row in $M_k$ that is not a row of zeros. Say it is indexed by $i_{\mathbf{Y}}$; then its entries are of the form $f_k(i_{\mathbf{Y}}) \times h_k(j_{\mathbf{Z}})$ for varying $j_{\mathbf{Z}}$. In any other row indexed by $i'_{\mathbf{Y}}$ we have $f_k(i'_{\mathbf{Y}}) \times h_k(j_{\mathbf{Z}}) = (f_k(i'_{\mathbf{Y}}) / f_k(i_{\mathbf{Y}})) \times f_k(i_{\mathbf{Y}}) \times h_k(j_{\mathbf{Z}})$ for varying $j_{\mathbf{Z}}$. Consequently, all rows are non-negative multiples of the $i_{\mathbf{Y}}$ row, and therefore $\mathsf{rank}_+(M_k) = 1$. ∎
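The rank-1 argument can be checked numerically on a toy instance. The following is a minimal sketch, assuming randomly drawn positive factor values; the sizes `N`, `rows` and `cols` are hypothetical and only serve the illustration. It computes the ordinary rank, which lower-bounds the non-negative rank.

```python
import numpy as np

rng = np.random.default_rng(0)
N, rows, cols = 3, 8, 8  # hypothetical: N summands, 8 complete assignments per side

# Non-negative values f_k(y) and h_k(z), one entry per complete assignment.
f = rng.uniform(0.1, 1.0, size=(N, rows))
h = rng.uniform(0.1, 1.0, size=(N, cols))

# The communication matrix of each product f_k(Y) * h_k(Z) is an outer product.
M_k = [np.outer(f[k], h[k]) for k in range(N)]
M_F = sum(M_k)

ranks_k = [int(np.linalg.matrix_rank(M)) for M in M_k]
rank_F = int(np.linalg.matrix_rank(M_F))
print(ranks_k, rank_F)
```

Each summand has rank 1, so the rank of the sum cannot exceed the number of summands `N`.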

To complete the proof of Prop. B.2, we leverage a known lower bound on the non-negative rank of the communication matrix of the UDISJ problem. The interested reader can find more information on this result in the books by Roughgarden (2016) and Gillis (2020), and the references therein.

Theorem B.2 (Fiorini et al. (2015)).

Let $\mathsf{UDISJ}$ be a function defined as in Prop. B.2, and let $M_{\mathsf{UDISJ}}$ be its communication matrix over a partition $(\mathbf{Y}, \mathbf{Z})$. Then it holds that

$$(3/2)^{n} \leq \mathsf{rank}_+(M_{\mathsf{UDISJ}}).$$
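As a small numerical illustration, the sketch below builds the communication matrix of $(1 - \sum_i Y_i Z_i)^2$, the form the UDISJ function takes on a matching of size $n$, and prints its ordinary (real) rank next to $(3/2)^n$. The helper name `udisj_matching_matrix` is hypothetical. Note that the code computes the ordinary rank only, which is merely a lower bound on the non-negative rank that Thm. B.2 refers to; it does not compute the non-negative rank.

```python
import itertools
import numpy as np

def udisj_matching_matrix(n):
    """Matrix with entries (1 - <y, z>)^2 for all binary vectors y, z of length n."""
    A = np.array(list(itertools.product([0, 1], repeat=n)))  # all 2^n assignments
    return (1 - A @ A.T) ** 2

for n in range(1, 7):
    M = udisj_matching_matrix(n)
    rank = int(np.linalg.matrix_rank(M))
    print(f"n={n}  shape={M.shape}  rank={rank}  (3/2)^n={(3 / 2) ** n:.2f}")
```

Expanding the square over binary inputs shows that the ordinary rank is at most $1 + n + n(n-1)/2$, i.e., it grows only quadratically in $n$, whereas the bound on the non-negative rank grows exponentially.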

Using Thm. B.2 and Lem. B.1, we directly get Prop. B.2. So we have shown that, for a fixed partition of variables $(\mathbf{Y}, \mathbf{Z})$, every monotonic circuit $c$ encoding the UDISJ function (Eq. 8) of a matching of size $n$ has size $|c| \geq 2^{\Omega(n)}$. However, the smallest non-monotonic circuit encoding the same function has polynomial size in $n$ (see the construction of the UDISJ function as a circuit above). Now, to complete the proof for the exponential lower bound in 1, we need to find a function class $\mathcal{F}$ where this result holds for all possible partitions $(\mathbf{Y}, \mathbf{Z})$. Such a function class consists of UDISJ functions over a family of graphs, as detailed in the following proposition.

Proposition B.3.

There is a family of graphs $\mathcal{G}$ such that for every graph $G_n = (V_n, E_n) \in \mathcal{G}$ we have $|V_n| = |E_n| = \mathcal{O}(n)$, and any monotonic structured-decomposable circuit representation of $\mathsf{UDISJ}_{G_n}$ has size $2^{\Omega(n)}$.

Proof.

We prove it by constructing a class of so-called expander graphs, which we introduce next. We say that a graph $G = (V, E)$ has expansion $\varepsilon$ if, for every subset $V'$ of $V$ of size at most $|V|/2$, there are at least $\varepsilon |V'|$ edges from $V'$ to $V \setminus V'$ in $G$. It is well-known, see e.g. Hoory et al. (2006), that there are constants $\varepsilon > 0$ and $d \in \mathbb{N}$ and a family $(G_n)_{n \in \mathbb{N}}$ of graphs such that $G_n$ has at least $n$ vertices, expansion $\varepsilon$ and maximal degree $d$. We fix such a family of graphs in the remainder and denote by $V_n$ (resp. $E_n$) the vertex set (resp. the edge set) of $G_n$.
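To make the definition of expansion concrete, the following sketch computes it by brute force on a small graph. It enumerates all vertex subsets, so it is only usable for tiny examples; the function name `edge_expansion` and the choice of the 3-dimensional hypercube as test graph are assumptions made for illustration and are not part of the construction used in the proof.

```python
import itertools

def edge_expansion(vertices, edges):
    """Largest eps such that every subset V' with 1 <= |V'| <= |V|/2
    has at least eps * |V'| edges leaving it (brute force)."""
    best = float("inf")
    for size in range(1, len(vertices) // 2 + 1):
        for subset in itertools.combinations(vertices, size):
            s = set(subset)
            # An edge leaves V' when exactly one endpoint lies in V'.
            crossing = sum((u in s) != (v in s) for u, v in edges)
            best = min(best, crossing / size)
    return best

# Hypothetical example: the 3-dimensional hypercube (8 vertices, degree 3).
cube_vertices = list(range(8))
cube_edges = [(u, u ^ (1 << b)) for u in cube_vertices for b in range(3)
              if u < u ^ (1 << b)]
eps = edge_expansion(cube_vertices, cube_edges)
print(len(cube_edges), eps)
```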

Let $c$ be a monotonic structured-decomposable circuit of size $N$ computing $\mathsf{UDISJ}_{G_n}$. Then, by using Thm. B.1, we can write it as

$$\mathsf{UDISJ}_{G_n}(\mathbf{Y}, \mathbf{Z}) = \sum_{k=1}^{N} f_k(\mathbf{Y}) \times h_k(\mathbf{Z}) \tag{9}$$

where $(\mathbf{Y}, \mathbf{Z})$ is a balanced partition of $\mathbf{X}_V$. Let $V_{\mathbf{Y}} = \{v \in V_n \mid X_v \in \mathbf{Y}\}$ and $V_{\mathbf{Z}} = \{v \in V_n \mid X_v \in \mathbf{Z}\}$. Then $(V_{\mathbf{Y}}, V_{\mathbf{Z}})$ form a balanced partition of $V_n$. By the expansion of $G_n$, it follows that there are $\Omega(n)$ edges from vertices in $V_{\mathbf{Y}}$ to vertices in $V_{\mathbf{Z}}$. By greedily choosing some of those edges and using the bounded degree of $G_n$, we can construct an edge set $E_n'$ of size $\Omega(n)$ that is a matching between $\mathbf{Y}$ and $\mathbf{Z}$, i.e., all edges in $E_n'$ go from $\mathbf{Y}$ to $\mathbf{Z}$ and every vertex in $V_n$ is incident to at most one edge in $E_n'$. Let $V_n'$ be the set of endpoints in $E_n'$ and $\mathbf{X}_{V_n'} \subseteq \mathbf{X}_V$ be the variables associated to them. We construct a new circuit $c'$ from $c$ by substituting all input units for variables $X_v$ that are not in $\mathbf{X}_{V_n'}$ by $0$. Clearly, $|c'| \leq |c|$ and hence all the lower bounds for $|c'|$ are lower bounds for $|c|$. Let $\overline{\mathbf{Y}} = \mathbf{X}_{V_n'} \cap \mathbf{Y}$ and $\overline{\mathbf{Z}} = \mathbf{X}_{V_n'} \cap \mathbf{Z}$. By construction $c'$ computes the function

$$\mathsf{UDISJ}_{G_n'}(\overline{\mathbf{Y}}, \overline{\mathbf{Z}}) = \left(1 - \sum_{uv \in E_n'} X_u X_v\right)^2$$

which corresponds to solving the UDISJ problem over the graph $G_n' = (V_n', E_n')$. From Eq. 9 we recover that

$$\mathsf{UDISJ}_{G_n'}(\overline{\mathbf{Y}}, \overline{\mathbf{Z}}) = \sum_{k=1}^{N} f_k'(\overline{\mathbf{Y}}) \times h_k'(\overline{\mathbf{Z}}),$$

where $f_k'$ (resp. $h_k'$) are obtained from $f_k$ (resp. $h_k$) by setting all the variables not in $\mathbf{X}_{V_n'}$ to 0. Since $c'$ is monotonic by construction and $|E_n'| = \Omega(n)$, from Prop. B.2 it follows that $N = 2^{\Omega(n)}$. ∎
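The greedy selection of a matching among the edges crossing the partition, used in the proof above, can be sketched as follows. The function name `greedy_cut_matching` and the hypercube example are hypothetical. With maximal degree $d$, each selected edge shares an endpoint with at most $2(d-1)$ other crossing edges, so greedily selecting from $m$ crossing edges keeps at least $m/(2d-1)$ of them, which is still $\Omega(n)$ when $d$ is constant.

```python
def greedy_cut_matching(edges, side_y):
    """Greedily pick edges crossing the partition (side_y, rest)
    so that no vertex is an endpoint of two picked edges."""
    used, matching = set(), []
    for u, v in edges:
        crosses = (u in side_y) != (v in side_y)
        if crosses and u not in used and v not in used:
            matching.append((u, v))
            used.update((u, v))
    return matching

# Hypothetical example: 3-dimensional hypercube, split by the highest bit.
cube_edges = [(u, u ^ (1 << b)) for u in range(8) for b in range(3)
              if u < u ^ (1 << b)]
side_y = {0, 1, 2, 3}
matching = greedy_cut_matching(cube_edges, side_y)
print(matching)
```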

Prop. B.3 concludes the proof of 1, as we showed the existence of a family of graphs for which the smallest structured-decomposable monotonic circuit computing the UDISJ function over $n$ variables has size $2^{\Omega(n)}$. However, the smallest structured-decomposable non-monotonic circuit has size polynomial in $n$, and its construction has been detailed at the beginning of our proof. ∎

B.6 Squaring Deterministic Circuits

In Sec. 4.1 we argued that squaring any non-monotonic, smooth, decomposable (Def. A.2), and deterministic (Def. A.5) circuit yields a monotonic and deterministic PC. As a consequence, any function computed by an NPC2 that is deterministic can be computed by a monotonic and deterministic PC. Therefore, we are interested in squaring structured-decomposable circuits that are not deterministic. Below we formally prove Prop. 4.

Proposition 4.

Let $c$ be a smooth, decomposable and deterministic circuit over variables $\mathbf{X}$ possibly computing a negative function. Then, the squared circuit $c^2$ is monotonic and has the same structure (hence size) as $c$.

Proof.

The proof is by induction. Let $n \in c$ be a product unit that computes $c_n(\mathbf{Z}) = \prod_{i \in \mathsf{in}(n)} c_i(\mathbf{Z}_i)$, with $\mathbf{Z} \subseteq \mathbf{X}$ and $(\mathbf{Z}_1, \ldots, \mathbf{Z}_{|\mathsf{in}(n)|})$ forming a partitioning of $\mathbf{Z}$. Then its squaring computes $c_n^2(\mathbf{Z}) = \prod_{i \in \mathsf{in}(n)} c_i^2(\mathbf{Z}_i)$. Now consider a sum unit $n \in c$ that computes $c_n(\mathbf{Z}) = \sum_{i \in \mathsf{in}(n)} w_i c_i(\mathbf{Z})$ with $\mathbf{Z} \subseteq \mathbf{X}$ and $w_i \in \mathbb{R}$. Then its squaring computes $c_n^2(\mathbf{Z}) = \sum_{i \in \mathsf{in}(n)} \sum_{j \in \mathsf{in}(n)} w_i w_j c_i(\mathbf{Z}) c_j(\mathbf{Z})$. Since $c$ is deterministic (Def. A.5), for any $i, j$ with $i \neq j$ either $c_i(\mathbf{Z})$ or $c_j(\mathbf{Z})$ is zero for any assignment to $\mathbf{Z}$. Therefore, all cross terms with $i \neq j$ vanish and we have that

$$c_n^2(\mathbf{Z}) = \sum_{i \in \mathsf{in}(n)} w_i^2 c_i^2(\mathbf{Z}). \tag{10}$$

This implies that in deterministic circuits, squaring does not introduce additional components that encode (possibly negative) cross-products. The base case is defined on an input unit $n$ that models a function $f_n$, and hence its squaring is an input unit that models $f_n^2$. By induction $c^2$ is constructed from $c$ by squaring the parameters of sum units $w_i$ and squaring the functions $f_n$ modeled by input units. Moreover, the number of inputs of each sum unit remains the same, as we observe in Eq. 10, and thus $c^2$ and $c$ have the same size. ∎
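The vanishing of cross terms at a deterministic sum unit can be verified on a toy example. The sketch below assumes a single sum unit over a discrete variable with six states and three inputs with pairwise disjoint supports; all numeric values are made up for illustration.

```python
import numpy as np

# Hypothetical deterministic sum unit: each column (state) has at most
# one non-zero input, i.e., the inputs have pairwise disjoint supports.
c_in = np.array([
    [0.5, -1.0, 0.0, 0.0,  0.0, 0.0],
    [0.0,  0.0, 2.0, 0.3,  0.0, 0.0],
    [0.0,  0.0, 0.0, 0.0, -0.7, 1.5],
])
w = np.array([-2.0, 0.5, 3.0])  # real-valued, possibly negative weights

c_n = w @ c_in                          # c_n(z) = sum_i w_i c_i(z)
squared_direct = c_n ** 2               # (sum_i w_i c_i(z))^2
squared_eq10 = (w ** 2) @ (c_in ** 2)   # sum_i w_i^2 c_i(z)^2, no cross terms

print(np.allclose(squared_direct, squared_eq10))
```

The squared unit has the same three inputs as the original one, with non-negative weights `w ** 2`, although `c_n` itself takes negative values.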

Appendix C Efficient learning of NPC2s

Figure C.1: Evaluating the squared circuit representation adds little overhead during training. When learning by MLE (Eq. 4) with batched gradient descent, the time and space required to compute the partition function $Z$ of $c^2$ are constant w.r.t. the batch size (BS) (left). When fixing the batch size to 512 and varying the output dimensionality ($K$) of each layer (right), the resources needed to compute $Z$ are similar to the ones needed to evaluate $c$ (i.e., $c(\mathbf{X})$). For the left figure, we fix $K = 256$ and vary the BS, while for the right figure we fix $\text{BS} = 512$ and vary $K$. The plots share the y-axis.

In this section, we investigate the computational cost of learning NPC2s with a series of benchmarks, showing that NPC2s add little computational overhead over traditional monotonic PCs (MPCs).

Efficient renormalization in practice. As suggested by the MLE objective (Eq. 4), squaring the tensorized circuit $c$ with Alg. 1 is only required to compute the partition function $Z = \int c^2(\mathbf{x}) \, \mathrm{d}\mathbf{x}$. In addition, we need to compute $Z$ only once per parameter update via gradient ascent, as $Z$ does not depend on the training data. For these reasons, the increased computational burden of evaluating a squared circuit (see 1) in order to compute $Z$ is negligible, and it is independent of the batch size. Fig. C.1 illustrates this aspect by comparing the time needed to evaluate $c$ on a batch of data and to compute the partition function $Z$. The results shown in Fig. C.1 are obtained by running benchmarks on NPC2s that are similar in size to the ones we experiment with in Sec. 5. That is, we benchmark a mixture of 32 NPC2s, each having an architecture built from a randomly-generated tree RG (see App. F for details) approximating the density function of BSDS300 (the data set with the highest number of variables, see Table H.1). The input layers compute Gaussian distributions.
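The separation between the data-dependent term and the partition function can be illustrated on a much simpler model than a tensorized circuit. The following sketch assumes a one-dimensional squared mixture of Gaussians $c(x) = \sum_i w_i \mathcal{N}(x; \mu_i, \sigma_i^2)$ with possibly negative weights; it does not reproduce Alg. 1. It relies on the closed form $\int \mathcal{N}(x; \mu_i, \sigma_i^2)\, \mathcal{N}(x; \mu_j, \sigma_j^2)\, \mathrm{d}x = \mathcal{N}(\mu_i; \mu_j, \sigma_i^2 + \sigma_j^2)$. All function names and parameter values are hypothetical.

```python
import numpy as np

def gauss_pdf(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def partition_function(w, mu, var):
    """Z = int c(x)^2 dx = w^T G w, with G_ij = N(mu_i; mu_j, var_i + var_j)."""
    G = gauss_pdf(mu[:, None], mu[None, :], var[:, None] + var[None, :])
    return w @ G @ w

def nll(batch, w, mu, var):
    """Average negative log-likelihood of p(x) = c(x)^2 / Z on a batch."""
    c = gauss_pdf(batch[:, None], mu[None, :], var[None, :]) @ w  # depends on data
    log_Z = np.log(partition_function(w, mu, var))                # no data involved
    return -(2.0 * np.log(np.abs(c))).mean() + log_Z

w = np.array([1.0, -0.6])    # one subtractive component
mu = np.array([0.0, 1.0])
var = np.array([1.0, 0.5])
Z = partition_function(w, mu, var)

rng = np.random.default_rng(0)
losses = {bs: nll(rng.normal(size=bs), w, mu, var) for bs in (8, 512)}
print(Z, losses)
```

Only the evaluation of `c` scales with the batch size; `partition_function` is evaluated once per loss computation and its cost depends on the number of components alone.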

Training efficiency on UCI data sets. We benchmark the computational cost of learning NPC2s on UCI data sets (Table H.1). Fig. C.2 compares time and memory required to learn the best NPC2s and MPCs shown in Fig. 4, while Fig. C.3 compares time and memory required to learn them in a worse scenario for NPC2s where the batch size is small and the layer dimensionality is large, as NPC2s benefit from using large batch sizes as discussed above. NPC2s add very little overhead during training in most configurations when compared to MPCs, as computing the partition function $Z$ is comparable to evaluating MPCs on a batch of samples. In particular, on Gas ($|\mathbf{X}| = 8$), NPC2 takes more time and memory to compute $Z$ (times are $6\,\mathrm{ms}$ and $121\,\mathrm{ms}$, while memory allocations are $0.6\,\mathrm{GiB}$ and $5.8\,\mathrm{GiB}$), but it is only slightly more than the cost of computing $c$ for MPCs (time $144\,\mathrm{ms}$ and memory $4.4\,\mathrm{GiB}$). Moreover, note that NPC2s achieve about a $\times 2$ improvement on the log-likelihood on Gas. On the much higher dimensional data set BSDS300 ($|\mathbf{X}| = 63$) instead, we found that training NPC2 is even cheaper as it requires fewer parameters while still achieving a higher log-likelihood ($128.38$ rather than $123.3$).

Hardware and significance of benchmarks. The benchmarks mentioned above and illustrated in Figs. C.1, C.2 and C.3 have been run on a single NVIDIA RTX A6000 with 48GiB of memory. The measured times are averaged over 50 independent circuit evaluations.

Figure C.2: NPC2s add little overhead during training on real-world data sets, while improving log-likelihoods. We evaluate time and memory required by monotonic PCs (MPCs) and NPC2s to perform one optimization step on UCI data sets (Gas, Hepmass, MiniBooNE, BSDS300) with number of variables $|\mathbf{X}|$ and using the best hyperparameters found (see Sec. H.3). We benchmark the computation of $c(\mathbf{x})$ by MPCs and $c^2(\mathbf{x})$ by NPC2s on a batch $\mathbf{x}$ of data (left), as well as the partition functions $Z$ for both models (right), and label the data points with the final log-likelihoods achieved by the corresponding models (as also reported in Fig. 4). The plots share the y-axis. For NPC2s, computing the partition function $Z$ is more expensive both in time and memory (right), but it is still very similar to the cost of evaluating $c(\mathbf{x})$ or $c^2(\mathbf{x})$ (left).
Figure C.3: NPC2s add little overhead during training even with relatively small batch sizes. We evaluate time and memory required by monotonic PCs (MPCs) and NPC2s to perform one optimization step on UCI data sets (Gas, Hepmass, MiniBooNE, BSDS300) with respect to the number of variables $|\mathbf{X}|$ and using the same hyperparameters ($512$ as batch size, $512$ as layer dimensionality, and Gaussian input layers). The plots share the y-axis. The cost of computing $c^2(\mathbf{x})$ on a batch $\mathbf{x}$ of data by NPC2s is only slightly higher than the cost of computing $c(\mathbf{x})$ by MPCs (left), while the cost of computing $Z$ for NPC2s is comparable to evaluating $c^2(\mathbf{x})$ or $c(\mathbf{x})$ (right).

Appendix D The Signed Log-Sum-Exp Trick

Scaling squared non-monotonic PCs to more than a few tens (resp. hundreds) of variables without performing computations in log-space is infeasible in 32-bit (resp. 64-bit) floating point arithmetic, as we illustrate in Fig. D.1. For this reason, we must perform computations in the log-space even in the presence of negative values. The idea is to represent non-zero outputs $\mathbf{y} \in \mathbb{R}^{S}$ of each layer in terms of the element-wise logarithm of their absolute value $\log|\mathbf{y}|$ and their element-wise sign $\operatorname{sign}(\mathbf{y}) \in \{-1, 1\}^{S}$, i.e., such that $\mathbf{y} = \operatorname{sign}(\mathbf{y}) \odot \exp(\log|\mathbf{y}|)$.
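The underflow problem and the sign/log-magnitude representation can be seen on a toy product. The sketch below assumes 200 made-up unit outputs of magnitude $0.01$ with alternating signs; it is not taken from the benchmarks of Fig. D.1.

```python
import numpy as np

vals = np.full(200, 1e-2, dtype=np.float32)            # hypothetical magnitudes
signs = np.where(np.arange(200) % 2 == 0, 1.0, -1.0)   # alternating signs

# Direct product: the true value 1e-400 is not representable and underflows to 0.
direct = np.prod(vals * signs.astype(np.float32))

# Sign / log-magnitude representation: y = sign * exp(log_abs).
log_abs = np.sum(np.log(vals.astype(np.float64)))      # about 200 * log(0.01)
sign = np.prod(signs)                                   # 100 negative factors give +1

print(direct, sign, log_abs)
```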

Figure D.1: Squared non-monotonic PCs cannot scale without performing computations in log-space. Partition functions (and their natural logarithm) of squared non-monotonic PCs having Gaussian input units, with increasing number of variables $V$ and having depth $\lceil \log_2 V \rceil$, computed using 32-bit and 64-bit floating point arithmetic.

In practice, we evaluate product and sum layers according to the following evaluation rules. Given a Hadamard product layer $\boldsymbol{\ell}$, it computes and propagates both $\log|\boldsymbol{\ell}| = \sum_{i=1}^{N} \log|\boldsymbol{\ell}_i|$ and $\operatorname{sign}(\boldsymbol{\ell}) = \bigodot_{i=1}^{N} \operatorname{sign}(\boldsymbol{\ell}_i)$ for some inputs $\{\boldsymbol{\ell}_i\}_{i=1}^{N}$. Given a sum layer $\boldsymbol{\ell}$ parameterized by $\mathbf{W} \in \mathbb{R}^{S \times K}$ and having $\boldsymbol{\ell}'$ as input layer, it computes and propagates both $\log|\boldsymbol{\ell}| = \boldsymbol{\alpha} + \log|\mathbf{s}|$ and $\operatorname{sign}(\boldsymbol{\ell}) = \operatorname{sign}(\mathbf{s})$, where $\boldsymbol{\alpha}$ and $\mathbf{s}$ are defined as

$$\boldsymbol{\alpha} = \mathbf{1} \cdot \max_{1 \leq j \leq S} \{\log|\boldsymbol{\ell}'[j]|\} \qquad\qquad \mathbf{s} = \mathbf{W} \left( \operatorname{sign}(\boldsymbol{\ell}') \odot \exp(\log|\boldsymbol{\ell}'| - \boldsymbol{\alpha}) \right)$$

by assuming $\mathbf{s} \neq \mathbf{0}$, $\mathbf{1}$ denoting an $S$-dimensional vector of ones, $\boldsymbol{\ell}'[j]$ denoting the $j$-th entry of the output of $\boldsymbol{\ell}'$, and $\exp$ being applied element-wise. We call signed log-sum-exp trick the evaluation rule above for sum layers, which generalizes the log-sum-exp trick (Blanchard et al., 2021) that is used to evaluate tensorized monotonic PC architectures (Peharz et al., 2020a).

For the more general definition of tensorized circuits instead (Def. A.6), given a Kronecker product layer bold-ℓ\bm{\mathrm{\ell}}bold_ℓ, then it computes both log||=i=1Nlog|i|bold-ℓsuperscriptsubscriptdirect-sum𝑖1𝑁subscriptbold-ℓ𝑖\log|\bm{\mathrm{\ell}}|=\bigoplus_{i=1}^{N}\log|\bm{\mathrm{\ell}}_{i}|roman_log | bold_ℓ | = ⨁ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT roman_log | bold_ℓ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | and sign()=i=1Nsign(i)signbold-ℓsuperscriptsubscripttensor-product𝑖1𝑁signsubscriptbold-ℓ𝑖\operatorname{sign}(\bm{\mathrm{\ell}})=\bigotimes_{i=1}^{N}\operatorname{sign% }(\bm{\mathrm{\ell}}_{i})roman_sign ( bold_ℓ ) = ⨂ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT roman_sign ( bold_ℓ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ), where direct-sum\bigoplus denotes an operator similar to the Kronecker product but computing sums rather than products.
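The evaluation rules for Hadamard product and sum layers can be sketched in a few lines of code. The following is a minimal illustration using only the Python standard library, not the implementation used in the experiments; the function names `hadamard_product_layer` and `sum_layer` are hypothetical, and layers are represented as plain lists of log-magnitudes and signs.

```python
import math

def hadamard_product_layer(inputs):
    """inputs: list of (log_abs, sign) pairs, each holding K entries.

    Returns (log|l|, sign(l)) of the element-wise product of the inputs:
    log-magnitudes are summed and signs are multiplied.
    """
    K = len(inputs[0][0])
    log_abs = [sum(la[k] for la, _ in inputs) for k in range(K)]
    sign = [math.prod(sg[k] for _, sg in inputs) for k in range(K)]
    return log_abs, sign

def sum_layer(W, log_abs_in, sign_in):
    """Signed log-sum-exp for a sum layer with real-valued weights W.

    W is a list of rows, one per output unit; entries may be negative.
    """
    # alpha is the largest input log-magnitude, broadcast to all entries.
    alpha = max(log_abs_in)
    scaled = [sg * math.exp(la - alpha) for la, sg in zip(log_abs_in, sign_in)]
    s = [sum(w * z for w, z in zip(row, scaled)) for row in W]
    # As in the rule above, s is assumed to have no zero entries.
    log_abs_out = [alpha + math.log(abs(v)) for v in s]
    sign_out = [1.0 if v > 0 else -1.0 for v in s]
    return log_abs_out, sign_out
```

For instance, feeding the inputs $2$ and $-3$ (as log-magnitudes $\log 2, \log 3$ and signs $+1, -1$) to a sum layer with rows $(1, 1)$ and $(0.5, -1)$ yields the outputs $-1$ and $4$, again represented by log-magnitudes and signs. Because only differences to the maximum are exponentiated, the same code handles log-magnitudes such as $-1000$ that would underflow if exponentiated directly.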

Appendix E Splines as Expressive Input Components

Figure E.1: Splines represent a class of flexible non-linear functions. A quadratic ($k=2$) spline (in black) over $n=4$ knots chosen uniformly in $(0,1)$ (i.e., $0.2$, $0.4$, $0.6$ and $0.8$) is computed by a linear combination of $n+k+1=7$ distinct basis functions (each colored differently).

Polynomials defined on fixed intervals are candidate functions to be modeled by components (resp. input layers) of squared NMMs (Sec. 2) (resp. NPC2s, Sec. 3). This is because they can take negative values and their product can be tractably integrated. In particular, we experiment with piecewise polynomials, also called splines. A univariate spline function of order $k$ is a piecewise polynomial defined on a variable $X$, and the $n$ values of $X$ where polynomials meet are called knots. B-splines of order $k$ are basis functions for continuous spline functions of the same degree. In practice, we can represent any spline function $f$ of order $k$ defined over $n$ knots inside an interval $(a,b)$ as a linear combination of $n+k+1$ basis functions, i.e.,

$$f(X) = \sum\nolimits_{i=1}^{n+k+1} \alpha_i B_{i,k}(X) \qquad (11)$$

where $\alpha_i \in \mathbb{R}$ are the parameters of the spline and $B_{i,k}(X)$ are polynomials of order $k$ (i.e., the basis of $f$), which are uniquely determined by the choice of the $n$ knots. In particular, each $B_{i,k}(X)$ is a non-negative polynomial that is recursively defined with the Cox-de Boor formula (de Boor, 1971; Piegl & Tiller, 1995). Given two splines $f, g$ of order $k$ defined over $n$ knots and represented in terms of $n+k+1$ basis functions as in Eq. 11, we can write their product integral as

$$\int_a^b f(X) g(X) \, \mathrm{d}X = \sum\nolimits_{i=1}^{n+k+1} \sum\nolimits_{j=1}^{n+k+1} \alpha_i \beta_j \int_a^b B_{i,k}(X) B_{j,k}(X) \, \mathrm{d}X \qquad (12)$$

where $\alpha_i \in \mathbb{R}$ (resp. $\beta_j \in \mathbb{R}$) denote the parameters of $f$ (resp. $g$). Therefore, integrating a product of splines requires integrating products of their basis functions. Among the various ways of computing Eq. 12 exactly (Vermeulen et al., 1992), we can do it in time $\mathcal{O}(n^2 \cdot k^2)$ by representing the product $B_{i,k}(X) B_{j,k}(X)$ as the basis polynomial of another B-spline of order $2k+1$, and finally integrating it in the interval of definition. Fig. E.1 shows an example of a spline.

Since each $B_{i,k}$ is non-negative, we can use B-splines as components (resp. modeled by input layers) of traditional MMs (resp. monotonic PCs) by assuming each spline parameter $\alpha_i$ to be non-negative. This is the case for the monotonic PCs we experimented with in Sec. 5, where non-negativity is guaranteed via exponentiation of the parameters.
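To make Eqs. 11 and 12 concrete, the sketch below evaluates B-spline basis functions with the Cox-de Boor recursion and computes the integrals of their pairwise products. It is an illustration under stated assumptions rather than the method described above: instead of rewriting each product as a higher-order B-spline, it uses a 3-point Gauss-Legendre rule on every knot interval, which is exact only for $k \leq 2$. It also assumes a clamped knot vector, so that $n$ interior knots give $n+k+1$ basis functions. All function names are hypothetical.

```python
import math

def clamped_knots(interior, a, b, k):
    """Knot vector with (k + 1)-fold boundary knots on [a, b]."""
    return [a] * (k + 1) + list(interior) + [b] * (k + 1)

def basis(i, k, t, x):
    """Cox-de Boor recursion: value at x of the i-th basis function of degree k."""
    if k == 0:
        return 1.0 if t[i] <= x < t[i + 1] else 0.0
    value = 0.0
    if t[i + k] > t[i]:
        value += (x - t[i]) / (t[i + k] - t[i]) * basis(i, k - 1, t, x)
    if t[i + k + 1] > t[i + 1]:
        value += (t[i + k + 1] - x) / (t[i + k + 1] - t[i + 1]) * basis(i + 1, k - 1, t, x)
    return value

# 3-point Gauss-Legendre rule on [-1, 1]: exact for polynomials up to degree 5,
# hence for products of two polynomial pieces of degree at most 2.
GL_NODES = [-math.sqrt(3 / 5), 0.0, math.sqrt(3 / 5)]
GL_WEIGHTS = [5 / 9, 8 / 9, 5 / 9]

def gram_matrix(t, k):
    """G[i][j] = integral of B_i(x) * B_j(x) over the interval of definition."""
    m = len(t) - k - 1  # number of basis functions
    breaks = sorted(set(t))
    G = [[0.0] * m for _ in range(m)]
    for lo, hi in zip(breaks, breaks[1:]):
        mid, half = (lo + hi) / 2, (hi - lo) / 2
        for node, weight in zip(GL_NODES, GL_WEIGHTS):
            vals = [basis(i, k, t, mid + half * node) for i in range(m)]
            for i in range(m):
                for j in range(m):
                    G[i][j] += weight * half * vals[i] * vals[j]
    return G

def product_integral(alpha, beta, G):
    """Eq. 12: integral of f * g for splines with coefficients alpha and beta."""
    return sum(a * b * G[i][j] for i, a in enumerate(alpha) for j, b in enumerate(beta))
```

With the knots of Fig. E.1 (`clamped_knots([0.2, 0.4, 0.6, 0.8], 0.0, 1.0, 2)`), there are seven basis functions that sum to one inside the interval, so two splines with all coefficients equal to one have product integral $b - a = 1$. The matrix `G` depends only on the knots, so it can be computed once and reused for any pair of coefficient vectors, including those with negative entries.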

Appendix F Tree Region Graphs

Figure F.1: Different ways to construct region graphs. The left figure illustrates a linear tree (LT) region graph (Def. 2) over four variables, which decomposes variables one by one. The right figure shows a possible binary tree (BT) region graph over five variables, which recursively splits them.

Since we require structured-decomposability to square circuits (see Sec. 3.2), we construct their architecture based on tree RGs (Def. 2). We choose to experiment with two kinds of tree RGs: binary tree (BT) and linear tree (LT). Following Peharz et al. (2020b), the BT is built by recursively partitioning variables evenly and randomly until regions with only one variable are obtained. The LT is built by (1) shuffling the variables randomly and then (2) recursively partitioning variables one by one, i.e., a set of variables $\{X_i, \ldots, X_D\}$ is partitioned into $\{X_i\}$ and $\{X_{i+1}, \ldots, X_D\}$ for $1 \leq i \leq D-1$. Fig. F.1 shows examples of LT and BT RGs. Note that the LT is the same RG on which the circuit representations of matrix-product states (MPS) (Pérez-García et al., 2007) and TTDE (Novikov et al., 2021) depend (see also Sec. 4 and Sec. B.4).
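The two constructions can be sketched as follows. This is a minimal illustration of the description above, assuming a region graph can be represented as a list of partitions `(region, (left, right))`; the function names and this representation are hypothetical and not taken from the experimental code.

```python
import random

def linear_tree(variables, seed=0):
    """LT: shuffle the variables, then split them off one by one."""
    rng = random.Random(seed)
    order = list(variables)
    rng.shuffle(order)
    partitions = []
    for i in range(len(order) - 1):
        region = tuple(order[i:])
        partitions.append((region, (tuple(order[i:i + 1]), tuple(order[i + 1:]))))
    return partitions

def binary_tree(variables, seed=0):
    """BT: recursively split each region randomly into two halves of (almost) equal size."""
    rng = random.Random(seed)
    partitions = []

    def split(region):
        if len(region) == 1:
            return  # regions with a single variable are leaves
        region = list(region)
        rng.shuffle(region)
        half = len(region) // 2
        left, right = tuple(region[:half]), tuple(region[half:])
        partitions.append((tuple(region), (left, right)))
        split(left)
        split(right)

    split(list(variables))
    return partitions
```

Both functions return $D-1$ partitions for $D$ variables. Changing `seed` yields a different random tree over the same variables.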

Appendix G Additional Related Works

Squared neural families (SNEFYs) (Tsuchida et al., 2023) have been concurrently proposed as a class of models squaring the 2-norm of the output of a single-hidden-layer neural network. Under certain parametric conditions, SNEFYs can be re-normalized so as to model a density function, but they do not guarantee tractable marginalization of any subset of variables as our NPC2s do, unless they encode a fully-factorized distribution, which would limit their expressiveness. Hence, SNEFYs can be employed in our NPC2s to model multivariate units in input layers with bounded scopes.

The rich literature on PCs provides several algorithms to learn both the structure and the parameters of circuits (Poon & Domingos, 2011; Peharz et al., 2017; Di Mauro et al., 2021; Dang et al., 2021; Liu & Van den Broeck, 2021; Liu et al., 2023). However, in these works circuits are always assumed to be monotonic. A first work considering subtractions is Dennis (2016), which generalizes the ad-hoc constraints over Gaussian NMMs (Zhang & Zhang, 2005) to deep PCs over Gaussian inputs by constraining their structure and reparameterizing their sum weights. Shallow NMMs represented as squared circuits have been investigated for low-dimensional categorical distributions in Loconte et al. (2023). Concurrently, Sladek et al. (2023) investigated interleaving PSD models and PCs to represent deep NMMs in low-dimensional settings. The resulting model can be interpreted as a sum of squared circuits. Circuit representations encoding probability generating functions allow negative coefficients in symbolic computational graphs (Zhang et al., 2021); differently from them, we encode probability densities and masses. Non-monotonic PCs have been recently proven to be able to compactly represent determinantal point processes, for which no compact monotonic circuit representation exists (Broadrick et al., 2024).

The relationship between tensor networks and PCs has been hinted at by Glasser et al. (2019) and by Novikov et al. (2021) for matrix-product states. However, they neither provided a formal reduction nor highlighted the structural properties needed to tractably perform the squaring and marginalize variables. Glasser et al. (2019) showed a number of bounds over the ranks of matrix-product states and Born machines; however, they do not provide a separation between NPC2s and monotonic PCs as we do, nor do their results generalize to any region graph, including tree-shaped networks, as our 1 does for structured-decomposable monotonic PCs. Born machines and squared tree-shaped tensor networks have been explored for distribution estimation (Han et al., 2018; Cheng et al., 2019) but, differently from our NPC2s, they were not equipped with non-linearities at the inputs and their evaluation was limited to small-scale binary data. Bailly (2011) proposed squared probabilistic automata that are similar to Born machines but support inputs of any length by sharing the same parameter tensors across steps. By applying the construction used to show 3, we can represent such models as NPC2s where the parameters of sum and input layers are shared. Jaini et al. (2018) draw a connection between monotonic PCs, latent tree models (Choi et al., 2011) and hierarchical tensor mixture models (Hackbusch, 2012) with non-negative parameters, showing an exponential separation between shallow and deep monotonic circuits.

Appendix H Experimental Settings and Complementary Results

H.1 Continuous Synthetic Data

Following Wenliang et al. (2019), we experiment with monotonic PCs, their squaring and NPC2s on synthetic continuous 2D data sets, named rings, cosine, funnel and banana. We generate each synthetic data set by sampling 10,000/1,000/2,000 training/validation/test samples. In these experiments, we are interested in studying whether NPC2s can be more expressive in practice, without making assumptions on the data distribution and thus without choosing parametric distributions as components. For this reason, we choose components computing the product of univariate spline functions (see App. E) over 32 knots that are uniformly chosen in the data domain. In particular, for monotonic mixtures we restrict the spline coefficients to be non-negative.

Learning and hyperparameters. Since the data is bivariate, the tree on which PCs are defined consists of just one region that is split in half. All models are learned by batched stochastic gradient descent using the Adam optimizer with default learning rate (Kingma & Ba, 2015) and a batch size of 256. The parameters of all mixtures are initialized by sampling uniformly between 0 and 1. Furthermore, monotonicity in (squared) PCs is ensured by exponentiating the parameters.

Fig. 3 shows the density functions estimated from data sets rings and cosine, when using 8 and 12 components, respectively. Moreover, Fig. H.1 reports the log-likelihoods and other density functions learned from data sets funnel and banana, when using 4 components.

[Figure H.1 panels: (a) GT, (b) MPC, (c) MPC2, (d) NPC2, (e) GT, (f) MPC, (g) MPC2, (h) NPC2, followed by log-likelihood plots.]
Figure H.1: Negative parameters increase the expressiveness of NPC2s. From left to right (above) and for each bivariate density, we show the ground truth (GT) and its estimation by a monotonic PC (MPC), a squared monotonic PC (MPC2), and an NPC2 having input layers computing quadratic splines (App. E) and with the same number of parameters. Moreover, (below) we show the average log-likelihoods (and one standard deviation with 10 independent runs) on unseen data achieved by an MPC, an MPC2, and an NPC2 by increasing the dimensionality of input layers $K$.

H.2 Discrete Synthetic Data

For our experiments investigating the flexibility of input layers of NPC2s (Sec. 2) in case of discrete data (Sec. 5), we quantize the bivariate continuous synthetic data sets reported in Sec. H.1. That is, we discretize both continuous variables using 32 uniform bins each. The resulting target distribution is therefore a probability mass function over two finitely discrete variables.
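The quantization step can be illustrated as follows. This is a minimal sketch assuming each variable is bounded by known limits `low` and `high`; the function names are hypothetical, and the empirical probability mass function is included only to show what the discretized target looks like.

```python
def quantize(value, low, high, bins=32):
    """Map a value in [low, high] to a bin index in {0, ..., bins - 1}."""
    idx = int((value - low) / (high - low) * bins)
    # Clamp so that value == high falls in the last bin.
    return min(max(idx, 0), bins - 1)

def empirical_pmf(points, bounds, bins=32):
    """Empirical probability mass function of quantized 2D points."""
    (x_lo, x_hi), (y_lo, y_hi) = bounds
    counts = [[0] * bins for _ in range(bins)]
    for x, y in points:
        counts[quantize(x, x_lo, x_hi, bins)][quantize(y, y_lo, y_hi, bins)] += 1
    total = len(points)
    return [[c / total for c in row] for row in counts]
```

Each data point is thus replaced by a pair of integers in $\{0, \ldots, 31\}$, on which categorical, embedding or Binomial input layers can be defined.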

We experiment with monotonic PCs, their squaring and NPC2s with two families of input layers. First, we investigate very flexible input layers for finitely discrete data: categoricals for monotonic PCs and embeddings for NPC2s. Second, we experiment with the less flexible but more parameter-efficient Binomials. The learning and hyperparameter settings are the same as those used for the continuous data (see Sec. H.1). Fig. H.2 shows that there is little advantage in subtracting probability mass with respect to monotonic PCs having categorical components. However, in case of the less flexible Binomial components, NPC2s capture the target distribution significantly better. This is also confirmed by the log-likelihoods on unseen data, which we show in Fig. H.2.

[Figure H.2 panels: (a) GT; with categorical (or embedding) input layers: (b) MPC, (c) MPC2, (d) NPC2; with Binomial input layers: (e) MPC, (f) MPC2, (g) NPC2; (h) mixtures with categorical or embedding components; (i) mixtures with Binomial components.]
Figure H.2: Negative parameters increase the expressiveness of NPC2s. From left to right (above) and for each bivariate distribution, we show the ground truth (GT) and its estimation by a monotonic PC (MPC), a squared monotonic PC (MPC2), and an NPC2 having input layers computing categoricals (embeddings for NPC2s) or Binomials, and with the same number of parameters. Moreover, we show the average log-likelihoods (and one standard deviation with 10 independent runs) on unseen data achieved by an MPC, an MPC2, and an NPC2 with either categorical (h) or Binomial (i) components and by increasing the dimensionality of input layers $K$.

H.3 UCI Continuous Data

Data sets. In Sec. 5 we evaluate NPC2s for density estimation on five multivariate UCI data sets (Dua & Graff, 2017): Power (Hebrail & Berard, 2012), Gas (Fonollosa et al., 2015), Hepmass (Baldi et al., 2016), MiniBooNE (Roe et al., 2004) and BSDS300 patches (Martin et al., 2001) by following the pre-processing by Papamakarios et al. (2017). Table H.1 reports their statistics.

| Data set | $D$ | Train samples | Validation samples | Test samples |
|---|---|---|---|---|
| Power | 6 | 1,659,917 | 184,435 | 204,928 |
| Gas | 8 | 852,174 | 94,685 | 105,206 |
| Hepmass | 21 | 315,123 | 35,013 | 174,987 |
| MiniBooNE | 43 | 29,556 | 3,284 | 3,648 |
| BSDS300 | 63 | 1,000,000 | 50,000 | 250,000 |

Table H.1: UCI data set statistics. Dimensionality $D$ and number of samples of each data set split after the preprocessing by Papamakarios et al. (2017).

Models. We compare monotonic PCs and NPC2s in tensorized form (Def. 1) for density estimation. The tensorized architecture for both is constructed based on either the binary tree (BT) or linear tree (LT) RGs (see App. F). In addition, since both RGs are randomly-constructed, we instantiate eight of them by changing the random seed. By doing so, our monotonic PCs consist of a mixture of tensorized monotonic PCs each defined on a different RG. Conversely, our NPC2s consist of a mixture (with non-negative parameters) of tensorized NPC2s, each constructed by squaring a circuit defined on a different RG. To ensure a fair comparison, monotonic PCs and NPC2s have the exact same structure, but NPC2s allow for negative parameters via the squaring mechanism (see Sec. 3).
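As a small illustration of how such a mixture over several RGs can be evaluated in log-space, the sketch below combines per-component log-densities with non-negative mixture weights through the standard log-sum-exp trick. It assumes that each component is already normalized and that the weights sum to one; the function name `mixture_log_likelihood` is hypothetical.

```python
import math

def mixture_log_likelihood(log_weights, component_log_probs):
    """log sum_i w_i p_i(x), given log w_i and log p_i(x) for each component."""
    terms = [lw + lp for lw, lp in zip(log_weights, component_log_probs)]
    m = max(terms)  # subtract the maximum for numerical stability
    return m + math.log(sum(math.exp(t - m) for t in terms))
```

Since the mixture weights are non-negative and every squared component is non-negative by construction, the plain log-sum-exp is sufficient at this level and no sign needs to be tracked.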

Hyperparameters. We search for hyperparameters by running a grid search with both monotonic PCs and NPC2s. For each UCI data set, Tables H.2 and H.3 report the possible values of each hyperparameter, depending on the chosen RG. In case of input layers modeling spline functions (see App. E), we use quadratic splines and select 512 knots uniformly in the domain space.

Parameters initialization. We found NPC2s to be more sensitive than monotonic PCs to the choice of the parameter initialization method. The effect of initialization in monotonic PCs is not well explored in the literature, and it is even less clear for NPC2s, as parameters are allowed to be negative. In these experiments, we investigated initializing NPC2s by independently sampling the parameters from a normal distribution. However, we found NPC2s to achieve higher log-likelihoods if they are initialized with non-negative parameters only, i.e., by sampling uniformly between 0 and 1. Nevertheless, in Sec. H.5 we show that they still learn negative parameters when converging. Note that our work is a first attempt to learn non-monotonic PCs at scale, thus it opens interesting future directions on how to initialize and learn NPC2s, as well as how to regularize them.

Table H.2: Hyperparameter grid search space for each UCI data set (for BT experiments). Each data set is associated with lists of hyperparameters: learning rate, the dimensionality of layers in tensorized PCs ($K$), batch size, and whether input layers compute Gaussian likelihoods or spline functions (see App. E).

| Data set | Learning rate | $K$ | Batch size | Input layer |
|---|---|---|---|---|
| Power | [0.01, 0.005] | [32, …, 512] | [512, 1024, 2048] | [Gaussian, splines] |
| Gas | same as Power | [32, …, 1024] | [512, 1024, 2048, 4096] | same as Power |
| Hepmass | same as Power | [32, …, 512] | [512, 1024, 2048] | same as Power |
| MiniBooNE | same as Power | [32, …, 512] | [512, 1024, 2048] | same as Power |
| BSDS300 | same as Power | [32, …, 256] | [512, 1024, 2048] | same as Power |
Table H.3: Hyperparameter grid search space for each UCI data set (for LT experiments). Each data set is associated with lists of hyperparameters: learning rate, the dimensionality of layers in tensorized PCs ($K$), batch size, and whether input layers compute Gaussian likelihoods or spline functions (see App. E).

| Data set | Learning rate | $K$ | Batch size | Input layer |
|---|---|---|---|---|
| Power | [0.005, 0.001] | [32, …, 512] | [512, 1024, 2048] | [Gaussian, splines] |
| Gas | same as Power | same as Power | same as Power | same as Power |
| Hepmass | same as Power | same as Power | same as Power | same as Power |
| MiniBooNE | same as Power | same as Power | same as Power | same as Power |
| BSDS300 | same as Power | same as Power | same as Power | same as Power |
Figure H.3: Negative parameters make squared non-monotonic PCs more expressive than squared monotonic PCs. NPC2s ($\pm^2$, vertical) generally achieve higher log-likelihoods than squared monotonic PCs ($+^2$, horizontal) when paired with the same number of units per layer $K$, as shown by the presence of more points in the upper triangle than in the lower triangle for most data sets. Blue circles and red diamonds refer to runs with Gaussian (G) and spline (S) input layers respectively, and darker hues indicate larger $K$. The dashed grey line represents the points of equal log-likelihood for both the NPC2 and the squared monotonic PC.
Table H.4: Squared non-monotonic PCs can be more expressive than monotonic PCs. Best average test log-likelihoods and two standard errors achieved by monotonic PCs (MPC) and NPC2s built either from randomized linear tree RGs (LT) or from randomized binary tree RGs (BT) (see Sec. H.3), when compared to baselines. MPC, MPC2 and NPC2 were experimented with both Gaussian (G) and spline (S) node input layers. † means no values were originally provided.

| Baseline | Power | Gas | Hepmass | MiniBooNE | BSDS300 |
|---|---|---|---|---|---|
| MADE | -3.08 ±0.03 | 3.56 ±0.04 | -20.98 ±0.02 | -15.59 ±0.50 | 148.85 ±0.28 |
| RealNVP | 0.17 ±0.01 | 8.33 ±0.14 | -18.71 ±0.02 | -13.84 ±0.52 | 153.28 ±1.78 |
| MAF | 0.24 ±0.01 | 10.08 ±0.02 | -17.73 ±0.02 | -12.24 ±0.45 | 154.93 ±0.28 |
| NSF | 0.66 ±0.01 | 13.09 ±0.02 | -14.01 ±0.03 | -9.22 ±0.48 | 157.31 ±0.28 |
| Gaussian | -7.74 ±0.02 | -3.58 ±0.75 | -27.93 ±0.02 | -37.24 ±1.07 | 96.67 ±0.25 |
| EiNet-LRS | 0.36 ±† | 4.79 ±† | -22.46 ±† | -34.21 ±† | |
| TTDE | 0.46 ±† | 8.93 ±† | -21.34 ±† | -28.77 ±† | 143.30 ±† |

| Model | Power (G) | Power (S) | Gas (G) | Gas (S) | Hepmass (G) | Hepmass (S) | MiniBooNE (G) | MiniBooNE (S) | BSDS300 (G) | BSDS300 (S) |
|---|---|---|---|---|---|---|---|---|---|---|
| MPC (LT) | 0.51 ±.01 | 0.24 ±.01 | 6.73 ±.03 | -2.05 ±.02 | -22.07 ±.02 | -23.09 ±.02 | -32.48 ±.44 | -37.53 ±.46 | 123.15 ±.28 | 116.90 ±.28 |
| MPC2 (LT) | 0.49 ±.01 | 0.39 ±.01 | 7.06 ±.03 | 0.95 ±.01 | -21.42 ±.02 | -22.24 ±.02 | -29.46 ±.44 | -32.81 ±.47 | | |
| NPC2 (LT) | 0.53 ±.01 | 0.43 ±.01 | 9.00 ±.02 | 3.03 ±.02 | -20.66 ±.02 | -21.53 ±.02 | -26.68 ±.42 | -29.36 ±.42 | 112.99 ±.29 | 120.11 ±.29 |
| MPC (BT) | 0.57 ±.01 | 0.32 ±.01 | 5.56 ±.03 | -2.55 ±.02 | -22.45 ±.02 | -24.09 ±.02 | -32.11 ±.43 | -37.56 ±.46 | 121.92 ±.29 | 123.30 ±.29 |
| MPC2 (BT) | 0.57 ±.01 | 0.36 ±.01 | 8.24 ±.03 | 0.32 ±.02 | -21.47 ±.02 | -23.38 ±.02 | -29.46 ±.43 | -33.43 ±.47 | 125.56 ±.29 | 126.85 ±.29 |
| NPC2 (BT) | 0.63 ±.01 | 0.45 ±.01 | 10.98 ±.02 | 3.12 ±.01 | -20.41 ±.02 | -22.25 ±.02 | -26.92 ±.44 | -30.81 ±.54 | 114.47 ±.28 | 128.38 ±.29 |
Table H.5: Average test log-likelihoods and standard deviation over five independent runs with random initialization, using the same hyperparameters that produced the results shown in Table H.4.

| Model | Power | Gas | Hepmass | MiniBooNE | BSDS300 |
|---|---|---|---|---|---|
| MPC (LT) | 0.46 ±0.03 | 7.03 ±0.18 | -22.07 ±0.02 | -31.79 ±0.39 | 126.66 ±5.46 |
| MPC (BT) | 0.53 ±0.03 | 6.16 ±0.56 | -22.42 ±0.45 | -33.30 ±0.98 | 122.77 ±0.71 |
| NPC2 (LT) | 0.42 ±0.11 | 8.97 ±0.08 | -20.67 ±0.05 | -29.58 ±0.29 | 127.58 ±4.66 |
| NPC2 (BT) | 0.62 ±0.01 | 10.55 ±0.39 | -20.48 ±0.11 | -27.64 ±0.44 | 128.45 ±0.52 |
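Table H.6: Best hyperparameters found via grid search, which were used to achieve the results shown in Table H.4. For input layers, G and S respectively denote Gaussian and spline.

| Model | Data set | $K$ | Batch size | Learning rate | Input layer |
|---|---|---|---|---|---|
| MPC (BT) | Power | 512 | 512 | 0.01 | G |
| MPC (BT) | Gas | 1024 | 4096 | 0.01 | G |
| MPC (BT) | Hepmass | 128 | 512 | 0.01 | G |
| MPC (BT) | MiniBooNE | 32 | 512 | 0.01 | G |
| MPC (BT) | BSDS300 | 512 | 512 | 0.01 | S |
| MPC (LT) | Power | 512 | 512 | 0.001 | G |
| MPC (LT) | Gas | 512 | 1024 | 0.001 | G |
| MPC (LT) | Hepmass | 512 | 512 | 0.005 | G |
| MPC (LT) | MiniBooNE | 512 | 1024 | 0.005 | G |
| MPC (LT) | BSDS300 | 64 | 512 | 0.005 | S |
| NPC2 (BT) | Power | 512 | 512 | 0.01 | G |
| NPC2 (BT) | Gas | 1024 | 512 | 0.01 | G |
| NPC2 (BT) | Hepmass | 256 | 512 | 0.01 | G |
| NPC2 (BT) | MiniBooNE | 32 | 512 | 0.01 | G |
| NPC2 (BT) | BSDS300 | 128 | 512 | 0.01 | S |
| NPC2 (LT) | Power | 512 | 512 | 0.001 | G |
| NPC2 (LT) | Gas | 512 | 512 | 0.001 | G |
| NPC2 (LT) | Hepmass | 256 | 512 | 0.001 | G |
| NPC2 (LT) | MiniBooNE | 128 | 2048 | 0.005 | G |
| NPC2 (LT) | BSDS300 | 32 | 1024 | 0.001 | S |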

H.4 Large Language Model Distillation

Data set. Given the distribution $p^*(\mathbf{x})$ modeled by GPT2 over sentences $\mathbf{x} = [x_1, \ldots, x_D]$ having maximum length $D$, we aim to minimize the Kullback-Leibler divergence $\mathrm{KL}[p^* \mid p]$, where $p$ is modeled by a PC. Minimizing such a divergence is equivalent to learning the PC by maximum likelihood on data sampled from GPT2. Therefore, following the experimental setting by Zhang et al. (2023), we sample a data set of 8M sentences using GPT2 having bounded length, i.e., with a maximum of $D = 32$ tokens. Then, we split such sentences into training, validation and test sets having proportions 0.85/0.05/0.10, respectively.
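The equivalence between minimizing the divergence and maximum-likelihood learning follows from a standard argument, sketched below. The symbols $N$ (number of sampled sentences) and $\mathbf{x}^{(n)}$ (the $n$-th sampled sentence) are notation introduced here for illustration only.

```latex
\begin{aligned}
\mathrm{KL}[p^* \mid p]
  &= \mathbb{E}_{\mathbf{x} \sim p^*}\left[ \log p^*(\mathbf{x}) \right]
   - \mathbb{E}_{\mathbf{x} \sim p^*}\left[ \log p(\mathbf{x}) \right] \\
  &\approx \mathrm{const} - \frac{1}{N} \sum_{n=1}^{N} \log p(\mathbf{x}^{(n)}),
  \qquad \mathbf{x}^{(n)} \sim p^* .
\end{aligned}
```

The first expectation does not depend on the parameters of $p$, so minimizing the divergence with respect to $p$ amounts to maximizing the average log-likelihood of sentences sampled from GPT2.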

Models. Then, we learn a monotonic PC and an NPC2 as tensorized circuits whose architecture is determined by a linear tree RG (Def. 2), i.e., a region graph that recursively partitions each set of finitely-discrete variables $\{X_i, \ldots, X_D\}$ into $\{X_i\}$ and $\{X_{i+1}, \ldots, X_D\}$ for $1 \leq i \leq D-1$ (e.g., see Fig. 2). This is because we are interested in exploiting the sequential dependencies between words in a sentence. With this structure, the monotonic PC is equivalent to an inhomogeneous hidden Markov model (HMM), while the NPC2 corresponds to a Born machine (see Sec. B.4.1 for details).
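To give an intuition of why such a squared linear-tree model remains tractable, the following sketch evaluates a small Born machine written as a squared matrix-product state. It is an illustration under stated assumptions, not the circuit implementation used in the experiments: the amplitude is assumed to have the form $c(\mathbf{x}) = \mathbf{v}^\top \mathbf{A}_1[x_1] \cdots \mathbf{A}_D[x_D] \mathbf{w}$ with real (possibly negative) entries, and all names (`amplitude`, `partition_function`, `cores`) are hypothetical.

```python
import itertools
import random

def amplitude(x, v, cores, w):
    """c(x) = v^T A_1[x_1] ... A_D[x_D] w, which may be negative."""
    state = list(v)
    for core, token in zip(cores, x):
        A = core[token]
        state = [sum(state[i] * A[i][j] for i in range(len(state)))
                 for j in range(len(A[0]))]
    return sum(s * wi for s, wi in zip(state, w))

def partition_function(v, cores, w):
    """Z = sum_x c(x)^2, computed by propagating an r x r matrix M.

    M starts as v v^T and is updated as M <- sum_token A^T M A, so the cost
    is linear in the sequence length rather than exponential.
    """
    r = len(v)
    M = [[v[i] * v[j] for j in range(r)] for i in range(r)]
    for core in cores:
        new = [[0.0] * r for _ in range(r)]
        for A in core:  # one r x r matrix per symbol
            MA = [[sum(M[i][k] * A[k][j] for k in range(r)) for j in range(r)]
                  for i in range(r)]
            for i in range(r):
                for j in range(r):
                    new[i][j] += sum(A[k][i] * MA[k][j] for k in range(r))
        M = new
    return sum(w[i] * M[i][j] * w[j] for i in range(r) for j in range(r))

def random_model(D, V, r, seed=0):
    """Random real-valued parameters: D cores, V symbols, rank r."""
    rng = random.Random(seed)
    cores = [[[[rng.gauss(0.0, 1.0) for _ in range(r)] for _ in range(r)]
              for _ in range(V)] for _ in range(D)]
    v = [rng.gauss(0.0, 1.0) for _ in range(r)]
    w = [rng.gauss(0.0, 1.0) for _ in range(r)]
    return v, cores, w
```

The probability of a sequence is then `amplitude(x, v, cores, w) ** 2 / partition_function(v, cores, w)`. For small $D$ the normalization constant can be checked against the brute-force sum over all sequences produced by `itertools.product`.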

Hyperparameters. All PCs are learned by batched stochastic gradient descent using Adam (Kingma & Ba, 2015) as optimizer with batch size 4096, and we continue optimizing until either the validation loss does not improve after three consecutive epochs or the maximum budget of 200 epochs has been reached. We perform multiple runs by exploring combinations of learning rates and initialization methods. For monotonic PCs, we run experiments by choosing learning rates in $\{5 \cdot 10^{-3}, 10^{-2}, 5 \cdot 10^{-2}\}$ and initializing parameters by sampling uniformly in $(0, 1)$, by sampling from a standard log-normal distribution, or by sampling from a Dirichlet distribution with concentration values set to 1. Similarly, for NPC2s we run experiments by choosing the same learning rates as for monotonic PCs, but using different initialization methods. Since squaring results in much larger outputs when compared to monotonic PCs, we initialize NPC2s such that the magnitude of parameters is relatively small. That is, in addition to sampling uniformly in $(0, 10^{-1})$, we also initialize the parameters by sampling from a normal distribution having mean 0 and standard deviation $10^{-1}$. By doing so, we initialize an approximately even number of positive and negative parameters. Moreover, we also experiment with initializing parameters by sampling from a normal distribution with mean $10^{-1}$ and standard deviation $10^{-1}$, which initializes more parameters to be positive.

Results. For increasing layer dimensionality, we group together runs having different learning rates and initialization methods and show the achieved log-likelihoods in Fig. 5. Moreover, for layer dimensionalities $K \leq 256$, we report twice as many log-likelihood points by repeating the runs with a different seed. Then, we perform statistical tests to assess the significance of NPC2s achieving higher log-likelihoods than monotonic PCs on the test data, and show the p-values in Table H.7.

| $K$ | 32 | 64 | 128 | 256 | 512 | 1024 |
|---|---|---|---|---|---|---|
| p-value | 1.0000 | 0.1071 | **0.0019** | **0.0020** | **< 0.0001** | **< 0.0001** |

Table H.7: Statistical significance of NPC2s achieving higher likelihoods on LLM distillation. We perform a one-sided Mann-Whitney U test between the log-likelihoods achieved by NPC2s and monotonic PCs on the test data (see also Fig. 5), using a total of 36 runs for layer dimensionalities $K \leq 256$ and 18 runs for $K > 256$. We highlight the p-values that are consistent with a 99% confidence interval in bold.
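For readers unfamiliar with the test, the sketch below computes the one-sided Mann-Whitney U statistic between two groups of log-likelihoods. It is an illustration only: the p-value uses the large-sample normal approximation without tie or continuity corrections, which may differ from the procedure used to produce Table H.7, and the function name is hypothetical. In practice a library routine such as SciPy's `scipy.stats.mannwhitneyu(x, y, alternative="greater")` would typically be used instead.

```python
import math

def mann_whitney_u_greater(xs, ys):
    """One-sided test that values in xs tend to be larger than values in ys.

    Returns (U, p), where U counts the pairs (x, y) with x > y (ties count 0.5)
    and p comes from the normal approximation of U under the null hypothesis.
    """
    n, m = len(xs), len(ys)
    u = sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in xs for y in ys)
    mean = n * m / 2
    std = math.sqrt(n * m * (n + m + 1) / 12)
    z = (u - mean) / std
    p = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail probability P(Z >= z)
    return u, p
```

Here `xs` would hold the test log-likelihoods of the NPC2 runs and `ys` those of the monotonic PC runs for a given $K$. A small p-value indicates that the NPC2 runs achieve systematically higher log-likelihoods.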

H.5 Histograms of Learned Parameters

In Fig. H.4, we show the parameters of both monotonic PCs and NPC2s learned in our experiments (Sec. 5), i.e., distribution estimation on UCI data sets (Table H.1) and sentences sampled from GPT2. Even though we initialize the parameters of NPC2s to be non-negative in our experiments (i.e., by sampling from a uniform distribution located on the non-negative side), they still end up learning negative parameters. In particular, Fig. H.4 shows the histograms of the sum layer parameters of monotonic PCs and NPC2s learned on UCI data sets having the same model size, i.e., with layer dimensionality $K = 512$ for Power and Gas, $K = 128$ for Hepmass and MiniBooNE, and $K = 64$ for BSDS300. For the remaining hyperparameters, we choose batch size 512, learning rate $10^{-2}$, quadratic splines as input layers, and build a mixture of tree-shaped circuits (as described in Sec. H.3). For the models learned on GPT2 sentences, we use learning rate $10^{-2}$ and uniform initialization with non-negative values (see also Sec. H.4). Note that for NPC2s we report the parameters of the circuit after being squared with Alg. 1, thus resulting in a quadratic increase in the number of parameters.
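The quadratic increase can be illustrated on a single sum layer. The sketch below assumes that squaring a sum layer with weights $\mathbf{W} \in \mathbb{R}^{S \times K}$ yields a sum layer with weights $\mathbf{W} \otimes \mathbf{W} \in \mathbb{R}^{S^2 \times K^2}$, which follows from the identity $(\mathbf{W}\mathbf{x}) \otimes (\mathbf{W}\mathbf{x}) = (\mathbf{W} \otimes \mathbf{W})(\mathbf{x} \otimes \mathbf{x})$. The helper names are hypothetical and the example weights are arbitrary.

```python
def kronecker(A, B):
    """Kronecker product of two matrices given as lists of rows."""
    return [[a * b for a in row_a for b in row_b] for row_a in A for row_b in B]

def matvec(W, x):
    """Matrix-vector product W x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

# A 2 x 2 weight matrix with one negative entry (arbitrary example values).
W = [[1.0, -2.0],
     [0.5, 3.0]]
W_squared = kronecker(W, W)  # 4 x 4: the number of parameters is squared
```

Here `W_squared` has $2^2 \times 2^2 = 16$ entries, some of which are negative (e.g., its first row is `[1.0, -2.0, -2.0, 4.0]`). This is the kind of parameter whose histogram is reported for NPC2s after squaring.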

[Figure H.4 panels: (a) Power, (b) Gas, (c) Hepmass, (d) MiniBooNE, (e) BSDS300, (f) GPT2 sentences.]
Figure H.4: NPC2s learn negative parameters with non-negative initializations. Histograms containing the 99% interquartile range of the sum layers parameters of monotonic PCs ($+$, blue) and NPC2s after being squared ($\pm^2$, orange) that are learned on UCI data sets (a-e) (Sec. H.3) and on sentences sampled from GPT2 (f) (Sec. H.4). Even if the chosen NPC2s are initialized with non-negative parameters (see Sec. H.5), they converge to a model instance with negative parameters.