Standardizing Structural Causal Models

Weronika Ormaniec
ETH Zürich
Zürich, Switzerland

Scott Sussex^∗
ETH Zürich
Zürich, Switzerland

Lars Lorch^∗
ETH Zürich
Zürich, Switzerland

Bernhard Schölkopf
MPI for Intelligent Systems
Tübingen, Germany

Andreas Krause
ETH Zürich
Zürich, Switzerland

^∗ Equal contribution.

Abstract

Synthetic datasets generated by structural causal models (SCMs) are commonly used for benchmarking causal structure learning algorithms. However, the variances and pairwise correlations in SCM data tend to increase along the causal ordering. Several popular algorithms exploit these artifacts, possibly leading to conclusions that do not generalize to real-world settings. Existing metrics like $\operatorname{Var}$ -sortability and $\operatorname{R^{2}}$ -sortability quantify these patterns, but they do not provide tools to remedy them. To address this, we propose internally-standardized structural causal models (iSCMs), a modification of SCMs that introduces a standardization operation at each variable during the generative process. By construction, iSCMs are not $\operatorname{Var}$ -sortable, and as we show experimentally, not $\operatorname{R^{2}}$ -sortable either for commonly-used graph families. Moreover, contrary to the post-hoc standardization of data generated by standard SCMs, we prove that linear iSCMs are less identifiable from prior knowledge on the weights and do not collapse to deterministic relationships in large systems, which may make iSCMs a useful model in causal inference beyond the benchmarking problem studied here.

1 Introduction

Predicting the effects of interventions and policy decisions requires reasoning about causality. Consequently, scientific fields ranging from biology and earth sciences to economics and statistics are interested in modeling causal structure (Pearl,, 2009; Maathuis et al.,, 2010; Imbens and Rubin,, 2015; Runge et al.,, 2019). A wide array of causal discovery algorithms has been proposed with the goal of inferring causal structure from data (e.g., Squires and Uhler,, 2022; Vowels et al.,, 2022). However, benchmarking these algorithms is challenging, since real-world datasets with an agreed-upon, ground-truth causal structure are rare (e.g., Sachs et al.,, 2005; see Mooij et al.,, 2020). The community predominantly relies on synthetic data for evaluating structure learning algorithms, where observations are generated according to a predetermined causal structure and system mechanisms. The inferred causal structures can then be directly compared to the ground truth. To generate synthetic data, it is common practice to sample from structural causal models with additive noise (SCMs) (Reisach et al.,, 2021). Unless stated otherwise, this work considers SCMs in which the variance scale of the additive noise is the same for all variables, a typical simplification made in benchmarking.

Under common benchmarking practices, synthetic datasets generated by SCMs contain patterns that are directly exploitable to make structure discovery easier. We will refer to such patterns as artifacts. In SCMs, the pairwise correlations between variables tend to increase along the causal ordering, since variance builds up downstream and, as a result, the proportion of the variance driven by the additive noise vanishes (Figure 1(a)). Reisach et al., (2024) characterize this phenomenon through an increase of the coefficients of determination ( $\operatorname{R^{2}}$ ) of the variables regressed on all others. Crucially, this artifact occurs both in the raw data and when shifting and scaling (standardizing) the variables to have zero mean and unit variance. One of the implications is that downstream causal dependencies in SCMs become effectively deterministic, especially in large-scale systems. As Reisach et al., (2024) demonstrate, simple causal discovery baselines can perform competitively on benchmarks of this kind by directly exploiting this phenomenon. This makes SCMs in their general definition possibly unsuitable for the purpose of benchmarking and, as we will argue, to some degree suboptimal for inferring causality more broadly. Ultimately, benchmarking on synthetic data with these patterns could lead to conclusions that do not generalize to real-world scenarios.

In this work, we propose a simple modification of SCMs that stabilizes the data-generating process and thereby removes exploitable covariance artifacts. Our models, denoted internally-standardized SCMs (iSCMs), introduce a standardization operation at each variable during the generative process (Figure 1(b)). In Section 4, we provide a theoretical motivation for this idea by studying linear iSCMs. We prove that, contrary to SCMs, the causal dependencies of iSCMs under mild assumptions never collapse to deterministic mechanisms as the graph size becomes large. Moreover, we formalize the correlation artifact commonly observed in benchmarks by proving that linear SCM structures in a Markov equivalence class (MEC) are partially identifiable for certain graph classes, given weak prior knowledge on the weight distribution of the ground-truth SCM. Most importantly, we show that this is not the case for the corresponding iSCMs. In Section 5, we empirically demonstrate that the baselines proposed in Reisach et al., (2021, 2024) are unable to exploit covariance artifacts in iSCMs, while practical classes of causal discovery algorithms are still able to learn causal structures in both linear and nonlinear systems. Our findings also reveal that SCM artifacts affect structure learning both positively and negatively, suggesting that generating (standardized) data from standard SCMs may be particularly ill-suited for benchmarking common approaches in use today.

Figure 1: Standardizing SCMs two ways. Panels: (a) Standardized SCM, (b) iSCM. Generative process for a chain graph of (a) standard SCMs, with data $\mathbf{x}$ standardized post-hoc, and (b) SCMs with standardization performed during the generative process (iSCMs). Dashed arrows indicate z-standardization. Solid arrows indicate linear functions with weights from $\operatorname{Unif}_{\pm}[0.5,2.0]$ and additive noise from $\mathcal{N}(0,1)$. We report absolute correlations $\lvert\rho\rvert$ of two consecutive observed variables, (a) $x_{j}^{s}$ and $x_{j+1}^{s}$, or (b) $\widetilde{x}_{j}$ and $\widetilde{x}_{j+1}$, averaged over 100,000 models. In standard SCMs (a), correlations tend to increase along the causal ordering.

2 Background and Related Work

We begin by introducing structural causal models and the problem of causal structure learning, before discussing how synthetic data is often generated for evaluating structure learning algorithms. We then review existing works that study identifiability and patterns frequently present in synthetic data.

Structural causal models

A structural causal model (SCM) (Peters et al., 2017) of $d$ variables $\mathbf{x}=\{x_{1},\dots,x_{d}\}$ consists of a collection of structural assignments, each given by

$$x_{i}:=f_{i}(\mathbf{x}_{\mathrm{pa}(i)},\varepsilon_{i})\,, \tag{SCM}$$

where $\mathbf{x}_{\mathrm{pa}(i)}\subseteq\mathbf{x}\setminus\{x_{i}\}$ are called the parents of $x_{i}$ . Here, $f_{i}$ are arbitrary functions, and $\varepsilon_{i}$ are independent random variables that model exogenous noise (or unexplained variation). Together, they entail a joint probability distribution $p(\mathbf{x})$ over the variables $\bf{x}$ . It is common to consider SCMs with additive noise, e.g., with linear functions $f_{i}$ , as given by

$$f_{i}(\mathbf{x}_{\mathrm{pa}(i)},\varepsilon_{i})=\mathbf{w}_{i}^{\top}\mathbf{x}_{\mathrm{pa}(i)}+\varepsilon_{i}\,. \tag{1}$$

The structural assignments in (SCM) induce a causal graph $\mathcal{G}=(\mathcal{V},\mathcal{E})$ over the variables $x_{i}$ , which is assumed to be acyclic. Specifically, the directed acyclic graph (DAG) $\mathcal{G}$ has vertices $v_{i}\in\mathcal{V}$ for every $x_{i}\in\mathbf{x}$ and a directed edge $(i,j)\in\mathcal{E}$ if $x_{i}\in\mathbf{x}_{\mathrm{pa}(j)}$ . We will explicitly distinguish this DAG $\mathcal{G}$ and its vertices $\mathcal{V}$ from the variables $\mathbf{x}$ . The skeleton of $\mathcal{G}$ denotes $\mathcal{G}$ with all edges undirected. If the skeleton of $\mathcal{G}$ is acyclic, we call $\mathcal{G}$ a forest.

Structure learning and benchmarking

Given a set of i.i.d. observations from the probability distribution $p(\mathbf{x})$ induced by an unknown SCM, causal structure learning aims to infer the causal graph $\mathcal{G}$ underlying the SCM. In this work, we focus on structure learning from observational, not interventional, data and only consider SCMs with no latent confounders.

Because it is difficult to obtain the ground-truth $\mathcal{G}$ for many real-world datasets, it is common to evaluate structure learning algorithms on synthetic data where $\mathcal{G}$ is known. A ubiquitous approach is to sample a DAG $\mathcal{G}$ , then SCM functions defined over $\mathcal{G}$ , and finally a dataset from this SCM, with the goal of later recovering $\mathcal{G}$ from the data. It is common to consider $\varepsilon_{i}$ with mean $0$ and fixed variance (often $1$ ), and for linear systems, to sample each $w_{i,j}$ uniformly and i.i.d. with support bounded away from $0$ (Shimizu et al.,, 2011; Peters and Bühlmann,, 2014; Zheng et al.,, 2018; Yu et al.,, 2019; Lachapelle et al.,, 2020; Zheng et al.,, 2020; Ng et al.,, 2020; Reisach et al.,, 2021; Lorch et al.,, 2022; Reisach et al.,, 2024). There exist alternative benchmarking strategies that involve sampling data from domain-specific simulators (Schaffter et al.,, 2011; Dibaeinia and Sinha,, 2020).
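The sampling recipe above (DAG, then weights, then data) can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' benchmark code: the function names are hypothetical, the edge probability `k / (d - 1)` is one assumed reading of "expected degree $k$", and the weight range $[0.5, 2.0]$ with random sign is taken from the caption of Figure 1.

```python
import numpy as np

def sample_er_dag(d, k, rng):
    """Erdos-Renyi DAG as a boolean matrix; adj[i, j] is True for an edge i -> j.
    Assumption: each of the d*(d-1)/2 possible edges appears with prob. k/(d-1)."""
    p = min(1.0, k / max(d - 1, 1))
    upper = np.triu(rng.random((d, d)) < p, k=1)  # acyclic by construction
    perm = rng.permutation(d)                     # hide the ordering in the indices
    return upper[np.ix_(perm, perm)]

def sample_weights(adj, rng, low=0.5, high=2.0):
    """Weights with magnitude uniform in [low, high] and a random sign."""
    mag = rng.uniform(low, high, size=adj.shape)
    sign = rng.choice([-1.0, 1.0], size=adj.shape)
    return adj * mag * sign

def topological_order(adj):
    """Kahn's algorithm; returns fewer than d nodes if adj contains a cycle."""
    indeg = adj.sum(axis=0).astype(int)
    ready = [j for j in range(adj.shape[0]) if indeg[j] == 0]
    order = []
    while ready:
        i = ready.pop()
        order.append(i)
        for j in np.flatnonzero(adj[i]):
            indeg[j] -= 1
            if indeg[j] == 0:
                ready.append(int(j))
    return order

def sample_linear_scm(W, n, rng, noise_std=1.0):
    """Draw n samples from x_j = sum_i W[i, j] x_i + eps_j with Gaussian noise."""
    d = W.shape[0]
    X = np.zeros((n, d))
    for j in topological_order(W != 0):
        X[:, j] = X @ W[:, j] + rng.normal(0.0, noise_std, size=n)
    return X

rng = np.random.default_rng(0)
adj = sample_er_dag(d=10, k=2, rng=rng)
W = sample_weights(adj, rng)
X = sample_linear_scm(W, n=1000, rng=rng)
```

A structure learning algorithm would then receive only `X` and be scored against `adj`.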

Data standardization and artifacts of SCMs

Previous work shows that generating data as described above can lead to strong artifacts. Reisach et al., (2021) observe that the variance of variables tends to increase along the topological ordering of $\mathcal{G}$ . This leads to the $\operatorname{Var}$ -SortnRegress baseline, which sorts variables based on their empirical variance and then performs sparse regression to infer $\mathcal{G}$ . Seng et al., (2024) show that structure learning algorithms minimizing an MSE-based loss (e.g., Zheng et al.,, 2018) can identify $\mathcal{G}$ under similar conditions. Therefore, Reisach et al., (2021) propose using standardization (Figure 1(a)) to remove this variance artifact from benchmarks. Specifically, they first sample all $x_{i}$ according to a standard SCM and then post-hoc transform the variables as

$$x_{i}^{s}:=\frac{x_{i}-\mathbb{E}[x_{i}]}{\sqrt{\operatorname{Var}[x_{i}]}}\,, \tag{Standardized SCM}$$

such that our observations correspond to samples from $p(\bf{x}^{s})$ . Standardization, however, only removes the variance artifact. Even in standardized SCMs, the fraction of a variable’s variance that is explained by all others, measured by the coefficient of determination $\operatorname{R^{2}}$ , tends to increase along the topological ordering (Reisach et al.,, 2024). $\operatorname{R^{2}}$ -SortnRegress exploits this correlation artifact analogously to $\operatorname{Var}$ -SortnRegress. Existing heuristics aiming to avoid the increasing correlations adjust the sampling process of $f_{i}$ , but they ultimately limit the causal dependencies that can be modeled, e.g., to certain levels of correlations among the observed $\mathbf{x}$ (Mooij et al.,, 2020) or a constant proportion of variance explained by the parents $\mathbf{x}_{\mathrm{pa}(i)}$ (Squires et al.,, 2022) (Appendix D.1). To our knowledge, there are currently no general methods for generating SCM data without strong correlation artifacts or significant limitations on the mechanisms $f_{i}$ and noise $\varepsilon_{i}$ .
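To see why post-hoc standardization leaves the correlation artifact in place, consider a chain $x_{1}\to x_{2}\to\dots$ with equal noise variances. The sketch below computes the population correlations of consecutive variables in closed form; the function name and the constant weight $1.5$ are illustrative assumptions. Because correlations are invariant to shifting and scaling, the same values hold for the standardized variables $x_{j}^{s}$.

```python
import numpy as np

def chain_scm_correlations(weights, noise_var=1.0):
    """|corr(x_j, x_{j+1})| for x_1 = eps_1 and x_{j+1} = w_j * x_j + eps_{j+1}."""
    var = noise_var                      # Var[x_1]
    corrs = []
    for w in weights:
        next_var = w**2 * var + noise_var            # Var[x_{j+1}]
        corrs.append(abs(w) * np.sqrt(var / next_var))
        var = next_var
    return np.array(corrs)

# Variance accumulates downstream, so the noise explains an ever smaller share.
print(chain_scm_correlations(np.full(6, 1.5)))
```

With a constant weight, the printed sequence is strictly increasing and approaches $1$, which is the pattern that sorting-based baselines can exploit.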

Identifiability

Given a class of SCMs, there may be several SCMs with different causal graphs $\mathcal{G}$ that entail the same distribution $p(\mathbf{x})$ (Peters et al.,, 2017). Thus, even with infinite observations from $p(\mathbf{x})$ , we may be unable to identify the causal graph $\mathcal{G}$ that generated the observations. However, some identifiability results are known depending on the class of functions and noise distributions of the SCMs considered. For example, among all linear SCMs (1) with Gaussian noise $\varepsilon_{i}\sim\mathcal{N}(0,\sigma^{2}_{i})$ , the graph $\mathcal{G}$ can only be uniquely identified up to its MEC (Verma and Pearl,, 2013). However, if the noise scales $\sigma_{i}=\sigma$ are equal (Peters and Bühlmann,, 2014) or the noise is non-Gaussian (Shimizu et al.,, 2006), $\mathcal{G}$ can be uniquely identified given $p(\bf{x})$ .

It is fundamental to recognize that existing identifiability results only concern the unstandardized distributions $p(\bf{x})$ of SCMs. When we standardize the data and observe $p(\bf{x}^{s})$ instead, existing results no longer apply, because the implied SCM after standardization may violate the properties of the original SCM (e.g., its noise variances). In this work, we present, to our knowledge, the first (partial) identifiability result for standardized SCMs. Our result concerns a setting with prior knowledge on the magnitudes of the weights $\mathbf{w}_{i}$ in Equation 1, an assumption underlying common benchmarking practices.

3 SCMs with Internal Standardization

3.1 Definition

We propose internally-standardized SCMs (iSCMs) as a modification to the standard data-generating process of SCMs. An iSCM $(\mathbf{S},\mathcal{P}_{\bm{\varepsilon}})$ consists of $d$ pairs of assignments, where for each $i\in\{1,\dots,d\}$ ,

$$x_{i}:=f_{i}(\mathbf{\widetilde{x}}_{\mathrm{pa}(i)},\varepsilon_{i})\quad\text{and}\quad\widetilde{x}_{i}:=\frac{x_{i}-\mathbb{E}[x_{i}]}{\sqrt{\operatorname{Var}[x_{i}]}} \tag{iSCM}$$

with parents $\smash{\mathbf{\widetilde{x}}_{\mathrm{pa}(i)}\subseteq\widetilde{\bf{x}}\setminus\{\widetilde{x}_{i}\}}$ of $\widetilde{x}_{i}$ in the underlying DAG and jointly-independent exogenous noise variables $\bm{\varepsilon}=[\varepsilon_{1},...,\varepsilon_{d}]\sim\mathcal{P}_{\bm{\varepsilon}}$. The variables $x_{i}$ are latent, and the variables $\smash{\widetilde{x}_{i}}$ are observed. Figure 2 illustrates the generative process. Algorithm 1 summarizes how to sample from (iSCM). If computing the population expectations and variances of $x_{i}$ is intractable, the empirical statistics obtained from $n$ samples can be used for standardization at each loop iteration of Algorithm 1.

Motivation

By construction, iSCMs model observed variables with zero mean and unit marginal variance. Through the standardization operation, iSCMs avoid the accumulation of variance downstream in the causal ordering that can occur in standard SCMs (see Figure 1). Because each variable $x_{i}$ only depends on the standardized variables $\mathbf{\widetilde{x}}_{\mathrm{pa}(i)}$, the relative scales of the noise distribution $\mathcal{P}_{\varepsilon_{i}}$ and the causal mechanisms $f_{i}$ are the same everywhere in the system and do not change, for example, downstream in the causal ordering. The causal mechanisms of iSCMs are thus scale-free, in that the local interaction of mechanism $f_{i}$ and noise $\varepsilon_{i}$ occurs at a scale independent of the position of $x_{i}$ in the global ordering. This property makes iSCMs particularly useful for benchmarking, where random ground-truth models are commonly generated from a fixed distribution over functions $f_{i}$ and noise $\varepsilon_{i}$. Contrary to existing heuristics (Section 2), iSCMs model arbitrarily strong or weak causal dependencies and levels of cause-explained variance.

Figure 2: Causal mechanisms in iSCMs. The function $f_{i}$ modeling $x_{i}$ depends on the standardized $\mathbf{\widetilde{x}}_{\mathrm{pa}(i)}$. Dashing indicates z-standardization.

Algorithm 1 Sampling from an iSCM

Input: DAG $\mathcal{G}$, noise distribution $\mathcal{P}_{\bm{\varepsilon}}$
Input: functions $\{f_{1},...,f_{d}\}$

$\pi\leftarrow$ topological ordering of $\mathcal{G}$
for $i=1$ to $d$ do
    $\varepsilon_{\pi_{i}}\sim\mathcal{P}_{\varepsilon_{\pi_{i}}}$
    $x_{\pi_{i}}\leftarrow f_{\pi_{i}}(\mathbf{\widetilde{x}}_{\mathrm{pa}(\pi_{i})},\varepsilon_{\pi_{i}})$
    $\widetilde{x}_{\pi_{i}}\leftarrow\frac{x_{\pi_{i}}-\mathbb{E}[x_{\pi_{i}}]}{\sqrt{\operatorname{Var}[x_{\pi_{i}}]}}$
return $[\widetilde{x}_{1},\dots,\widetilde{x}_{d}]$    $\triangleright$ $\in\mathbb{R}^{d}$
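Algorithm 1 can be sketched for the linear case as follows. This is an illustrative implementation under stated assumptions, not the authors' code: it uses Gaussian noise, draws all $n$ samples at once, and standardizes with the empirical mean and standard deviation of those samples, as suggested above for the case where population statistics are unavailable. The function name and the matrix convention (`W[i, j]` is the weight of edge $i\to j$) are choices made here.

```python
import numpy as np

def sample_linear_iscm(W, order, n, rng, noise_std=1.0):
    """Draw n samples from a linear iSCM.

    W[i, j] is the weight of edge i -> j (zero if absent) and `order`
    is a topological ordering of the DAG encoded by W.
    """
    d = W.shape[0]
    X_tilde = np.zeros((n, d))
    for j in order:
        # Latent variable: linear function of the *standardized* parents plus noise.
        x = X_tilde @ W[:, j] + rng.normal(0.0, noise_std, size=n)
        # Observed variable: z-standardize before any child can use it.
        X_tilde[:, j] = (x - x.mean()) / x.std()
    return X_tilde

# Chain 0 -> 1 -> 2 -> 3 with constant weight 1.5.
W = np.zeros((4, 4))
W[0, 1] = W[1, 2] = W[2, 3] = 1.5
X = sample_linear_iscm(W, order=[0, 1, 2, 3], n=20_000, rng=np.random.default_rng(0))
print(np.abs(np.corrcoef(X, rowvar=False)).round(2))
```

In this chain, the correlations of consecutive variables should all be close to $1.5/\sqrt{1.5^{2}+1}$, independent of the position in the ordering.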

Interventions

Analogous to standard SCMs, interventions in iSCMs can be defined as modifications of the structural assignments $f_{i}$ in (iSCM) (Figure 2), while keeping the standardization operation based on the observational distribution. When the population statistics for standardization are intractable, we first sample observational data to obtain empirical statistics. Since we do not study interventions in this work, we defer a further discussion of interventions in iSCMs to Appendix B.

Units

When modeling a physical system, the functional mechanisms in standard SCMs have to account for the difference in units between the variables for the model to be unit-covariant (see Villar et al., 2023). A side-effect of internal standardization is that variables of iSCMs become unit-less, so iSCMs obey the passive symmetry of unit covariance by construction. Therefore, iSCMs naturally model both unit-less quantities and variables measured in different units, which can make them useful beyond benchmarking. Learned iSCMs would be invariant to the units chosen by the experimenter, similar to the physical world being independent of the mathematical models chosen to describe it.

3.2 Implied SCMs

It is natural to investigate whether SCMs can generate the same observations as standardized SCMs or iSCMs, given the same causal graph $\mathcal{G}$ and exogenous variables $\bm{\varepsilon}$ . In other words, can standardized SCMs and iSCMs be written as SCMs? For both models, the answer is yes. Specifically, we can express the generative process of $\smash{\bf{x}^{s}}$ in (Standardized SCM) and $\smash{\widetilde{\bf{x}}}$ in (iSCM) as

$$x^{s}_{i}=g^{s}_{i}(\mathbf{x}_{\mathrm{pa}(i)}^{s})+\theta^{s}_{i}\varepsilon_{i}\quad\text{and}\quad\widetilde{x}_{i}=\widetilde{g}_{i}(\mathbf{\widetilde{x}}_{\mathrm{pa}(i)})+\widetilde{\theta}_{i}\varepsilon_{i}\,, \tag{2}$$

respectively, by moving the standardization operations into the causal mechanisms of the observables but leaving the DAG $\mathcal{G}$ and the variables $\bm{\varepsilon}$ unchanged. Appendix A describes how to construct these implied causal mechanisms $\smash{g^{s}_{i}}$ and $\smash{\widetilde{g}_{i}}$ and implied noise scales $\smash{\theta^{s}_{i}}$ and $\smash{\widetilde{\theta}_{i}}$ . We refer to the above SCM form of a standardized SCM or an iSCM with additive noise as their implied (SCM) model. Correspondingly, the implied SCMs have zero mean and unit variance. The notion of implied SCMs is powerful, because it enables us to analyze standardized SCMs and iSCMs as SCMs, and it sheds light on the performance of structure learning algorithms that assume unstandardized SCMs to underlie the generative process of the data (e.g., Shimizu et al.,, 2011; Zheng et al.,, 2018; Yu et al.,, 2019; Lachapelle et al.,, 2020; Zheng et al.,, 2020).
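For intuition, the implied SCMs can be written out in the linear case of Equation (1) with zero-mean noise, where all variables have zero mean. The following is a sketch of that special case only (the general construction is the one in Appendix A); the shorthand $s_{i}$ and $\tilde{s}_{i}$ for the standard deviations is introduced here and is not notation from the paper.

```latex
% Linear mechanisms, zero-mean noise. w_{j,i} is the weight of edge j -> i.
\begin{align*}
\text{Standardized SCM:}\quad
x_i^{s} &= \frac{x_i}{s_i}
  = \sum_{j\in\mathrm{pa}(i)} \frac{w_{j,i}\,s_j}{s_i}\,x_j^{s}
    + \frac{1}{s_i}\,\varepsilon_i ,
  & s_i &= \sqrt{\operatorname{Var}[x_i]} \ \text{(SCM variable $x_i$)},\\
\text{iSCM:}\quad
\widetilde{x}_i &= \frac{x_i}{\tilde{s}_i}
  = \sum_{j\in\mathrm{pa}(i)} \frac{w_{j,i}}{\tilde{s}_i}\,\widetilde{x}_j
    + \frac{1}{\tilde{s}_i}\,\varepsilon_i ,
  & \tilde{s}_i &= \sqrt{\operatorname{Var}[x_i]} \ \text{(latent iSCM variable $x_i$)}.
\end{align*}
```

In this special case, the implied noise scales are $\theta^{s}_{i}=1/s_{i}$ and $\widetilde{\theta}_{i}=1/\tilde{s}_{i}$. The implied weights of the standardized SCM depend on the ratio $s_{j}/s_{i}$ of parent and child scales, whereas those of the iSCM depend only on the local scale $\tilde{s}_{i}$.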

To provide a first characterization of standardized SCMs and iSCMs, our theoretical analyses focus on systems where $f_{i}$ are linear functions with additive, zero-mean noise as given by Equation (1). As a stepping stone for this analysis, we derive an analytical expression for the covariance of linear SCMs whose variables have unit variance by construction, without any form of standardization:

Lemma 1 (Covariance in linear SCMs with unit marginal variances).

Let $\bf{x}$ be modeled by a linear SCM defined by (1) with DAG $\mathcal{G}$ that satisfies $\operatorname{Var}[x_{i}]=1$ . Then, the covariance $\operatorname{Cov}[x_{i},x_{j}]$ is the sum of products of the weights along all unblocked paths between the nodes of $x_{i}$ and $x_{j}$ in $\mathcal{G}$ . Specifically, for any $i,j\in\{1,...,d\}$ such that $i\neq j$ , it holds that

$$\operatorname{Cov}[x_{i},x_{j}]=\sum_{p_{j\leftrightarrow i}\in P_{j\leftrightarrow i}}\prod_{(l,m)\in p_{j\leftrightarrow i}}w_{l,m}\,, \tag{3}$$

where $P_{j\leftrightarrow i}$ are all unblocked paths from $x_{j}$ to $x_{i}$ in $\mathcal{G}$ , and $(l,m)\in p_{j\leftrightarrow i}$ indicates that the directed edge $(l,m)$ is part of the path $p_{j\leftrightarrow i}$ .

We give a proof in Appendix C.2. Since the implied SCMs of linear standardized SCMs and iSCMs are linear SCMs, the setting of Lemma 1 applies precisely to the SCM forms of both models. Thus, Lemma 1 enables us to study the covariances in standardized SCMs and iSCMs, and as we show next, derive conditions for the (non)identifiability of their DAGs $\mathcal{G}$ from the observational distribution.
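Lemma 1 can be checked numerically on a small example. The sketch below uses a hypothetical 3-node DAG with edges $0\to 1$, $0\to 2$, and $1\to 2$, and picks the noise variances so that every variable has unit variance; the weights $a$, $b$, $c$ are arbitrary illustrative values. The only path between two nodes that contains a collider is $0\to 2\leftarrow 1$, so it is the only blocked path and does not contribute.

```python
import numpy as np

a, b, c = 0.5, 0.3, 0.4            # weights of edges 0->1, 0->2, 1->2 (assumed values)
W = np.zeros((3, 3))
W[0, 1], W[0, 2], W[1, 2] = a, b, c

# Noise variances chosen so that Var[x_i] = 1 for every variable.
noise_var = np.array([1.0, 1.0 - a**2, 1.0 - (b**2 + c**2 + 2 * a * b * c)])

# x = W^T x + eps  =>  x = M eps with M = (I - W^T)^{-1}.
M = np.linalg.inv(np.eye(3) - W.T)
cov = M @ np.diag(noise_var) @ M.T

# Sum over unblocked paths of the product of weights (Lemma 1):
path_sums = {
    (0, 1): a,            # 0->1          (0->2<-1 is blocked by the collider at 2)
    (0, 2): b + a * c,    # 0->2  and  0->1->2
    (1, 2): c + a * b,    # 1->2  and  1<-0->2
}
for (i, j), value in path_sums.items():
    print((i, j), cov[i, j], value)
```

The two printed numbers in each row should agree up to floating-point error.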

4 Analysis

In this section, we give two theoretical results that support the suitability of iSCMs over standard SCMs for causal discovery benchmarking. First, we prove the general case of Figure 1: contrary to standardized SCMs, iSCMs do not degenerate towards deterministic implied SCM mechanisms in deep graphs. Moreover, we prove that the DAGs of linear iSCMs cannot be identified beyond their MEC, assuming the DAG is a forest, even if the support of $\bf{w}$ is known. Crucially, we also show that this is not generally true for standardized SCMs. This suggests that algorithms can less easily game benchmarks based on linear iSCMs when knowing the data-generating process. For all results, we consider linear SCMs (1) with zero-mean additive noise and equal noise variances. All results are at the population level, so we assume that $p(\bf{x}^{s})$ or $p(\widetilde{\bf{x}})$ is known. Proofs are given in Appendix C.

4.1 Behavior with Increasing Graph Depth

Standardized SCMs tend towards increasing correlations between adjacent nodes down the topological ordering. This correlation artifact makes standardized SCMs problematic for benchmarking, because it may not be a property we expect to underlie real data. Reisach et al., (2024) show, under some assumptions on $\bf{w}$ , that the dependencies in standardized SCMs become deterministic with increasing graph depth. This implies that any exogenous variation $\varepsilon_{i}$ vanishes lower down in the system. Unless prior domain knowledge leads us to assume this holds in applications of interest, it may not be desirable to implicitly bias structure learning benchmarks towards such systems. For example, if the causal ordering represents time (Pamfil et al.,, 2020), the mechanisms of standardized SCMs are unable to model or characterize time-invariant or stable processes. Moreover, if we expect causal mechanisms to be independent (Schölkopf,, 2022), the qualitative behavior of a causal mechanism should not provide information about its position in the topological ordering relative to other mechanisms, as it would in SCMs. Reisach et al., (2024) show that baselines like $\operatorname{R^{2}}$ -SortnRegress can perform competitively on benchmarks by exploiting this artifact (Section 2).

iSCMs do not tend towards determinism with increasing graph depth (Figure 1(b)). In standardized SCMs, the correlations increase downstream, because the marginal variances of the underlying SCM increase with node depth, while the variance scale of the noise is fixed (Reisach et al., 2021). Thus, for large $i$, the variance scale of $x_{i-1}$ becomes large relative to the scale of $\varepsilon_{i}$, and the absolute correlation of $x_{i}$ and $x_{i-1}$ tends towards $1$. Since $x^{s}_{i}$ and $x^{s}_{i-1}$ are just standardized versions of these variables, they maintain the same correlation. iSCMs avoid this by standardizing internally, which scales the variance of any parents in a mechanism $f_{i}$ to $1$, modulating the relative variance of $\varepsilon_{i}$ and $\smash{\mathbf{x}_{\mathrm{pa}(i)}}$. In the following, we formalize this result for general graphs by bounding the fraction of cause-explained variance (CEV). The fraction of CEV for $x_{i}$ is the proportion of $\operatorname{Var}[x_{i}]$ explained by its causal parents and given by

$$\operatorname{CEV_{f}}[x_{i}]=1-\frac{\operatorname{Var}[x_{i}-\mathbb{E}[x_{i}|\mathbf{x}_{\mathrm{pa}(i)}]]}{\operatorname{Var}[x_{i}]}\,. \tag{4}$$

The following result shows that we can bound the fraction of CEV for any variable in a linear iSCM:

Theorem 2 (Bound on $\smash{\operatorname{CEV_{f}}}$ in linear iSCMs).

Let $\bf{x}$ be modeled by a linear iSCM (1) with DAG $\mathcal{G}$ and additive noise of equal variances $\operatorname{Var}[\varepsilon_{i}]=\smash{\sigma^{2}}$ . Suppose any node in $\mathcal{G}$ has at most $m$ parents and $w=\max_{i,j\in\{1,...,d\}}\lvert w_{i,j}\rvert$ . Then, for any $i\in\{1,...,d\}$ , the fraction of CEV for $\widetilde{x}_{i}$ is bounded as

$$\operatorname{CEV_{f}}[\widetilde{x}_{i}]\leq 1-\frac{\sigma^{2}}{m^{2}w^{2}+\sigma^{2}}\,.$$

Since the fraction of CEV is bounded, iSCMs are guaranteed not to collapse to determinism in large systems, alleviating several of the concerns with (standardized) SCMs discussed above.
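The bound of Theorem 2 can be checked at the population level for a random linear iSCM. The sketch below is an illustration under assumptions made here: it represents each observed variable as a linear combination of the noise variables, $\widetilde{\mathbf{x}}=A\bm{\varepsilon}$, and uses that the residual of $\widetilde{x}_{i}$ given its parents is $\varepsilon_{i}$ divided by the standard deviation of the latent $x_{i}$. The function names and the random graph are hypothetical.

```python
import numpy as np

def iscm_cev(W, order, noise_var=1.0):
    """Fraction of cause-explained variance of each observed variable of a linear iSCM.

    W[i, j] is the weight of edge i -> j; `order` is a topological ordering.
    """
    d = W.shape[0]
    A = np.zeros((d, d))                 # x_tilde = A @ eps
    cev = np.zeros(d)
    for j in order:
        b = W[:, j] @ A                  # contribution of the standardized parents
        b[j] += 1.0                      # plus the node's own noise
        latent_var = noise_var * (b @ b)         # Var of the latent x_j
        A[j] = b / np.sqrt(latent_var)           # internal standardization
        cev[j] = 1.0 - noise_var / latent_var    # residual variance is sigma^2 / Var[x_j]
    return cev

def cev_bound(W, noise_var=1.0):
    """Right-hand side of the bound in Theorem 2."""
    m = (W != 0).sum(axis=0).max()       # maximum number of parents
    w = np.abs(W).max()                  # maximum weight magnitude
    return 1.0 - noise_var / (m**2 * w**2 + noise_var)

rng = np.random.default_rng(1)
d = 8
mask = np.triu(rng.random((d, d)) < 0.4, k=1)     # random DAG, ordering 0..d-1
W = mask * rng.uniform(0.5, 2.0, (d, d)) * rng.choice([-1.0, 1.0], (d, d))
print(iscm_cev(W, list(range(d))).round(3), "<=", round(cev_bound(W), 3))
```

Root nodes have a CEV fraction of $0$, and no node can exceed the bound regardless of its depth in the graph.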

4.2 Identifiability

Figure 3(a): DAGs with edge weights $\alpha$ and $\beta$.

Figure 1(a) illustrates that the pairwise correlations in SCMs over chain graphs depend on the position in the topological ordering. This can allow algorithms like $\operatorname{R^{2}}$-SortnRegress to infer the graph. By contrast, Figure 1(b) shows that iSCMs do not exhibit this pattern, so the correlations between variables do not help identify any part of the system.

In the following, we formalize this phenomenon for forests, that is, all DAGs with acyclic skeletons (Section 2). Specifically, we prove two results concerning the identifiability of the DAG $\mathcal{G}$ from the observational distribution, for standardized SCMs and iSCMs. This makes our finding the first identifiability result for standardized SCMs. While not every DAG is a forest, DAGs have forests as subgraphs and resemble forests as sparsity increases, thus providing us with intuition for generally sparse systems (e.g., Alon and Spencer,, 2016, Chapter 11).

Our first result leverages the observation that, for standardized SCMs, many DAGs in an MEC are infeasible given $p(\bf{x}^{s})$ when their edge directions are not consistent with the direction of increasing absolute covariance. To illustrate this idea, suppose our goal is to distinguish between the DAGs in the MEC $\smash{\tilde{\mathcal{G}}}=\{\mathrm{(i)},\mathrm{(ii)},\mathrm{(iii)}\}$ in Figure 3(a). We overload notation and denote the weights of the edges $\alpha$ and $\beta$ regardless of orientation. For standardized SCMs, we can apply Lemma 1 to the implied SCM of graph $\mathrm{(i)}$ to obtain the covariances

$$\operatorname{Cov}[x_{1}^{s},x_{2}^{s}]=\tfrac{\alpha}{\sqrt{\alpha^{2}+1}}\quad\text{and}\quad\operatorname{Cov}[x_{2}^{s},x_{3}^{s}]=\beta\sqrt{\tfrac{\alpha^{2}+1}{\beta^{2}(\alpha^{2}+1)+1}}\,.$$

See Appendix C.4.1. Together, both expressions imply that standardized SCMs with DAG $\mathrm{(i)}$ satisfy

$$\lvert\operatorname{Cov}[x_{1}^{s},x_{2}^{s}]\rvert<\lvert\operatorname{Cov}[x_{2}^{s},x_{3}^{s}]\rvert\quad\Longleftrightarrow\quad\tfrac{\alpha^{2}}{\alpha^{2}+1}<\beta^{2}\,. \tag{5}$$

If $\lvert\beta\rvert\geq 1$, then the right-hand side of Equation 5 always holds, since $\tfrac{\alpha^{2}}{\alpha^{2}+1}<1$. In this case, the absolute covariance increases from $x_{1}$ to $x_{3}$ in all standardized SCMs with DAG $\mathrm{(i)}$. By symmetry, the covariance in SCMs with DAG $\mathrm{(iii)}$ increases from $x_{3}$ to $x_{1}$ when $\lvert\alpha\rvert\geq 1$. Therefore, if both weights have magnitude at least $1$, the absolute covariance increases downstream in all SCMs of $\mathrm{(i)}$ and $\mathrm{(iii)}$. This implies that, among $\mathrm{(i)}$ and $\mathrm{(iii)}$, only the DAG whose edges align with the covariance ordering in $p(\bf{x}^{s})$ can induce $p(\bf{x}^{s})$. Regardless, the DAG $\mathrm{(ii)}$ remains plausible. We can extend the intuition of this 3-variable example to identify almost all edges in any forest MEC:

Theorem 3 (Partial identifiability of standardized linear SCMs with forest DAGs).

Let $\bf{x}^{s}$ be modeled by a standardized linear SCM (1) with forest DAG $\mathcal{G}$ , additive noise of equal variances $\operatorname{Var}[\varepsilon_{i}]=\sigma^{2}$ , and $\left\lvert w_{i,j}\right\rvert>1$ for all $i\in\text{pa}(j)$ . Then, given $p(\bf{x}^{s})$ and the partially directed graph $\smash{\tilde{\mathcal{G}}}$ representing the MEC of $\mathcal{G}$ , we can identify all but at most one edge of the true DAG $\mathcal{G}$ in each undirected connected component of the MEC $\smash{\tilde{\mathcal{G}}}$ .
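The 3-variable example that motivates Theorem 3 can be verified numerically. The sketch below assumes DAG $\mathrm{(i)}$ is the chain $x_{1}\to x_{2}\to x_{3}$ with unit noise variances, computes the population correlations of the standardized SCM from its covariance matrix, and compares them with the closed-form expressions and with the equivalence in Equation 5. The function names are hypothetical.

```python
import numpy as np

def standardized_chain_correlations(alpha, beta):
    """Population Cov[x1^s, x2^s] and Cov[x2^s, x3^s] for x1 -> x2 -> x3."""
    W = np.zeros((3, 3))
    W[0, 1], W[1, 2] = alpha, beta
    M = np.linalg.inv(np.eye(3) - W.T)   # x = M eps, unit noise variances
    cov = M @ M.T
    s = np.sqrt(np.diag(cov))
    corr = cov / np.outer(s, s)          # covariance of the standardized variables
    return corr[0, 1], corr[1, 2]

def closed_form(alpha, beta):
    """The two closed-form covariances stated before Equation 5."""
    u = alpha**2 + 1.0
    return alpha / np.sqrt(u), beta * np.sqrt(u / (beta**2 * u + 1.0))

def covariance_increases_downstream(alpha, beta):
    c12, c23 = standardized_chain_correlations(alpha, beta)
    return abs(c12) < abs(c23)

for alpha, beta in [(0.5, 0.3), (2.0, 0.8), (2.0, 1.0), (-1.5, -2.0)]:
    print(alpha, beta,
          standardized_chain_correlations(alpha, beta),
          closed_form(alpha, beta),
          covariance_increases_downstream(alpha, beta))
```

For every pair with $\lvert\beta\rvert\geq 1$, the last column should be `True`, which is what lets the covariance ordering reveal the edge directions.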

Our proof of Theorem 3 considers each undirected component separately from the rest of the MEC $\smash{\tilde{\mathcal{G}}}$ . Hence, the identifiability result extends to undirected tree components of arbitrary, non-forest MECs as well. Theorem 3 shows that, when using standardized SCM data for benchmarking, algorithms can use pairwise correlations to orient additional edges correctly. The weights assumption of Theorem 3 is relevant to causal discovery benchmarking, because weights are often sampled i.i.d. from intervals bounded away from $0$ (Section 2). Hence, empirical evaluations may render standardized linear SCMs identifiable only through the design of their weights distribution. In the following, we show that, under similar conditions, iSCMs are more difficult to identify from their MEC. In the 3-variable example above, we can show that the observational distribution of iSCMs is the same for all DAGs $\mathrm{(i)}$ , $\mathrm{(ii)}$ , and $\mathrm{(iii)}$ when the weights $\alpha$ and $\beta$ are shared over the corresponding edges in the MEC (Figure 3(b); see Appendix C.4). This result generalizes to forests:

Theorem 4 (Nonidentifiability of linear Gaussian iSCMs with forest DAGs).

Let $\widetilde{\bf{x}}$ be modeled by a linear iSCM (1) with forest DAG $\mathcal{G}$ and additive Gaussian noise of equal variances $\operatorname{Var}[\varepsilon_{i}]$ . Then, for every DAG $\mathcal{G}^{\prime}$ in the MEC of $\mathcal{G}$ , there exists a linear iSCM with DAG $\mathcal{G}^{\prime}$ that has the same observational distribution as $\widetilde{\bf{x}}$ , the same noise variances, and the same weights on the corresponding edges in the MEC.

Our proof consists of showing that the covariance matrices of these systems are equal. For linear Gaussian iSCMs, this then implies that their observational distributions are identical. Theorem 4 thus shows that additional knowledge of the weight distribution in a benchmark does not allow identifying any additional edges beyond the MEC. By contrast, Theorem 3 shows that, for standardized SCMs, lower-bounding the weight magnitudes is sufficient for identifying most of the graph from its MEC. Without standardization, $\mathcal{G}$ is fully identified from its observational distribution under even weaker assumptions (Peters and Bühlmann,, 2014). Importantly, Theorem 4 does not generalize to arbitrary graphs beyond forests. Appendix C.4 provides a counterexample involving a 3-node skeleton with a cycle. As we study in the next section, this implies that nontrivial causal structure can still be learned from iSCM data. However, DAGs in benchmarks are often sparse, so we expect the implications of our identifiability results to capture relevant parts of empirical phenomena in benchmarking settings.
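The contrast between Theorems 3 and 4 can be illustrated on the 3-variable MEC. The sketch below assumes that $\mathrm{(i)}$ is $x_{1}\to x_{2}\to x_{3}$, $\mathrm{(ii)}$ is $x_{1}\leftarrow x_{2}\to x_{3}$, and $\mathrm{(iii)}$ is $x_{1}\leftarrow x_{2}\leftarrow x_{3}$, with weight $\alpha$ on the edge between $x_{1}$ and $x_{2}$ and $\beta$ on the edge between $x_{2}$ and $x_{3}$. It computes population covariances only; the helper names and the values of $\alpha$ and $\beta$ are illustrative choices.

```python
import numpy as np

def iscm_covariance(W, order, noise_var=1.0):
    """Population covariance of the observed variables of a linear iSCM."""
    d = W.shape[0]
    A = np.zeros((d, d))                         # x_tilde = A @ eps
    for j in order:
        b = W[:, j] @ A
        b[j] += 1.0
        A[j] = b / np.sqrt(noise_var * (b @ b))  # divide by std of the latent x_j
    return noise_var * A @ A.T

def standardized_scm_correlation(W, noise_var=1.0):
    """Population covariance of a post-hoc standardized linear SCM."""
    d = W.shape[0]
    M = np.linalg.inv(np.eye(d) - W.T)
    cov = noise_var * M @ M.T
    s = np.sqrt(np.diag(cov))
    return cov / np.outer(s, s)

def three_node_dags(alpha, beta):
    """Weight matrices (W[i, j]: edge i -> j) and topological orders of the MEC."""
    g1 = np.zeros((3, 3)); g1[0, 1], g1[1, 2] = alpha, beta   # (i)
    g2 = np.zeros((3, 3)); g2[1, 0], g2[1, 2] = alpha, beta   # (ii)
    g3 = np.zeros((3, 3)); g3[1, 0], g3[2, 1] = alpha, beta   # (iii)
    return [(g1, [0, 1, 2]), (g2, [1, 0, 2]), (g3, [2, 1, 0])]

dags = three_node_dags(alpha=1.5, beta=1.2)
for name, (W, order) in zip(["(i)", "(ii)", "(iii)"], dags):
    print(name, "iSCM:", iscm_covariance(W, order).round(3).tolist())
    print(name, "standardized SCM:", standardized_scm_correlation(W).round(3).tolist())
```

The iSCM covariance matrices should coincide across the three DAGs, while those of the standardized SCMs should differ between $\mathrm{(i)}$ and $\mathrm{(iii)}$.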

5 Experimental Results

Our previous analyses suggest that iSCMs address shortcomings of naive standardization, in particular, when sampling each $f_{i}$ and $\varepsilon_{i}$ from the same distribution, as commonly done in benchmarking. In this section, we now provide evidence that iSCMs do not contain the covariance artifacts of SCMs. Moreover, we benchmark the SortnRegress baselines (Section 2) and two representative structure learning algorithms to gain insights into how their performance varies when benchmarked on standardized SCMs and iSCMs. Appendix E provides all details of the experimental setup.

5.1 $\operatorname{R^{2}}$ -Sortability

Reisach et al., (2024) introduce the $\operatorname{R^{2}}$ -sortability metric to evaluate the correlation artifact underlying a dataset. $\operatorname{R^{2}}$ -sortability measures the correlation between the variables’ causal ordering and the $\operatorname{R^{2}}$ coefficients obtained from regressing each variable onto all others (Appendix D.2). The metric gives rise to the $\operatorname{R^{2}}$ -SortnRegress baseline described in Section 2. Reisach et al., (2024) show that $\operatorname{R^{2}}$ -sortability in SCMs is driven by an interplay of graph connectivity and the weight distribution of $f_{i}$ .
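The following sketch shows one way such a statistic could be computed. It assumes the path-based form of sortability (the fraction of directed paths along which a score increases, with ties counted as one half); Appendix D.2 remains the authoritative definition. The function names `r2_coefficients` and `sortability` and the convention that `adjacency[j, i] != 0` encodes the edge $v_{j}\rightarrow v_{i}$ are hypothetical choices for this illustration.

```python
import numpy as np


def r2_coefficients(X):
    """R^2 obtained from regressing each column of X onto all other columns.

    Uses the identity that the residual variance of variable i given the
    others equals 1 / P_ii, where P is the inverse covariance matrix.
    """
    cov = np.cov(X, rowvar=False)
    prec = np.linalg.inv(cov)
    return 1.0 - 1.0 / (np.diag(cov) * np.diag(prec))


def sortability(scores, adjacency, tol=1e-9):
    """Fraction of directed paths j -> ... -> i with scores[i] > scores[j].

    Assumed path-based form: node pairs are counted once per path length,
    and ties contribute 1/2.
    """
    E = (adjacency != 0).astype(float)
    d = E.shape[0]
    diff = scores[None, :] - scores[:, None]  # diff[j, i] = scores[i] - scores[j]
    paths = E.copy()  # paths[j, i] > 0 iff a directed path of the current length exists
    total, ordered = 0.0, 0.0
    for _ in range(d - 1):
        has_path = paths > 0
        total += has_path.sum()
        ordered += (has_path & (diff > tol)).sum()
        ordered += 0.5 * (has_path & (np.abs(diff) <= tol)).sum()
        paths = paths @ E  # extend all paths by one edge
    return ordered / total if total > 0 else float("nan")
```

Under this assumed form, passing `r2_coefficients(X)` as `scores` gives an $\operatorname{R^{2}}$-based value, and passing the marginal variances gives a $\operatorname{Var}$-based one; a value near $0.5$ indicates that the score carries no information about the causal ordering.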

Figure 4 summarizes the $\operatorname{R^{2}}$ -sortability statistics for linear SCM and iSCM data. We write $\operatorname{ER}(d,k)$ and $\operatorname{SF}(d,k)$ to denote Erdős-Rényi and scale-free graphs of size $d$ and (expected) degree $k$ , respectively. We find that iSCMs generate datasets that are not $\operatorname{R^{2}}$ -sortable ( $\operatorname{R^{2}}$ -sortability $\approx$ $0.5$ ) and thus artifact-free while sampling over common graph structures (e.g., Zheng et al.,, 2018; Yu et al.,, 2019; Reisach et al.,, 2021). Conversely, standardized SCMs generate datasets that are strongly $\operatorname{R^{2}}$ -sortable ( $\lvert\text{$\operatorname{R^{2}}$-sortability}-0.5\rvert\gg 0$ ). Since $\operatorname{R^{2}}$ -sortability can be exploited for causal discovery, iSCM data serves as a test for evaluating whether algorithms utilize any data properties beyond the association between $\operatorname{R^{2}}$ and the causal ordering in SCMs. Our results do not exclude the possibility of iSCM configurations that still produce $\operatorname{R^{2}}$ -sortable datasets. However, we show that, for commonly-used $\mathcal{G}$ , $\mathcal{P}_{\bm{\varepsilon}}$ , and $\bf{w}$ , iSCM datasets are not $\operatorname{R^{2}}$ -sortable with high probability. Appendix F provides results for denser graphs.

5.2 Structure Learning

Under the same weight and noise distributions, standardized SCMs and iSCMs have different implied SCMs and generate qualitatively different datasets. Here, we study how this affects causal structure learning in practice. We evaluate $\operatorname{Var}$ - and $\operatorname{R^{2}}$ -SortnRegress (SR) and a baseline using random orderings (Reisach et al.,, 2021, 2024). In addition, we evaluate representative algorithms from two orthogonal approaches to learning structure from (co)variance information. Notears by Zheng et al., (2018) leverages continuous optimization to minimize an MSE loss, which is affected by noise scaling (Loh and Bühlmann,, 2014; Seng et al.,, 2024). Avici by Lorch et al., (2022) predicts graphs using a model pretrained on simulated data and is thus optimized to exploit any artifacts that improve predictive accuracy. To investigate its susceptibility to artifacts, we evaluate the public model checkpoints trained on standardized SCMs.

Figure 5(a) summarizes the results for linear and nonlinear systems. Here, the nonlinear mechanisms $f_{i}$ are samples from a Gaussian process with squared exponential kernel. As expected, $\operatorname{Var}$ -SortnRegress performs best when SCMs are not standardized. Likewise, $\operatorname{R^{2}}$ -SortnRegress performs better on SCMs and standardized SCMs, as iSCMs have $\operatorname{R^{2}}$ -sortability close to $0.5$ (Section 5.1). Avici shows the same trend, suggesting it may indeed be exploiting the correlation artifacts present in its training distribution. Like Reisach et al., (2021), we find that Notears performs best on unstandardized data. However, and more interestingly, Notears also performs better on iSCMs than on standardized SCMs, especially in linear and larger systems. As we investigate next, this gap may be explained by the fact that the implied models of standardized SCMs violate the assumptions of Notears more strongly than iSCMs. Overall, performance differences are more pronounced for linear systems, where the downstream variance accumulation in SCMs is unbounded. Appendix F reports the results in terms of structural Hamming distance (SHD) and different weight ranges.

Properties of the implied SCM

When standardizing SCM data, the implied SCM corresponds to the SCM that could have generated the observations. Therefore, algorithms assuming that unstandardized SCMs generated the data will be susceptible to any assumption violations of the implied SCM, such as assumptions about the exogenous noise. Figure 5(b) (bottom) shows the distribution of inverse implied noise scales $1/\theta_{i}^{2}$ for the variables of the implied models (see Equation 2). Since $\operatorname{Var}[\varepsilon_{i}]=1$ in our experiments, these inverse squared noise scales are equal to the inverse variances of the full additive noise terms. We find that standardized SCMs induce inverse noise scales that are orders of magnitude greater than those of iSCMs. This distribution is essentially the footprint of the determinism in the depth limit discussed in Section 4.1. The modes at $1/\theta_{i}^{2}=1$ and at $1/\theta_{i}^{2}>1$ in the iSCM plot correspond to root and non-root nodes, respectively.
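To make the difference in noise scales concrete, the sketch below computes $1/\theta_{i}^{2}$ in closed form for a simple chain $x_{1}\rightarrow\dots\rightarrow x_{d}$ with a shared weight $w$ and unit noise variance. The chain structure, the shared weight, and the function name are assumptions made for this illustration and do not reproduce the experimental setup.

```python
import numpy as np


def inverse_noise_scales_chain(w, depth, standardize_internally):
    """Return 1/theta_i^2 = Var[x_i] along a chain with unit noise variance.

    standardize_internally=False: standardized SCM (theta_i = 1/s_i of the raw SCM).
    standardize_internally=True:  iSCM (each child sees a unit-variance parent).
    """
    inv_theta_sq = []
    parent_var = 0.0  # the root has no parent
    for _ in range(depth):
        var_x = w ** 2 * parent_var + 1.0  # Var[x_i] = w^2 Var[parent] + Var[eps_i]
        inv_theta_sq.append(var_x)
        # iSCM: the parent is standardized before being passed on; SCM: it is not
        parent_var = 1.0 if standardize_internally else var_x
    return np.array(inv_theta_sq)


print(inverse_noise_scales_chain(2.0, 4, standardize_internally=False))
print(inverse_noise_scales_chain(2.0, 4, standardize_internally=True))
```

In this toy chain, the standardized SCM values grow geometrically with depth when $|w|>1$, whereas the iSCM values take only two levels: $1$ at the root and $w^{2}+1$ at every non-root node.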

Figure 5(b) (top) shows the performance of Notears when isolating the noise properties of the implied models from the fact that standardized SCMs and iSCMs are not $\operatorname{Var}$ -sortable. For this, we construct SCMs that have the marginal variances (and $\operatorname{Var}$ -sortability, here $0.99$ on average) of unstandardized SCMs but the noise variances of the implied models by correcting their weights (see Appendix E.5). Notears performs better in such systems, suggesting that (i) the noise statistics may indeed explain the performance difference on iSCM data, and (ii) $\operatorname{Var}$ -sortability may not be the only reason why Notears performs significantly worse on standardized data (Reisach et al.,, 2021). This sheds light on existing benchmarking results in which MSE-based algorithms perform below expectations, even though the benchmarks were perhaps not intended to evaluate the algorithms under model mismatch (e.g., Reisach et al.,, 2021; Kaiser and Sipos,, 2021). For the MSE loss, Loh and Bühlmann, (2014) and Seng et al., (2024) show that smaller ratios of noise variances increase the magnitude of weights required for the true DAG to be the unique minimizer. The MSE loss ultimately does not account for the inverse variance factor in the Gaussian noise likelihood. Overall, the statistics of the implied models of standardized SCMs are empirically further from SCMs with equal noise variances than their iSCM counterparts.

6 Conclusions

We describe the iSCM, a one-line modification of the SCM that modulates the scale of interaction between the causal mechanism $f_{i}$ and noise $\varepsilon_{i}$ at each variable $x_{i}$ . Through several theoretical and experimental results, we study its properties in relation to standard SCMs as well as the models they imply after standardization. To conclude, we highlight the following key takeaways:

Standardizing during the generative process removes sortability artifacts.

When the functions $f_{i}$ and the noise $\varepsilon_{i}$ are, for example, sampled i.i.d. for each variable $x_{i}$ , SCMs exhibit artifacts that are not removed when shifting and scaling the generated data. Our results in Section 5 show that iSCMs are effective at removing $\operatorname{Var}$ - and $\operatorname{R^{2}}$ -sortability. This makes iSCMs a useful complement to structure learning benchmarks with SCMs, enabling a specific evaluation of the ability of algorithms to transfer to real-world settings that do not exhibit $\operatorname{R^{2}}$ artifacts. Despite the removed sortability artifacts, causal discovery algorithms are able to infer nontrivial structure from iSCM data (Figure 5).

Standardizing post-hoc can lead to partial identifiability and degenerate implied SCMs.

Scaling the units of SCM data is not innocuous. Theorem 3 shows that mild knowledge of the distribution of $f_{i}$ can identify edges in standardized SCMs that are typically not identifiable from observational data. To our knowledge, our result is the first concerning the identifiability of $\mathcal{G}$ from the standardized observational distribution of SCMs. This may make benchmarks, where similar assumptions on $f_{i}$ often hold, trivial under standardized SCMs. Moreover, Figure 5(b) shows that standardized SCMs can collapse to modeling near-zero exogenous noise. Theorems 2 and 4 demonstrate that neither property appears in the analogous iSCMs. Ultimately, (non)identifiability may be either a feature or a bug, depending on whether assumptions are verifiable in practice or a priori known during evaluation.

iSCMs are stable and scale-free, making them useful beyond benchmarking.

Beyond data generation, the stable generative process of iSCMs can make them useful for modeling, e.g., large or temporal systems (e.g., Kilian,, 2013; Pamfil et al.,, 2020). In iSCMs, the scales of a causal mechanism $f_{i}$ and of its unexplained variation $\varepsilon_{i}$ are both unit-less and independent of the variable's position in the causal ordering (Section 3). Since each iSCM implies a standard SCM, iSCMs can be viewed as a reparameterization of SCMs that enables modeling and learning the functions $f_{i}$ on the same scale, e.g., under a shared prior or level of regularization. Conceptually, iSCMs are related to batch normalization (Ioffe and Szegedy,, 2015), a technique that stabilizes the optimization of neural networks, which compose sequences of functions like SCMs, by adding internal standardization. Overall, these properties may make the iSCM a useful structural equation model beyond the benchmarking problem studied here.

Acknowledgments and Disclosure of Funding

This research was supported by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program grant agreement no. 815943 and the Swiss National Science Foundation under NCCR Automation, grant agreement 51NF40 180545. This work was also supported by the German Federal Ministry of Education and Research (BMBF): Tübingen AI Center, FKZ: 01IS18039B, and by the Machine Learning Cluster of Excellence, EXC number 2064/1, project number 390727645.

References

Alon and Spencer, (2016) Alon, N. and Spencer, J. H. (2016). The probabilistic method. John Wiley & Sons.
Andersson et al., (1997) Andersson, S. A., Madigan, D., and Perlman, M. D. (1997). A characterization of Markov equivalence classes for acyclic digraphs. The Annals of Statistics, 25(2):505–541.
Barabási and Albert, (1999) Barabási, A.-L. and Albert, R. (1999). Emergence of scaling in random networks. Science, 286(5439):509–512.
Dibaeinia and Sinha, (2020) Dibaeinia, P. and Sinha, S. (2020). SERGIO: a single-cell expression simulator guided by gene regulatory networks. Cell systems, 11(3):252–271.
Erdős and Rényi, (1959) Erdős, P. and Rényi, A. (1959). On random graphs. Publicationes Mathematicae, 6:290–297.
Imbens and Rubin, (2015) Imbens, G. W. and Rubin, D. B. (2015). Causal inference in statistics, social, and biomedical sciences. Cambridge university press.
Ioffe and Szegedy, (2015) Ioffe, S. and Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456. PMLR.
Kaiser and Sipos, (2021) Kaiser, M. and Sipos, M. (2021). Unsuitability of NOTEARS for causal graph discovery. arXiv preprint arXiv:2104.05441.
Kilian, (2013) Kilian, L. (2013). Structural vector autoregressions. In Handbook of research methods and applications in empirical macroeconomics, pages 515–554. Edward Elgar Publishing.
Lachapelle et al., (2020) Lachapelle, S., Brouillard, P., Deleu, T., and Lacoste-Julien, S. (2020). Gradient-based neural DAG learning. In International Conference on Learning Representations.
Loh and Bühlmann, (2014) Loh, P.-L. and Bühlmann, P. (2014). High-dimensional learning of linear causal networks via inverse covariance estimation. The Journal of Machine Learning Research, 15(1):3065–3105.
Lorch et al., (2022) Lorch, L., Sussex, S., Rothfuss, J., Krause, A., and Schölkopf, B. (2022). Amortized inference for causal structure learning. Advances in Neural Information Processing Systems, 35:13104–13118.
Maathuis et al., (2010) Maathuis, M. H., Colombo, D., Kalisch, M., and Bühlmann, P. (2010). Predicting causal effects in large-scale systems from observational data. Nature methods, 7(4):247–248.
Meek, (1995) Meek, C. (1995). Causal inference and causal explanation with background knowledge. In Proceedings of the Eleventh conference on Uncertainty in artificial intelligence, pages 403–410.
Mooij et al., (2020) Mooij, J. M., Magliacane, S., and Claassen, T. (2020). Joint causal inference from multiple contexts. The Journal of Machine Learning Research, 21(1):3919–4026.
Ng et al., (2020) Ng, I., Ghassami, A., and Zhang, K. (2020). On the role of sparsity and DAG constraints for learning linear DAGs. Advances in Neural Information Processing Systems, 33:17943–17954.
Pamfil et al., (2020) Pamfil, R., Sriwattanaworachai, N., Desai, S., Pilgerstorfer, P., Georgatzis, K., Beaumont, P., and Aragam, B. (2020). DYNOTEARS: Structure learning from time-series data. In International Conference on Artificial Intelligence and Statistics, pages 1595–1605. PMLR.
Pearl, (2009) Pearl, J. (2009). Causality. Cambridge university press.
Peters and Bühlmann, (2014) Peters, J. and Bühlmann, P. (2014). Identifiability of Gaussian structural equation models with equal error variances. Biometrika, 101(1):219–228.
Peters et al., (2017) Peters, J., Janzing, D., and Schölkopf, B. (2017). Elements of causal inference: foundations and learning algorithms. The MIT Press.
Rahimi and Recht, (2007) Rahimi, A. and Recht, B. (2007). Random features for large-scale kernel machines. Advances in neural information processing systems, 20.
Reisach et al., (2021) Reisach, A., Seiler, C., and Weichwald, S. (2021). Beware of the simulated DAG! Causal discovery benchmarks may be easy to game. Advances in Neural Information Processing Systems, 34:27772–27784.
Reisach et al., (2024) Reisach, A., Tami, M., Seiler, C., Chambaz, A., and Weichwald, S. (2024). A scale-invariant sorting criterion to find a causal order in additive noise models. Advances in Neural Information Processing Systems, 36.
Runge et al., (2019) Runge, J., Bathiany, S., Bollt, E., Camps-Valls, G., Coumou, D., Deyle, E., Glymour, C., Kretschmer, M., Mahecha, M. D., Muñoz-Marí, J., et al. (2019). Inferring causation from time series in earth system sciences. Nature communications, 10(1):2553.
Sachs et al., (2005) Sachs, K., Perez, O., Pe’er, D., Lauffenburger, D. A., and Nolan, G. P. (2005). Causal protein-signaling networks derived from multiparameter single-cell data. Science, 308(5721):523–529.
Schaffter et al., (2011) Schaffter, T., Marbach, D., and Floreano, D. (2011). GeneNetWeaver: in silico benchmark generation and performance profiling of network inference methods. Bioinformatics, 27(16):2263–2270.
Schölkopf, (2022) Schölkopf, B. (2022). Causality for machine learning. In Probabilistic and causal inference: The works of Judea Pearl, pages 765–804.
Seng et al., (2024) Seng, J., Zečević, M., Dhami, D. S., and Kersting, K. (2024). Learning large DAGs is harder than you think: Many losses are minimal for the wrong DAG. In The Twelfth International Conference on Learning Representations.
Shimizu et al., (2006) Shimizu, S., Hoyer, P. O., Hyvärinen, A., Kerminen, A., and Jordan, M. (2006). A linear non-Gaussian acyclic model for causal discovery. Journal of Machine Learning Research, 7(10).
Shimizu et al., (2011) Shimizu, S., Inazumi, T., Sogawa, Y., Hyvärinen, A., Kawahara, Y., Washio, T., Hoyer, P. O., and Bollen, K. (2011). DirectLiNGAM: A direct method for learning a linear non-Gaussian structural equation model. Journal of Machine Learning Research, 12(Apr):1225–1248.
Squires and Uhler, (2022) Squires, C. and Uhler, C. (2022). Causal structure learning: A combinatorial perspective. Foundations of Computational Mathematics, 23(5):1781–1815.
Squires et al., (2022) Squires, C., Yun, A., Nichani, E., Agrawal, R., and Uhler, C. (2022). Causal structure discovery between clusters of nodes induced by latent factors. In Conference on Causal Learning and Reasoning, pages 669–687. PMLR.
Verma and Pearl, (2013) Verma, T. S. and Pearl, J. (2013). On the equivalence of causal models.
Villar et al., (2023) Villar, S., Hogg, D. W., Yao, W., Kevrekidis, G. A., and Schölkopf, B. (2023). Towards fully covariant machine learning. arXiv preprint arXiv:2301.13724.
Vowels et al., (2022) Vowels, M. J., Camgoz, N. C., and Bowden, R. (2022). D’ya like DAGs? A survey on structure learning and causal discovery. ACM Computing Surveys, 55(4):1–36.
Wienöbst et al., (2023) Wienöbst, M., Luttermann, M., Bannach, M., and Liskiewicz, M. (2023). Efficient enumeration of Markov equivalent DAGs. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 12313–12320.
Yu et al., (2019) Yu, Y., Chen, J., Gao, T., and Yu, M. (2019). DAG-GNN: DAG structure learning with graph neural networks. In International Conference on Machine Learning, pages 7154–7163. PMLR.
Zheng et al., (2018) Zheng, X., Aragam, B., Ravikumar, P. K., and Xing, E. P. (2018). DAGs with NO TEARS: Continuous optimization for structure learning. Advances in neural information processing systems, 31.
Zheng et al., (2020) Zheng, X., Dan, C., Aragam, B., Ravikumar, P., and Xing, E. (2020). Learning sparse nonparametric DAGs. In International Conference on Artificial Intelligence and Statistics, pages 3414–3425. PMLR.

Appendix A Implied Models

In this section, we describe how to express the assignments of the observed variables of standardized SCMs and iSCMs with a general additive noise mechanism

\displaystyle f_{i}(\mathbf{x},\varepsilon_{i})=f_{i}(\mathbf{x})+\varepsilon_{i}\,,

(6)

in the form of (SCM), while sharing the same causal graph $\mathcal{G}$ and exogenous noise variables $\bm{\varepsilon}$ . We obtain the SCM form by moving the standardization steps into the causal mechanisms by linearly rescaling $f_{i}$ and $\varepsilon_{i}$ , such that each observed variable is only a function of observed variables and the noise $\varepsilon_{i}$ . Throughout this work, the implied (SCM) model denotes the specific construction given in the following two subsections. For this, we assume that we can express the first two moments of the system in closed form. Similar to the main text, we overload notation for both standardized SCMs and iSCMs and write

\mu_{i}:=\mathds{E}[x_{i}]\quad\quad\text{and}\quad\quad s_{i}:=\sqrt{\operatorname{Var}[x_{i}]}\,.

We also derive analytic expressions for the weights of the implied models of linear iSCMs defined by Equation (1), which we later use in our proofs.

A.1 Implied Model of a Standardized SCM

Let $\bf{x}^{s}$ be modeled by (Standardized SCM) with causal mechanisms defined by Equation (6). We recall that $\bf{x}^{s}$ are the observations obtained after standardizing $\bf{x}$ . Thus, we can rearrange $x^{s}_{i}$ as

x_{i}=s_{i}x^{s}_{i}+\mu_{i}

and substitute every unstandardized variable $x_{i}$ by a function of its standardized parents $\mathbf{x}_{\mathrm{pa}(i)}^{s}$ as

x^{s}_{i}=\frac{x_{i}-\mu_{i}}{s_{i}}=\frac{f_{i}(\mathbf{x}_{\mathrm{pa}(i)})+\varepsilon_{i}-\mu_{i}}{s_{i}}=\frac{f_{i}(\mathbf{x}_{\mathrm{pa}(i)}^{s}\odot\bm{s}_{\mathrm{pa}(i)}+\bm{\mu}_{\mathrm{pa}(i)})-\mu_{i}}{s_{i}}+\frac{1}{s_{i}}\varepsilon_{i}\,,

where $\odot$ denotes elementwise multiplication, and $\bm{\mu}_{{\mathrm{pa}(i)}}$ and $\bm{s}_{{\mathrm{pa}(i)}}$ are the vectors of the parent means and standard deviations before standardization. Thus, the assignments of $\bf{x}^{s}$ in a standardized SCM can be written as the SCM given by

x^{s}_{i}=g^{s}_{i}(\mathbf{x}_{\mathrm{pa}(i)}^{s})+\theta^{s}_{i}\varepsilon_{i}\,,

with implied noise scales $\theta^{s}_{i}:=1/s_{i}$ and implied causal mechanisms

\displaystyle g^{s}_{i}(\mathbf{x}_{\mathrm{pa}(i)}^{s}):=\begin{cases}\displaystyle\frac{f_{i}(\mathbf{x}_{\mathrm{pa}(i)}^{s}\odot\bm{s}_{\mathrm{pa}(i)}+\bm{\mu}_{\mathrm{pa}(i)})-\mu_{i}}{s_{i}}&\text{if $i$ is a non-root variable, and}\\ \displaystyle\frac{f_{i}-\mu_{i}}{s_{i}}&\text{if $i$ is a root variable.}\end{cases}
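For intuition, consider the linear special case $f_{i}(\mathbf{x}_{\mathrm{pa}(i)})=\mathbf{w}_{i}^{T}\mathbf{x}_{\mathrm{pa}(i)}$ with zero-mean noise, an assumption made only for this illustration. Then $\mu_{i}=\mathbf{w}_{i}^{T}\bm{\mu}_{\mathrm{pa}(i)}$, the mean terms cancel, and the implied mechanism is again linear. Writing $w^{s}_{j,i}$ (a symbol introduced here) for the implied weight of the edge from $j$ to $i$, we obtain:

```latex
x^{s}_{i}
  = \sum_{j\in\mathrm{pa}(i)} \frac{w_{j,i}\,s_{j}}{s_{i}}\,x^{s}_{j}
  + \frac{1}{s_{i}}\,\varepsilon_{i}
\qquad\Longrightarrow\qquad
w^{s}_{j,i} = \frac{w_{j,i}\,s_{j}}{s_{i}}\,.
```

In this special case, the implied weights of a standardized SCM depend on the ratio of the parent and child standard deviations before standardization.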

A.2 Implied Model of an iSCM

Let $\widetilde{\bf{x}}$ be modeled by (iSCM) with causal mechanisms defined by Equation (6). In an iSCM, $\widetilde{\bf{x}}$ are the observed variables and $\bf{x}$ are the latent variables. We can express every observation $\widetilde{x}_{i}$ in terms of its observed parents $\mathbf{\widetilde{x}}_{\mathrm{pa}(i)}$ as

\widetilde{x}_{i}=\frac{x_{i}-\mu_{i}}{s_{i}}=\frac{f_{i}(\mathbf{\widetilde{x}}_{\mathrm{pa}(i)})+\varepsilon_{i}-\mu_{i}}{s_{i}}=\frac{f_{i}(\mathbf{\widetilde{x}}_{\mathrm{pa}(i)})-\mu_{i}}{s_{i}}+\frac{1}{s_{i}}\varepsilon_{i}\,.

Thus, the assignments of $\widetilde{\bf{x}}$ in an iSCM can be written as the SCM given by

\widetilde{x}_{i}=\widetilde{g}_{i}(\mathbf{\widetilde{x}}_{\mathrm{pa}(i)})+\widetilde{\theta}_{i}\varepsilon_{i}\,,

with implied noise scales $\widetilde{\theta}_{i}:=1/s_{i}$ and implied causal mechanisms

\displaystyle\widetilde{g}_{i}(\mathbf{\widetilde{x}}_{\mathrm{pa}(i)}):=\begin{cases}\displaystyle\frac{f_{i}(\mathbf{\widetilde{x}}_{\mathrm{pa}(i)})-\mu_{i}}{s_{i}}&\text{if $i$ is a non-root variable, and}\\ \displaystyle\frac{f_{i}-\mu_{i}}{s_{i}}&\text{if $i$ is a root variable.}\end{cases}

A.3 Weights of the Implied Model of a Linear iSCM

Here, we derive the analytical form for the mechanisms of the implied model of a linear iSCM with zero-centered, additive noise $\varepsilon_{i}$ . This iSCM is given by

x_{i}:=\mathbf{w}_{i}^{T}\mathbf{\widetilde{x}}_{\mathrm{pa}(i)}+\varepsilon_{i}\quad\quad\text{and}\quad\quad\widetilde{x}_{i}:=\frac{x_{i}}{\sqrt{\operatorname{Var}[x_{i}]}}\,,

where $\varepsilon_{i}$ satisfies $\mathds{E}[\varepsilon_{i}]=0$ and $\operatorname{Var}[\varepsilon_{i}]=\sigma_{i}^{2}$ . We can write the above as

\displaystyle\widetilde{x}_{i}=\frac{\mathbf{w}_{i}^{T}\mathbf{\widetilde{x}}_{\mathrm{pa}(i)}+\varepsilon_{i}}{\sqrt{\operatorname{Var}[x_{i}]}}=\frac{\sum_{j\in\mathrm{pa}(i)}w_{j,i}\widetilde{x}_{j}+\varepsilon_{i}}{\sqrt{\operatorname{Var}[x_{i}]}}=\sum_{j\in\mathrm{pa}(i)}\frac{w_{j,i}}{\sqrt{\operatorname{Var}[x_{i}]}}\,\widetilde{x}_{j}+\frac{1}{\sqrt{\operatorname{Var}[x_{i}]}}\varepsilon_{i}\,.

It follows that the implied SCM of a linear iSCM is also linear, with weights and noise variances given by

\widetilde{w}_{j,i}=\frac{w_{j,i}}{\sqrt{\operatorname{Var}[x_{i}]}}\quad\quad\text{and}\quad\quad\widetilde{\sigma}^{2}_{i}=\frac{\sigma^{2}_{i}}{\operatorname{Var}[x_{i}]}\,.

(7)

In the above, we can write the variance of $x_{i}$ explicitly as

$\displaystyle\operatorname{Var}[x_{i}]$	$\displaystyle=\operatorname{Var}\bigg[\sum_{j\in\mathrm{pa}(i)}w_{j,i}\widetilde{x}_{j}+\varepsilon_{i}\bigg]=\operatorname{Var}\bigg[\sum_{j\in\mathrm{pa}(i)}w_{j,i}\widetilde{x}_{j}\bigg]+\sigma_{i}^{2}$	(8)
	$\displaystyle\overset{(1)}{=}\sum_{k\in\mathrm{pa}(i)}\sum_{j\in\mathrm{pa}(i)}\operatorname{Cov}[w_{k,i}\widetilde{x}_{k},w_{j,i}\widetilde{x}_{j}]+\sigma_{i}^{2}$
	$\displaystyle\overset{(2)}{=}\sum_{k\in\mathrm{pa}(i)}\sum_{j\in\mathrm{pa}(i)}w_{k,i}w_{j,i}\operatorname{Cov}[\widetilde{x}_{k},\widetilde{x}_{j}]+\sigma_{i}^{2}\,,$

where (1) follows from Bienaymé’s identity and (2) from covariance being bilinear. Substituting the variance into the expressions for the weights and noise variances, we obtain

	$\displaystyle\widetilde{w}_{j,i}$	$\displaystyle=\frac{w_{j,i}}{\sqrt{\sum_{k\in\mathrm{pa}(i)}\sum_{j\in\mathrm{pa}(i)}w_{k,i}w_{j,i}\operatorname{Cov}[\widetilde{x}_{k},\widetilde{x}_{j}]+\sigma_{i}^{2}}}\,,$		(9)
	$\displaystyle\widetilde{\sigma}^{2}_{i}$	$\displaystyle=\frac{\sigma^{2}_{i}}{\sum_{k\in\mathrm{pa}(i)}\sum_{j\in\mathrm{pa}(i)}w_{k,i}w_{j,i}\operatorname{Cov}[\widetilde{x}_{k},\widetilde{x}_{j}]+\sigma_{i}^{2}}\,.$		(10)

Finally, by construction, the variables $\widetilde{\bf{x}}$ of an iSCM have unit marginal variances. Thus, when the parents of $\widetilde{x}_{i}$ are pairwise independent, Equation (9) simplifies to

\widetilde{w}_{j,i}=\frac{w_{j,i}}{\sqrt{\sum_{j\in\mathrm{pa}(i)}w_{j,i}^{2}+\sigma_{i}^{2}}}.

(11)

This independence condition always holds when the DAG $\mathcal{G}$ is a forest.
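The forest case can be checked numerically with a small sampler. The sketch below draws data from a linear iSCM and standardizes each variable with its empirical mean and standard deviation as it is generated; the function name `sample_linear_iscm`, the convention `W[j, i]` for the weight of the edge from $j$ to $i$, the Gaussian noise, and the assumption that nodes are indexed in topological order are all choices made for this illustration.

```python
import numpy as np


def sample_linear_iscm(W, sigma, n, rng):
    """Sample n observations from a linear iSCM with Gaussian noise.

    W[j, i] is the weight of edge j -> i, sigma[i] the noise standard
    deviation of node i. Nodes are assumed to be topologically ordered.
    Standardization uses empirical statistics, so the result only
    approximates the population-level iSCM.
    """
    d = W.shape[0]
    X_tilde = np.zeros((n, d))
    for i in range(d):
        # latent variable: linear function of the standardized parents plus noise
        x_i = X_tilde @ W[:, i] + sigma[i] * rng.standard_normal(n)
        # internal standardization before the variable is passed downstream
        X_tilde[:, i] = (x_i - x_i.mean()) / x_i.std()
    return X_tilde


# Tree with edges 0 -> 1 (weight 2.0) and 0 -> 2 (weight -0.5), unit noise
W = np.zeros((3, 3))
W[0, 1], W[0, 2] = 2.0, -0.5
X = sample_linear_iscm(W, np.ones(3), n=200_000, rng=np.random.default_rng(0))
print(np.corrcoef(X[:, 0], X[:, 1])[0, 1], 2.0 / np.sqrt(2.0 ** 2 + 1.0))
```

For a node with a single parent, the implied weight equals the correlation with that parent, so the empirical correlation between the first two variables should be close to $2/\sqrt{5}$.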

Efficient computation

We can efficiently compute the implied model weights using a bottom-up dynamic programming approach. This allows sampling data directly from the exact implied model of an iSCM without resorting to empirical standardization statistics. Algorithm 2 describes the procedure. We iteratively compute the weights and noise variances of the implied model following Equations (9) and (10). At each iteration, we update the covariance matrix according to Lemma 1. The algorithm processes the nodes in topological order, mirroring the proof by induction of Lemma 1.

Algorithm 2 Computing the Implied Model Parameters of Linear iSCMs

Input: DAG $\mathcal{G}$ , weight matrix $[W]_{i,j}:=w_{i,j}$ , noise variances $\bm{\sigma}^{2}\in\mathbb{R}_{+}^{d}$
$\widetilde{W}\leftarrow 0_{d\times d}$
$\Sigma\leftarrow I_{d}$
$\pi\leftarrow$ topological ordering of $\mathcal{G}$
for $i=1$ to $d$ do
    $\mathbf{w}\leftarrow W_{:,\pi_{i}}$    $\triangleright$ Edge weights ingoing to $\pi_{i}$
    $\operatorname{Var}[x_{\pi_{i}}]\leftarrow\mathbf{w}^{\top}\Sigma\mathbf{w}+\sigma^{2}_{\pi_{i}}$    $\triangleright$ Equation (8)
    $\widetilde{W}_{:,\pi_{i}}\leftarrow\mathbf{w}/\sqrt{\operatorname{Var}[x_{\pi_{i}}]}$    $\triangleright$ Equation (9)
    $\widetilde{\sigma}^{2}_{\pi_{i}}\leftarrow\sigma^{2}_{\pi_{i}}/\operatorname{Var}[x_{\pi_{i}}]$    $\triangleright$ Equation (10)
    for $j=1$ to $i-1$ do
        $\Sigma_{\pi_{j},\pi_{i}}\leftarrow(\Sigma_{\pi_{j},:})^{\top}\widetilde{W}_{:,\pi_{i}}$
        $\Sigma_{\pi_{i},\pi_{j}}\leftarrow\Sigma_{\pi_{j},\pi_{i}}$
return implied weights $\widetilde{W}$ , implied noise variances $\widetilde{\bm{\sigma}}^{2}$
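A NumPy sketch of this procedure is given below. It is an illustration rather than a reference implementation: the function name, the argument `order` (a topological ordering given as a list of node indices), and the additional return value `Sigma` are assumptions of this example. The inner loop runs over nodes strictly earlier in the ordering, so the unit diagonal of the covariance matrix is preserved.

```python
import numpy as np


def implied_linear_iscm(W, sigma_sq, order):
    """Implied weights and noise variances of a linear iSCM.

    W[j, i] is the weight of edge j -> i, sigma_sq[i] the noise variance of
    node i, and order a topological ordering of the nodes.
    Also returns Sigma, the covariance matrix of the standardized variables.
    """
    d = W.shape[0]
    W_tilde = np.zeros((d, d))
    sigma_sq_tilde = np.zeros(d)
    Sigma = np.eye(d)  # unit marginal variances by construction
    for pos, i in enumerate(order):
        w = W[:, i]                                # edge weights ingoing to node i
        var_x = w @ Sigma @ w + sigma_sq[i]        # Equation (8)
        W_tilde[:, i] = w / np.sqrt(var_x)         # Equation (9)
        sigma_sq_tilde[i] = sigma_sq[i] / var_x    # Equation (10)
        for j in order[:pos]:                      # nodes earlier in the ordering
            Sigma[j, i] = Sigma[j, :] @ W_tilde[:, i]
            Sigma[i, j] = Sigma[j, i]
    return W_tilde, sigma_sq_tilde, Sigma
```

One way to sanity-check the output is to compare `Sigma` with the covariance implied by the returned linear SCM, $(I-\widetilde{W}^{\top})^{-1}\operatorname{diag}(\widetilde{\bm{\sigma}}^{2})(I-\widetilde{W})^{-1}$, which should agree and have a unit diagonal.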

Appendix B Interventions in iSCMs

For an iSCM $(\mathbf{S},\mathcal{P}_{\bm{\varepsilon}})$ , we can formalize interventions as changes to its causal mechanisms $f_{i}$ , analogous to the common definition for SCMs (Peters et al.,, 2017). Specifically, let $\mu_{i}:=\mathds{E}[x_{i}]$ and $s_{i}:=\smash{\sqrt{\operatorname{Var}[x_{i}]}}$ be the mean and standard deviation of the latent variable $x_{i}$ . We define an intervention as replacing one (or several) of the assignments to the latent variables as

\displaystyle x_{i}:=h_{i}(\mathbf{\widetilde{x}}_{\mathrm{pa}(i)},\varepsilon_{i}),

for some function $h_{i}$ . Importantly, the statistics $\mu_{i}$ and $s_{i}$ used for the standardization operation

\displaystyle\widetilde{x}_{i}:=\frac{x_{i}-\mu_{i}}{s_{i}}

remain unchanged. Thus, if we intervene on mechanisms of iSCMs, the variables $\widetilde{\bf{x}}$ may no longer have zero mean and unit variance, and the perturbations of $x_{i}$ propagate downstream through the causal mechanisms. We note that, under the above definition, intervening on an iSCM through a new mechanism $h_{i}$ is equivalent to intervening on the implied SCM of an iSCM with the mechanism

\displaystyle\widetilde{h}_{i}(\mathbf{x},\varepsilon)=\frac{h_{i}(\mathbf{x},\varepsilon)-\mu_{i}}{s_{i}}\,.

Appendix A.2 provides details on the implied models of iSCMs.
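The definition above can be sketched in code as follows. The example assumes a linear iSCM with unit-variance Gaussian noise, nodes indexed in topological order, empirical estimates of $\mu_{i}$ and $s_{i}$, and interventions that set the latent variable to a constant; the function name and its `stats` and `do` arguments are hypothetical.

```python
import numpy as np


def sample_linear_iscm(W, n, rng, stats=None, do=None):
    """Sample from a linear iSCM, optionally under an intervention.

    W[j, i] is the weight of edge j -> i (nodes topologically ordered).
    stats: (mu, s) from an observational run, kept frozen when given.
    do:    dict {node: constant} replacing x_i := h_i(...) by a constant.
    """
    do = {} if do is None else do
    if do and stats is None:
        raise ValueError("interventions require frozen observational statistics")
    d = W.shape[0]
    mu = np.zeros(d) if stats is None else stats[0].copy()
    s = np.ones(d) if stats is None else stats[1].copy()
    X_tilde = np.zeros((n, d))
    for i in range(d):
        if i in do:
            x_i = np.full(n, float(do[i]))        # intervened latent assignment
        else:
            x_i = X_tilde @ W[:, i] + rng.standard_normal(n)
        if stats is None:                         # observational run: estimate statistics
            mu[i], s[i] = x_i.mean(), x_i.std()
        X_tilde[:, i] = (x_i - mu[i]) / s[i]      # mu_i and s_i stay unchanged under do
    return X_tilde, (mu, s)


# Chain 0 -> 1 -> 2; intervene on the latent variable of node 1
W = np.zeros((3, 3))
W[0, 1], W[1, 2] = 1.0, 2.0
rng = np.random.default_rng(0)
X_obs, stats = sample_linear_iscm(W, 50_000, rng)
X_int, _ = sample_linear_iscm(W, 50_000, rng, stats=stats, do={1: 3.0})
print(X_obs.std(axis=0), X_int.mean(axis=0), X_int.var(axis=0))
```

In this toy example, the observational variables have unit variance, while after the intervention the downstream variable is shifted and no longer has zero mean and unit variance, since the standardization statistics are those of the observational system.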

Appendix C Proofs

C.1 Definitions

We define the key concepts used throughout our analysis. A path $p_{j\leftrightarrow i}$ between $v_{i}$ and $v_{j}$ is a set of directed edges that allows reaching $v_{i}$ from $v_{j}$ (and vice versa), not taking into account edge directionality, and that joins unique vertices. We call a node a collider in a path if the node has two ingoing directed edges in the path. We say that a path between $v_{i}$ and $v_{j}$ is unblocked if and only if there is no node $v_{k}$ that is a collider in the path (see Figure 9(a)). Finally, we use the term undirected connected component to refer to any maximal subgraph of $\smash{\tilde{\mathcal{G}}}$ in which any two nodes are connected by a path containing only undirected edges (Wienöbst et al.,, 2023).

C.2 Explicit Covariance in Linear SCMs with Unit Marginal Variances

See 1

Proof.

We will give a proof by induction on the number of vertices $d=|\mathcal{V}|$ in the DAG $\mathcal{G}$ . Without loss of generality, we assume that the indices of the nodes are ordered according to some fixed topological ordering $\pi$ , so $\pi(j)<\pi(i)$ if $j<i$ . By the unit marginal variance assumption,

\operatorname{Cov}[x_{i},x_{i}]=\operatorname{Var}[x_{i}]=1\,.

(12)

Since the covariance between $x_{i}$ and $x_{j}$ is symmetric, we consider, without loss of generality, two arbitrary indices $j<i$ from now on.

Base case ( $d=2$ )

If $v_{j}$ is not an ancestor of $v_{i}$ in graph $\mathcal{G}$ , they both must be root nodes, because the edge $v_{i}\leftarrow v_{j}$ is the only possible edge when $\pi(j)<\pi(i)$ . Since $x_{i}$ and $x_{j}$ are root nodes, they are independent and $\operatorname{Cov}[x_{i},x_{j}]=0$ . Because the graph then contains no edge, there are no paths, and in particular no unblocked paths, between $v_{i}$ and $v_{j}$ , so the RHS of Equation (3) is also $0$ .

Conversely, if $v_{j}$ is an ancestor of $v_{i}$ in graph $\mathcal{G}$ , $v_{j}$ is the only parent and ancestor of $v_{i}$ . This implies that

\displaystyle\begin{split}\operatorname{Cov}[x_{i},x_{j}]&=\operatorname{Cov}[w_{j,i}x_{j}+\varepsilon_{i},x_{j}]\\ &=w_{j,i}\operatorname{Cov}[x_{j},x_{j}]\\ &=w_{j,i}\,,\end{split}

where the second equality uses that $\varepsilon_{i}$ is independent of $x_{j}$ , and the last equality follows from Equation (12). This is exactly Equation (3) for a two-node graph.

Figure 6: Lemma 1 inductive step. If $v_{j}$ is before $v_{i}$ in the topological ordering, then all unblocked paths from $v_{j}$ to $v_{i}$ must contain a parent of $v_{i}$ as the second to last node. To see this, suppose an unblocked path from $v_{j}$ to $v_{i}$ would instead contain a child of $v_{i}$ as the second to last node. Then, there either exists a collider on the path to $v_{j}$ , contradicting that the path is unblocked, or all edges in the path point away from $v_{i}$ , implying that $v_{j}$ is a descendant of $v_{i}$ and contradicting the topological ordering. Dotted lines represent unblocked paths (which may have common nodes). Solid lines represent edges. $v_{j}$ may or may not be a parent of $v_{i}$ , which we illustrate with a blue arrow.

Induction step ( $d>2$ )

Let us assume that Equation (3) holds for all graphs of size $d-1$ , and let $\mathcal{G}$ have $d$ nodes. We will apply the inductive hypothesis to the subgraph of the first $d-1$ nodes in $\mathcal{G}$ and show that the full DAG $\mathcal{G}$ including the $d$ -th vertex still satisfies Equation (3). First, we note that, since the $d$ -th vertex is last in the topological ordering, it has no outgoing edges. Hence, it is not visited on any unblocked path between $v_{j}$ and $v_{i}$ for $i,j<d$ , because $v_{d}$ would have to be a collider on any such path. Second, adding the node $v_{d}$ to a subsystem containing $x_{1},\dots,x_{d-1}$ does not change the joint distribution of $x_{i},x_{j}$ and therefore has no effect on the covariance between $x_{i}$ and $x_{j}$ . Hence, both sides of Equation (3) are unchanged by the presence of the node $v_{d}$ , and the equation still holds for all $i,j<d$ .

We want to show that Equation (3) also holds for $i=d$ and any $j<i$ . For this, we first construct all unblocked paths from $v_{j}$ to $v_{i}$ . Any such path must enter $v_{i}$ through one of its parents $k\in\text{pa}(i)$ , because $j<i$ in the topological ordering (see Figure 6). Moreover, for any $k\in\text{pa}(i)$ , appending $k\rightarrow i$ to an unblocked path $p_{j\leftrightarrow k}$ between $v_{j}$ and $v_{k}$ creates a new unblocked path between $v_{j}$ and $v_{i}$ , since $v_{k}$ then has an outgoing edge in the path and cannot be a collider. Hence, for $i=d$ and any $j<i$ , it holds that

\displaystyle\begin{split}\operatorname{Cov}[x_{i},x_{j}]&=\operatorname{Cov}\Big[\sum_{k\in\mathrm{pa}(i)}w_{k,i}x_{k}+\varepsilon_{i},x_{j}\Big]\\ &=\sum_{k\in\mathrm{pa}(i)}w_{k,i}\operatorname{Cov}[x_{k},x_{j}]\\ &\overset{(1)}{=}w_{j,i}\operatorname{Cov}[x_{j},x_{j}]+\sum_{k\in\mathrm{pa}(i)\setminus\{j\}}w_{k,i}\operatorname{Cov}[x_{k},x_{j}]\\ &\overset{(2)}{=}w_{j,i}+\sum_{k\in\mathrm{pa}(i)\setminus\{j\}}w_{k,i}\sum_{p_{j\leftrightarrow k}\in P_{j\leftrightarrow k}}\prod_{(l,m)\in p_{j\leftrightarrow k}}w_{l,m}\\ &=w_{j,i}+\sum_{k\in\mathrm{pa}(i)\setminus\{j\}}\left(\sum_{p_{j\leftrightarrow k}\in P_{j\leftrightarrow k}}w_{k,i}\prod_{(l,m)\in p_{j\leftrightarrow k}}w_{l,m}\right)\\ &\overset{(3)}{=}\sum_{k\in\mathrm{pa}(i)}\left(\mathbbm{1}[k=j]w_{j,i}+\mathbbm{1}[k\neq j]\left(\sum_{p_{j\leftrightarrow k}\in P_{j\leftrightarrow k}}w_{k,i}\prod_{(l,m)\in p_{j\leftrightarrow k}}w_{l,m}\right)\right)\\ &\overset{(4)}{=}\sum_{p_{j\leftrightarrow i}\in P_{j\leftrightarrow i}}\prod_{(l,m)\in p_{j\leftrightarrow i}}w_{l,m}\,.\end{split}

For step (1), consider two cases. If $j\notin\text{pa}(i)$ , then $w_{j,i}=0$ and the equality trivially holds. If $j\in\text{pa}(i)$ , then it holds by pulling the term for $j$ out of the sum in the previous line. In (2), we use Equation (12) and apply the inductive hypothesis to express the covariances in terms of a sum of products of weights. In (3), we rearrange terms to pull the $w_{j,i}$ term into the sum over parents. In (4), we use the fact that the set of unblocked paths from $v_{j}$ to $v_{i}$ consists of all unblocked paths from $v_{j}$ to any parent of $v_{i}$ , which is $v_{k}$ here, with the extra edge $k\rightarrow i$ appended, together with a possible single-edge path directly connecting $v_{j}$ with $v_{i}$ (if $j\in\mathrm{pa}(i)$ ).

This completes the induction step and the proof. ∎
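As a numerical sanity check of Lemma 1, the following sketch compares the covariances obtained from the recursion used in the induction step with hand-enumerated sums over unblocked paths. The diamond DAG, the weights, and the variable names are illustrative assumptions; the noise variances are chosen so that every marginal variance equals $1$.

```python
# Diamond DAG 1 -> 2, 1 -> 3, 2 -> 4, 3 -> 4 with illustrative weights.
w = {(1, 2): 0.5, (1, 3): -0.4, (2, 4): 0.3, (3, 4): 0.6}
order = [1, 2, 3, 4]  # node labels follow a topological ordering
parents = {i: [k for (k, c) in w if c == i] for i in order}

cov = {}        # cov[(j, i)] with j <= i
noise_var = {}  # noise variance that yields unit marginal variance
for i in order:
    # Covariance with every earlier node: Cov[x_i, x_j] = sum_k w_ki Cov[x_k, x_j].
    for j in order[: order.index(i)]:
        cov[(j, i)] = sum(
            w[(k, i)] * cov[(min(k, j), max(k, j))] for k in parents[i]
        )
    # Variance explained by the parents; the remainder is assigned to the noise.
    explained = sum(
        w[(k, i)] * w[(l, i)] * cov[(min(k, l), max(k, l))]
        for k in parents[i]
        for l in parents[i]
    )
    noise_var[i] = 1.0 - explained
    cov[(i, i)] = 1.0

# Sums over unblocked paths of products of edge weights, enumerated by hand.
path_sums = {
    (1, 2): w[(1, 2)],
    (1, 3): w[(1, 3)],
    (2, 3): w[(1, 2)] * w[(1, 3)],                          # 2 <- 1 -> 3
    (1, 4): w[(1, 2)] * w[(2, 4)] + w[(1, 3)] * w[(3, 4)],  # two directed paths
    (2, 4): w[(2, 4)] + w[(1, 2)] * w[(1, 3)] * w[(3, 4)],  # 2 -> 4 and 2 <- 1 -> 3 -> 4
    (3, 4): w[(3, 4)] + w[(1, 3)] * w[(1, 2)] * w[(2, 4)],  # 3 -> 4 and 3 <- 1 -> 2 -> 4
}

for pair, value in path_sums.items():
    print(pair, round(cov[pair], 6), round(value, 6))
```

The path $v_{2}\rightarrow v_{4}\leftarrow v_{3}$ does not contribute to $\operatorname{Cov}[x_{2},x_{3}]$ because $v_{4}$ is a collider on it.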

C.3 Bound on the Fraction of CEV

See 2

Proof.

We begin by bounding the variance of the latent variables $x_{i}$ in iSCMs. Starting from Equation (8), we can bound the covariances with a product of unit variances as

\displaystyle\begin{split}\operatorname{Var}[x_{i}]&=\sum_{k\in\mathrm{pa}(i)}\sum_{j\in\mathrm{pa}(i)}w_{k,i}w_{j,i}\operatorname{Cov}[\widetilde{x}_{j},\widetilde{x}_{k}]+\sigma^{2}\\ &\overset{(1)}{\leq}\sum_{k\in\mathrm{pa}(i)}\sum_{j\in\mathrm{pa}(i)}w_{k,i}w_{j,i}+\sigma^{2}\\ &=\Big{(}\sum_{j\in\mathrm{pa}(i)}w_{j,i}\Big{)}^{2}+\sigma^{2}\\ &\overset{(2)}{\leq}m^{2}w^{2}+\sigma^{2}\,,\end{split}

where (1) uses $\operatorname{Cov}[\widetilde{x}_{j},\widetilde{x}_{k}]\leq 1$ since $\operatorname{Var}[\widetilde{x}_{j}]=1$ and $\operatorname{Var}[\widetilde{x}_{k}]=1$ , and (2) applies the Cauchy-Schwarz inequality. Since we obtain $\widetilde{x_{i}}$ from $x_{i}$ just by shifting and scaling the latter, we observe that $\operatorname{CEV_{f}}[\widetilde{x_{i}}]=\operatorname{CEV_{f}}[x_{i}]$ . Using the upper bound on the variance of $x_{i}$ and the definition of the fraction of cause-explained variance in Equation (4), we get

\displaystyle\begin{split}\operatorname{CEV_{f}}[\widetilde{x_{i}}]&=\operatorname{CEV_{f}}[x_{i}]=1-\frac{\operatorname{Var}[x_{i}-\mathds{E}[x_{i}|\mathbf{x}_{\mathrm{pa}(i)}]]}{\operatorname{Var}[x_{i}]}=1-\frac{\operatorname{Var}[x_{i}-\mathbf{w}_{i}^{\top}\mathbf{x}_{\mathrm{pa}(i)}]}{\operatorname{Var}[x_{i}]}\\ &=1-\frac{\operatorname{Var}[\varepsilon_{i}]}{\operatorname{Var}[x_{i}]}=1-\frac{\sigma^{2}}{\operatorname{Var}[x_{i}]}\leq 1-\frac{\sigma^{2}}{m^{2}w^{2}+\sigma^{2}}\,.\end{split}

∎
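The bound can be checked numerically with a short sketch. It assumes, as the proof suggests, that $m$ is the number of parents of $v_{i}$ and $w$ an upper bound on the absolute weights; the function names, weights, and the parent correlation matrix are hypothetical.

```python
def cev_fraction(weights, parent_corr, sigma2):
    """CEV_f of x_i = w^T x_tilde_pa + eps, given the correlations of its parents."""
    m = len(weights)
    explained = sum(
        weights[a] * weights[b] * parent_corr[a][b]
        for a in range(m)
        for b in range(m)
    )
    return 1.0 - sigma2 / (explained + sigma2)


def cev_upper_bound(m, w_max, sigma2):
    """Upper bound 1 - sigma^2 / (m^2 w^2 + sigma^2) from the proof above."""
    return 1.0 - sigma2 / (m**2 * w_max**2 + sigma2)


# Three standardized parents with an illustrative correlation matrix.
corr = [[1.0, 0.3, 0.2], [0.3, 1.0, -0.1], [0.2, -0.1, 1.0]]
weights = [0.8, -0.5, 0.9]

value = cev_fraction(weights, corr, sigma2=1.0)
bound = cev_upper_bound(m=3, w_max=0.9, sigma2=1.0)
print(f"CEV_f = {value:.4f} <= bound = {bound:.4f}")
```

In this sketch, the bound is attained when all parents are perfectly correlated and all weights equal $w$, which is the extreme case covered by steps (1) and (2).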

C.4 Identifiability

In this section, we prove Theorems 3 and 4. We begin by deriving the covariances for the 3-node example in Section 4.2 and then give the general proofs for forests. The proofs of both theorems share the same underlying argument. We first derive the SCM forms of the original models, i.e., standardized SCMs in Theorem 3 and iSCMs in Theorem 4. By showing that the standardized SCMs and iSCMs are SCMs with the same causal graphs $\mathcal{G}$ and observational distributions $p(\bf{x})$ , we can leverage Lemma 1 to obtain the covariances between the observed variables in both model classes. Ultimately, these covariances allow us to derive (non)identifiability conditions for the DAGs $\mathcal{G}$ in an MEC underlying the original models.

Theorems 3 and 4 assume that the exogenous noise is sampled from a zero-centered distribution with equal variance across variables. Since the results are based on the analysis of covariances, they also hold when $\mathds{E}[\varepsilon_{i}]\neq 0$ , but the zero-mean assumption simplifies notation. To derive the results for iSCMs, we additionally assume that the noise is Gaussian (see Theorem 4). When referring to an undirected edge between nodes $v_{i},v_{j}$ , for example, in an MEC, we still denote the edge with $(v_{i},v_{j})$ , but the ordering of the nodes is arbitrary.

C.4.1 3-Node Case

We begin by studying the 3-node example of Figure 3 in Section 4.2. Let $\alpha_{i},\beta_{i},\gamma_{i},\lambda_{i}\in\mathbb{R}$ be linear function weights, and consider the following three causal graphs $\mathcal{G}$ belonging to the same MEC, along with their corresponding SCMs and iSCMs.

$\mathcal{G}$	SCM	iSCM
$v_{1}\rightarrow v_{2}\rightarrow v_{3}$	$x_{1}:=\varepsilon_{1}$ , $x_{2}:=\alpha_{1}x_{1}+\varepsilon_{2}$ , $x_{3}:=\beta_{1}x_{2}+\varepsilon_{3}$ (13)	$x_{1}:=\varepsilon_{1}$ , $x_{2}:=\gamma_{1}\widetilde{x}_{1}+\varepsilon_{2}$ , $x_{3}:=\lambda_{1}\widetilde{x}_{2}+\varepsilon_{3}$ (14)
$v_{1}\leftarrow v_{2}\rightarrow v_{3}$	$x_{1}:=\alpha_{2}x_{2}+\varepsilon_{1}$ , $x_{2}:=\varepsilon_{2}$ , $x_{3}:=\beta_{2}x_{2}+\varepsilon_{3}$ (15)	$x_{1}:=\gamma_{2}\widetilde{x}_{2}+\varepsilon_{1}$ , $x_{2}:=\varepsilon_{2}$ , $x_{3}:=\lambda_{2}\widetilde{x}_{2}+\varepsilon_{3}$ (16)
$v_{1}\leftarrow v_{2}\leftarrow v_{3}$	$x_{1}:=\alpha_{3}x_{2}+\varepsilon_{1}$ , $x_{2}:=\beta_{3}x_{3}+\varepsilon_{2}$ , $x_{3}:=\varepsilon_{3}$ (17)	$x_{1}:=\gamma_{3}\widetilde{x}_{2}+\varepsilon_{1}$ , $x_{2}:=\lambda_{3}\widetilde{x}_{3}+\varepsilon_{2}$ , $x_{3}:=\varepsilon_{3}$ (18)

In the following subsections, we derive the covariance matrices of each of the three systems, respectively. This leads us to the equivalence presented in Equation (5) for standardized SCMs. Moreover, we show that, for iSCMs, all three systems induce exactly the same observational distribution if and only if $\lambda_{1}=\lambda_{2}=\lambda_{3}$ and $\gamma_{1}=\gamma_{2}=\gamma_{3}$ . These are the 3-node special cases of Theorems 3 and 4.

Standardized SCM

To obtain the covariances between the observed variables in the standardized SCMs of Equations (13), (15), and (17), we first show that the assignments to the observed variables in standardized SCMs can be written in the form of linear SCMs over the same causal graph, which allows us to use Lemma 1. In all three systems, every vertex has at most one parent. When the node $v_{j}$ is the only parent of $v_{i}$ , under our assumptions on the noise, we have $x_{j}=\smash{\sqrt{\operatorname{Var}[x_{j}]}x_{j}^{s}}$ , so the assignment of $x_{i}^{s}$ can be written in the form of an SCM over $\bf{x}^{s}$ as

x_{i}^{s}:=\frac{x_{i}}{\sqrt{\operatorname{Var}[x_{i}]}}=\frac{w_{j,i}x_{j}+\varepsilon_{i}}{\sqrt{\operatorname{Var}[x_{i}]}}=\frac{w_{j,i}{\sqrt{\operatorname{Var}[x_{j}]}}x_{j}^{s}+\varepsilon_{i}}{\sqrt{\operatorname{Var}[x_{i}]}}=w_{j,i}\sqrt{\frac{\operatorname{Var}[x_{j}]}{\operatorname{Var}[x_{i}]}}x_{j}^{s}+\frac{\varepsilon_{i}}{\sqrt{\operatorname{Var}[x_{i}]}}\,.

(19)

To use Equation (19), we first need to compute the marginal variances of the unstandardized observations $x_{i}$ . For the standardized SCMs, these marginal variances are, respectively:

for Equation (13):	for Equation (15):	for Equation (17):
$\operatorname{Var}[x_{1}]=\sigma^{2}$	$\operatorname{Var}[x_{1}]=(\alpha_{2}^{2}+1)\sigma^{2}$	$\operatorname{Var}[x_{1}]=(\alpha_{3}^{2}(\beta_{3}^{2}+1)+1)\sigma^{2}$
$\operatorname{Var}[x_{2}]=(\alpha_{1}^{2}+1)\sigma^{2}$	$\operatorname{Var}[x_{2}]=\sigma^{2}$	$\operatorname{Var}[x_{2}]=(\beta_{3}^{2}+1)\sigma^{2}$
$\operatorname{Var}[x_{3}]=(\beta_{1}^{2}(\alpha_{1}^{2}+1)+1)\sigma^{2}$	$\operatorname{Var}[x_{3}]=(\beta_{2}^{2}+1)\sigma^{2}$	$\operatorname{Var}[x_{3}]=\sigma^{2}$

Given Equation (19) and the marginal variances, we know the weights of all three implied SCMs explicitly. Since all implied SCMs are linear, have unit marginal variances, and share the same causal graph, we can apply Lemma 1 and obtain the covariances of the observational distributions in the original models:

for Equation (13):	for Equation (15):	for Equation (17):
$\operatorname{Cov}[x_{1}^{s},x_{2}^{s}]=\tfrac{\alpha_{1}}{\sqrt{\alpha_{1}^{2}+1}}$	$\operatorname{Cov}[x_{1}^{s},x_{2}^{s}]=\tfrac{\alpha_{2}}{\sqrt{\alpha_{2}^{2}+1}}$	$\operatorname{Cov}[x_{1}^{s},x_{2}^{s}]=\alpha_{3}\sqrt{\tfrac{\beta_{3}^{2}+1}{\alpha_{3}^{2}(\beta_{3}^{2}+1)+1}}$
$\operatorname{Cov}[x_{1}^{s},x_{3}^{s}]=\tfrac{\alpha_{1}\beta_{1}}{\sqrt{\beta_{1}^{2}(\alpha_{1}^{2}+1)+1}}$	$\operatorname{Cov}[x_{1}^{s},x_{3}^{s}]=\tfrac{\alpha_{2}\beta_{2}}{\sqrt{(\alpha_{2}^{2}+1)(\beta_{2}^{2}+1)}}$	$\operatorname{Cov}[x_{1}^{s},x_{3}^{s}]=\tfrac{\alpha_{3}\beta_{3}}{\sqrt{\alpha_{3}^{2}(\beta_{3}^{2}+1)+1}}$
$\operatorname{Cov}[x_{2}^{s},x_{3}^{s}]=\beta_{1}\sqrt{\tfrac{\alpha_{1}^{2}+1}{\beta_{1}^{2}(\alpha_{1}^{2}+1)+1}}$	$\operatorname{Cov}[x_{2}^{s},x_{3}^{s}]=\tfrac{\beta_{2}}{\sqrt{\beta_{2}^{2}+1}}$	$\operatorname{Cov}[x_{2}^{s},x_{3}^{s}]=\tfrac{\beta_{3}}{\sqrt{\beta_{3}^{2}+1}}$

In the standardized SCM (13), the causal graph is $v_{1}\rightarrow v_{2}\rightarrow v_{3}$ . Hence, the edge directions of the DAG $\mathcal{G}$ are consistent with the direction of increasing absolute covariance if and only if

\displaystyle\begin{split}\lvert\operatorname{Cov}[x_{1}^{s},x_{2}^{s}]\rvert<\lvert\operatorname{Cov}[x_{2}^{s},x_{3}^{s}]\rvert&\quad\Longleftrightarrow\quad\left\lvert\tfrac{\alpha_{1}}{\sqrt{\alpha_{1}^{2}+1}}\right\rvert<\left\lvert\beta_{1}\sqrt{\tfrac{\alpha_{1}^{2}+1}{\beta_{1}^{2}(\alpha_{1}^{2}+1)+1}}\right\rvert\\ &\quad\Longleftrightarrow\quad\tfrac{\alpha_{1}^{2}}{\alpha_{1}^{2}+1}<\beta_{1}^{2}\tfrac{\alpha_{1}^{2}+1}{\beta_{1}^{2}(\alpha_{1}^{2}+1)+1}\\ &\quad\Longleftrightarrow\quad\alpha_{1}^{2}(\beta_{1}^{2}(\alpha_{1}^{2}+1)+1)<\beta_{1}^{2}(\alpha_{1}^{2}+1)^{2}\\ &\quad\Longleftrightarrow\quad\cancel{\beta_{1}^{2}\alpha_{1}^{4}}+\cancel{\beta_{1}^{2}\alpha_{1}^{2}}+\alpha_{1}^{2}<\cancel{\beta_{1}^{2}\alpha_{1}^{4}}+\cancel{2}\beta_{1}^{2}\alpha_{1}^{2}+\beta_{1}^{2}\\ &\quad\Longleftrightarrow\quad\alpha_{1}^{2}<\beta_{1}^{2}(\alpha_{1}^{2}+1)\\ &\quad\Longleftrightarrow\quad\tfrac{\alpha_{1}^{2}}{\alpha_{1}^{2}+1}<\beta_{1}^{2}\,.\end{split}

(20)

In the above equivalences, we only square nonnegative quantities and multiply or divide by quantities greater than $0$ , so the direction of the inequality never changes and every step is an equivalence. For the standardized SCM (17) with causal graph $v_{1}\leftarrow v_{2}\leftarrow v_{3}$ , the same algebraic manipulations yield an analogous condition for the edges to be aligned with the order of increasing absolute covariance:

\displaystyle\lvert\operatorname{Cov}[x_{3}^{s},x_{2}^{s}]\rvert<\lvert\operatorname{Cov}[x_{2}^{s},x_{1}^{s}]\rvert\quad\Longleftrightarrow\quad\tfrac{\beta_{3}^{2}}{\beta_{3}^{2}+1}<\alpha_{3}^{2}\,.

We make use of both of these conditions in Section 4. Since $z/(z+1)<1$ for any $z>0$ , the right-hand sides of both conditions hold whenever the absolute values of all weights are at least $1$ . In this case, the absolute covariance increases downstream in the SCMs of Equations (13) and (17). Hence, among these two systems, only the DAG $\mathcal{G}$ whose edges align with the covariance ordering in the observed $p(\bf{x}^{s})$ can induce $p(\bf{x}^{s})$ , and we can conclude that the other DAG is not the true causal graph.
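The equivalence in Equation (20) can be verified numerically for the chain $v_{1}\rightarrow v_{2}\rightarrow v_{3}$ . The sketch below is an illustration under the stated equal-noise-variance assumption; the function names and test weights are hypothetical.

```python
import math


def standardized_chain_covs(alpha, beta):
    """Adjacent covariances of the standardized chain x1 -> x2 -> x3.

    Variances are expressed in units of sigma^2, which cancels in all ratios.
    """
    var1 = 1.0
    var2 = alpha**2 * var1 + 1.0
    var3 = beta**2 * var2 + 1.0
    # Implied weights from Equation (19); by Lemma 1 they equal the covariances.
    cov12 = alpha * math.sqrt(var1 / var2)
    cov23 = beta * math.sqrt(var2 / var3)
    return cov12, cov23


def condition_20(alpha, beta):
    """Final inequality of Equation (20)."""
    return alpha**2 / (alpha**2 + 1.0) < beta**2


for alpha, beta in [(2.0, 0.9), (2.0, 0.8), (0.5, -1.5)]:
    cov12, cov23 = standardized_chain_covs(alpha, beta)
    print(alpha, beta, abs(cov12) < abs(cov23), condition_20(alpha, beta))
```

For every weight pair, the two printed booleans agree, and they are both true whenever $\lvert\beta_{1}\rvert\geq 1$ .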

iSCM

To derive the observational distributions of the iSCMs in Equations (14), (16), and (18), we proceed in the same way as we did for standardized SCMs. We first show that the iSCM is an SCM with a specific set of mechanisms and then apply Lemma 1 to obtain the covariances between the observed variables. To see this, we write the assignment of $\widetilde{x_{i}}$ as

\widetilde{x_{i}}:=\frac{x_{i}}{\sqrt{\operatorname{Var}[x_{i}]}}=\frac{w_{j,i}\widetilde{x_{j}}+\varepsilon_{i}}{\sqrt{\operatorname{Var}[x_{i}]}}=\frac{w_{j,i}}{\sqrt{\operatorname{Var}[x_{i}]}}\widetilde{x_{j}}+\frac{\varepsilon_{i}}{\sqrt{\operatorname{Var}[x_{i}]}}\,.

(21)

As before, using Equation (21) requires first computing the marginal variances of the latent variables $x_{i}$ . For the iSCMs defined by Equations (14), (16), and (18), they are given by

for Equation (14):	for Equation (16):	for Equation (18):
$\operatorname{Var}[x_{1}]=\sigma^{2}$	$\operatorname{Var}[x_{1}]=\gamma_{2}^{2}+\sigma^{2}$	$\operatorname{Var}[x_{1}]=\gamma_{3}^{2}+\sigma^{2}$
$\operatorname{Var}[x_{2}]=\gamma_{1}^{2}+\sigma^{2}$	$\operatorname{Var}[x_{2}]=\sigma^{2}$	$\operatorname{Var}[x_{2}]=\lambda_{3}^{2}+\sigma^{2}$
$\operatorname{Var}[x_{3}]=\lambda_{1}^{2}+\sigma^{2}$	$\operatorname{Var}[x_{3}]=\lambda_{2}^{2}+\sigma^{2}$	$\operatorname{Var}[x_{3}]=\sigma^{2}$

Given Equation (21) and the marginal variances, we obtain an explicit form for the weights of all three implied SCMs. Since the implied SCMs are linear, have unit marginal variances, and share the same causal graph, we can apply Lemma 1 and obtain the covariances of the observational distributions in the original models. It turns out that the observational distribution of $(\widetilde{x}_{1},\widetilde{x}_{2},\widetilde{x}_{3})$ in each of the three ground-truth systems in Equations (14), (16), and (18) is a multivariate Gaussian whose covariance matrix has the same functional form: the diagonal elements are equal to $1$ and, for system $i\in\{1,2,3\}$ , the off-diagonal elements are given by

\displaystyle\begin{split}\operatorname{Cov}[\widetilde{x}_{1},\widetilde{x}_{2}]&=\frac{\gamma_{i}}{\sqrt{\gamma_{i}^{2}+\sigma^{2}}}\\ \operatorname{Cov}[\widetilde{x}_{1},\widetilde{x}_{3}]&=\frac{\gamma_{i}\lambda_{i}}{\sqrt{(\lambda_{i}^{2}+\sigma^{2})(\gamma_{i}^{2}+\sigma^{2})}}\\ \operatorname{Cov}[\widetilde{x}_{2},\widetilde{x}_{3}]&=\frac{\lambda_{i}}{\sqrt{\lambda_{i}^{2}+\sigma^{2}}}\,.\end{split}

(22)

Since the observational distribution of all three SCMs is a zero-centered multivariate Gaussian, the distributions are equal if and only if their covariance matrices are identical. The covariances are equal if and only if $\lambda_{1}=\lambda_{2}=\lambda_{3}$ and $\gamma_{1}=\gamma_{2}=\gamma_{3}$ , because the function $f(z)=\smash{z/\sqrt{z^{2}+\sigma^{2}}}$ appearing in $\smash{\operatorname{Cov}[\widetilde{x}_{1},\widetilde{x}_{2}]}$ and $\smash{\operatorname{Cov}[\widetilde{x}_{2},\widetilde{x}_{3}]}$ of Equation (22) is injective for any $\sigma>0$ , which means that distinct weights $z$ are mapped to distinct covariances. Therefore, the three-node linear iSCMs in the above MEC share the same observational distribution if and only if they also share the same weights for each edge, regardless of edge orientation.

This implies that the three DAGs $\mathcal{G}$ in the MEC of Equations (14), (16), and (18) are not identifiable from $p(\widetilde{\bf{x}})$ : given $p(\widetilde{\bf{x}})$ induced by an iSCM with DAG in this 3-node MEC, the two other DAGs with the same linear function weights induce the same distribution $p(\widetilde{\bf{x}})$ .
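The contrast between the two model classes can be reproduced with a small sketch that applies Equations (19) and (21) to the three DAGs of the MEC. The helper `adjacent_covs`, the dictionary `GRAPHS`, and the shared edge weights are assumptions made for illustration; for simplicity, the same numeric weight is attached to an edge in every orientation.

```python
import math

# The three DAGs of the MEC; edges are (parent, child) in topological order.
GRAPHS = {
    "v1 -> v2 -> v3": [(1, 2), (2, 3)],
    "v1 <- v2 -> v3": [(2, 1), (2, 3)],
    "v1 <- v2 <- v3": [(3, 2), (2, 1)],
}


def adjacent_covs(edges, weight, sigma, internal):
    """Covariances between neighboring standardized variables.

    weight:   dict mapping frozenset({a, b}) to the linear weight on that edge.
    internal: True for an iSCM, False for a post-hoc standardized SCM.
    """
    var = {}  # variance of each variable before standardization
    cov = {}
    for parent, child in edges:
        var.setdefault(parent, sigma**2)  # root nodes carry only noise
        w = weight[frozenset((parent, child))]
        # An iSCM child sees a unit-variance parent; an SCM child sees the raw one.
        parent_var = 1.0 if internal else var[parent]
        var[child] = w**2 * parent_var + sigma**2
        # Implied weight, Equations (19) and (21); equals the covariance by Lemma 1.
        cov[frozenset((parent, child))] = w * math.sqrt(parent_var / var[child])
    return cov


weight = {frozenset((1, 2)): 1.2, frozenset((2, 3)): -0.7}  # illustrative
for name, edges in GRAPHS.items():
    iscm = adjacent_covs(edges, weight, sigma=1.0, internal=True)
    scm = adjacent_covs(edges, weight, sigma=1.0, internal=False)
    print(name, "iSCM:", sorted(iscm.values()), "standardized SCM:", sorted(scm.values()))
```

With shared weights, the iSCM covariances coincide for all three orientations, whereas the standardized SCM covariances depend on the edge directions. Since no graph contains a collider, $\operatorname{Cov}[\widetilde{x}_{1},\widetilde{x}_{3}]$ is the product of the two adjacent covariances by Lemma 1.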

C.4.2 Forests

In this section, we generalize the above partial identifiability result for standardized SCMs to arbitrary forest DAGs (Theorem 3). After that, we similarly generalize the nonidentifiability of iSCMs to forests (Theorem 4). Our results concern the identification of edge directions in an MEC represented by its partially directed graph $\smash{\tilde{\mathcal{G}}=(\mathcal{V},\tilde{\mathcal{E}})}$ , where $\smash{\tilde{\mathcal{E}}}$ contains both directed and undirected edges.

Standardized SCM

Before proving the main theorem, we extend the 3-node example to chains of arbitrary length. We show that all but at most one edge in the MEC can be correctly oriented from observational data using the assumption on the support of the weights. Analogous to the 3-node case, we then use this to prove a similar result for forest graphs.

Lemma 5 (Orientation of edges in undirected chains of standardized SCMs).

Let $\bf{x}^{s}$ be modeled by a standardized linear SCM (1) with chain DAG $\mathcal{G}=(\mathcal{V},\mathcal{E})$ , where $\operatorname{Var}[\varepsilon_{i}]=\sigma^{2}$ for non-root nodes and $\left\lvert w_{i,j}\right\rvert>1$ for all $i\in\text{pa}(j)$ . Additionally, suppose $\mathcal{G}$ contains no colliders. Then, given $p(\bf{x}^{s})$ and the partially directed graph $\smash{\tilde{\mathcal{G}}}$ representing the MEC of $\mathcal{G}$ , we can identify all but at most one edge $(v_{i},v_{j})$ of the true DAG $\mathcal{G}$ in each undirected connected component of the MEC $\smash{\tilde{\mathcal{G}}}$ . The edge that possibly remains unoriented has the smallest absolute covariance among all pairs of variables connected by edges in the MEC, satisfying $\smash{\lvert\operatorname{Cov}[x^{s}_{i},x^{s}_{j}]\rvert<\lvert\operatorname{Cov}[x^{s}_{k},x^{s}_{l}]\rvert}\,$ for all $(k,l)\in\smash{\tilde{\mathcal{E}}}\setminus(i,j)$ .

Proof.

Figure 7: Proof subcases of Lemma 5. Three possible subgraphs in a chain without a collider. Panels: (a) Subsystem 1, (b) Subsystem 2, (c) Subsystem 3.

Throughout the proof, we label the nodes $v_{i}\in\mathcal{V}$ such that $v_{i-1}$ and $v_{i+1}$ are its neighbors for $i\in\{2,\dots,d-1\}$ . We start with the analysis of three arbitrary, consecutive vertices in a chain graph. The three possible subgraphs are depicted in Figure 7. We can always find $p\in\mathds{R}$ such that the variance of the latent root of this directed subgraph is $p^{2}\sigma^{2}$ . This relaxed assumption on specifically the root node allows for the root of the subgraph to have potential parents outside the subgraph, or to be the root of the whole chain, when later using this lemma to prove the main theorem.

We follow derivations similar to those in Section C.4.1. Specifically, we first write the observed variables of the standardized SCM in SCM form, and then invoke Lemma 1 to obtain the covariances of the observed variables. To use Equation (19), we again need to compute the marginal variances of the variables before standardization. For the subsystems in Figures 7(a) and 7(b), these are, respectively:

for Figure 7(a):	for Figure 7(b):
$\operatorname{Var}[x_{i}]=p^{2}\sigma^{2}$	$\operatorname{Var}[x_{i}]=(w_{i+1,i}^{2}p^{2}+1)\sigma^{2}$
$\operatorname{Var}[x_{i+1}]=(w_{i,i+1}^{2}p^{2}+1)\sigma^{2}$	$\operatorname{Var}[x_{i+1}]=p^{2}\sigma^{2}$
$\operatorname{Var}[x_{i+2}]=(w_{i+1,i+2}^{2}(w_{i,i+1}^{2}p^{2}+1)+1)\sigma^{2}$	$\operatorname{Var}[x_{i+2}]=(w_{i+1,i+2}^{2}p^{2}+1)\sigma^{2}$

By substituting the expressions for the marginal variances into Equation (19), we obtain the weights of the implied models of the standardized SCM. Using Lemma 1, we obtain the covariances between the observed variables $x_{i}^{s},x_{i+1}^{s},x_{i+2}^{s}$ . By construction, the marginal variances of the observed variables are equal to $1$ . We treat each subsystem separately:

Subsystem 1 (Figure 7(a))

Given the marginal variances and Lemma 1, the covariances are

	$\displaystyle\operatorname{Cov}[x^{s}_{i},x^{s}_{i+1}]$	$\displaystyle=\frac{w_{i,i+1}p}{\sqrt{w_{i,i+1}^{2}p^{2}+1}}$
	$\displaystyle\operatorname{Cov}[x^{s}_{i+1},x^{s}_{i+2}]$	$\displaystyle=w_{i+1,i+2}\sqrt{\frac{w_{i,i+1}^{2}p^{2}+1}{w_{i+1,i+2}^{2}(w_{i,i+1}^{2}p^{2}+1)+1}}$

Following the same algebraic manipulations as in Equation 20, substituting $\alpha_{1}:=w_{i,i+1}p$ and $\beta_{1}:=w_{i+1,i+2}$ in the derivation, we obtain

\displaystyle\left\lvert\operatorname{Cov}[x^{s}_{i},x^{s}_{i+1}]\right\rvert<\left\lvert\operatorname{Cov}[x^{s}_{i+1},x^{s}_{i+2}]\right\rvert\quad\Longleftrightarrow\quad\frac{w_{i,i+1}^{2}p^{2}}{w_{i,i+1}^{2}p^{2}+1}<w_{i+1,i+2}^{2}\,.

(23)

The left-hand side of the right-hand inequality in Equation 23 is strictly smaller than $1$ , similar to the 3-node case. Therefore, if we assume that $\lvert w_{i+1,i+2}\rvert\geq 1$ , it must hold that $\lvert\operatorname{Cov}[x^{s}_{i},x^{s}_{i+1}]\rvert<\lvert\operatorname{Cov}[x^{s}_{i+1},x^{s}_{i+2}]\rvert$ for any choice of $p$ .
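As a numerical sanity check of Equation 23, the following minimal sketch evaluates the two covariances of Subsystem 1 on a small grid. The function names and grid values are illustrative choices, not part of the original derivation; $\sigma^{2}$ is dropped because it cancels under standardization.

```python
import math

def subsystem1_covariances(w1, w2, p):
    """Covariances of the standardized chain v_i -> v_{i+1} -> v_{i+2}.

    w1 stands for w_{i,i+1}, w2 for w_{i+1,i+2}; the root has variance
    p^2 * sigma^2. sigma^2 cancels under standardization and is dropped.
    """
    var_mid = w1**2 * p**2 + 1.0       # Var[x_{i+1}] / sigma^2
    var_end = w2**2 * var_mid + 1.0    # Var[x_{i+2}] / sigma^2
    cov_first = w1 * p / math.sqrt(var_mid)
    cov_second = w2 * math.sqrt(var_mid / var_end)
    return cov_first, cov_second

def eq23_holds(w1, w2, p):
    """Right-hand inequality of Equation 23."""
    alpha_sq = (w1 * p) ** 2
    return alpha_sq / (alpha_sq + 1.0) < w2**2

# With |w2| >= 1, the first covariance is the smaller one for every p tried.
for w1 in (-3.0, -1.0, 0.2, 1.0, 3.0):
    for w2 in (-2.0, -1.0, 1.0, 1.5):
        for p in (0.1, 1.0, 10.0):
            c1, c2 = subsystem1_covariances(w1, w2, p)
            assert abs(c1) < abs(c2) and eq23_holds(w1, w2, p)
```

For $\lvert w_{i+1,i+2}\rvert<1$ the ordering can flip, for example with `w1=2.0, w2=0.5, p=1.0`, which is why the weight condition is needed.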

Subsystem 2 (Figure 7(b))

Given the marginal variances and Lemma 1, the covariances are

	$\displaystyle\operatorname{Cov}[x^{s}_{i},x^{s}_{i+1}]$	$\displaystyle=\frac{w_{i+1,i}p}{\sqrt{w_{i+1,i}^{2}p^{2}+1}}$
	$\displaystyle\operatorname{Cov}[x^{s}_{i+1},x^{s}_{i+2}]$	$\displaystyle=\frac{w_{i+1,i+2}p}{\sqrt{w_{i+1,i+2}^{2}p^{2}+1}}\,.$

The ordering of the covariances in this case depends on the specific choice of the weights.

Subsystem 3 (Figure 7(c))

Following steps analogous to the symmetric subsystem 1, we conclude that, if $\lvert{w_{i+1,i}}\rvert\geq 1$ , it must hold that $\lvert\operatorname{Cov}[x^{s}_{i},x^{s}_{i+1}]\rvert>\lvert\operatorname{Cov}[x^{s}_{i+1},x^{s}_{i+2}]\rvert$ for any $p$ .

Given the above, we can now study the relationship between the underlying DAG $\mathcal{G}$ and the absolute covariance magnitudes under the assumption that $\smash{\lvert w_{i,i+1}\rvert}>1$ . We will use the fact that, if the chain does not contain a collider, then there can be at most one node contained in edges pointing in opposite directions.

First, we treat the case where there exists a vertex $v_{i}$ such that $\lvert\operatorname{Cov}[x^{s}_{i-1},x^{s}_{i}]\rvert=\lvert\operatorname{Cov}[x^{s}_{i},x^{s}_{i+1}]\rvert$ , that is, where some neighboring covariances are equal. If this occurs in a 3-node subsystem, only subsystem 2 can describe the true graph. To be consistent with the assumption that there are no colliders in the graph (see Lemma 5), all other edges must be oriented in a direction away from $v_{i}$ , which completely identifies the graph $\mathcal{G}$ in the MEC.

In the second case, $\lvert\operatorname{Cov}[x^{s}_{j-1},x^{s}_{j}]\rvert\neq\lvert\operatorname{% Cov}[x^{s}_{j},x^{s}_{j+1}]\rvert$ holds for all nodes $v_{j}$ that have two neighbors in the path. Let $x^{s}_{i},x^{s}_{i+1}$ be the unique pair of consecutive variables in the chain that minimizes $\lvert\operatorname{Cov}[x^{s}_{i},x^{s}_{i+1}]\rvert$ . We can show that this pair is the unique minimizer using a proof by contradiction. Suppose there exist two pairs $x^{s}_{i},x^{s}_{i+1}$ and $x^{s}_{j},x^{s}_{j+1}$ such that $\lvert\operatorname{Cov}[x^{s}_{i},x^{s}_{i+1}]\rvert=\lvert\operatorname{Cov}% [x^{s}_{j},x^{s}_{j+1}]\rvert$ is the minimum covariance. Without loss of generality, let $j+1<i$ . Then, the triple $x^{s}_{i-1},x^{s}_{i},x^{s}_{i+1}$ is consistent with only subsystems 2 or 3 based on their relative covariances, which implies that we must have $v_{i-1}\leftarrow v_{i}$ . Using the fact that we have no colliders, we can then orient all edges $v_{k-1}\leftarrow v_{k}$ for $1<k<i$ . Thus, we can find a subsystem containing $v_{j},v_{j+1},v_{j+2}$ , which has been already oriented as subsystem 3, meaning $\lvert\operatorname{Cov}[x^{s}_{j},x^{s}_{j+1}]\rvert>\lvert\operatorname{Cov}% [x^{s}_{j+1},x^{s}_{j+2}]\rvert$ , a contradiction.

Given $x^{s}_{i},x^{s}_{i+1}$ is the unique pair of consecutive variables that minimizes $\lvert\operatorname{Cov}[x^{s}_{i},x^{s}_{i+1}]\rvert$ , we now show that we can orient all edges except $(v_{i},v_{i+1})$ . We will do this in two parts. First, we show that one can orient all edges $(v_{j},v_{j+1})$ with $j<i$ , and then we show that we can do the same for all edges $(v_{j},v_{j+1})$ with $j>i$ . If $i>1$ , consider the subsystem $v_{i-1},v_{i},v_{i+1}$ . Since $\smash{\lvert\operatorname{Cov}[x^{s}_{i-1},x^{s}_{i}]\rvert}>\smash{\lvert\operatorname{Cov}[x^{s}_{i},x^{s}_{i+1}]\rvert}$ , only subsystems 2 and 3 are possible for this subgraph. We can therefore orient $v_{i-1}\leftarrow v_{i}$ . Similarly, if $i<d-1$ , by a symmetric argument on $v_{i},v_{i+1},v_{i+2}$ , we can orient $v_{i+1}\rightarrow v_{i+2}$ . Since the graph cannot contain colliders, all other edges must be oriented as $v_{j}\leftarrow v_{j+1}$ for $j<i$ , and $v_{j}\rightarrow v_{j+1}$ for $j>i$ . In other words, all edges except $(v_{i},v_{i+1})$ point away from the two vertices $v_{i},v_{i+1}$ , and one of the two variables must be the root of the chain. Therefore, if $\smash{\lvert\operatorname{Cov}[x^{s}_{j-1},x^{s}_{j}]\rvert\neq\lvert\operatorname{Cov}[x^{s}_{j},x^{s}_{j+1}]\rvert}$ holds for all vertices $v_{j}$ that have two neighbors, then there exists a unique covariance-minimizing pair $x^{s}_{i},x^{s}_{i+1}$ , and all edges except $(v_{i},v_{i+1})$ are oriented.

The two cases above are exhaustive, and in the worst case at most one edge $(v_{j},v_{j+1})$ is left unoriented in the chain. This edge always corresponds to the minimizer of $\lvert\operatorname{Cov}[x^{s}_{j},x^{s}_{j+1}]\rvert$ . This completes the proof. ∎
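The two cases of the proof suggest a simple orientation procedure for a collider-free chain. The sketch below is an illustration under the assumptions of Lemma 5 and with exact (population) covariances; the function name, the 0-based indexing, and the tolerance `tol` are hypothetical choices made here.

```python
def orient_chain(abs_cov, tol=1e-9):
    """Orient the edges of a collider-free chain from consecutive covariances.

    abs_cov[e] is |Cov[x_e^s, x_{e+1}^s]| for the edge between the
    consecutive chain nodes v_e and v_{e+1} (0-indexed).
    Returns one symbol per edge: '->' for v_e -> v_{e+1}, '<-' for
    v_e <- v_{e+1}, and '?' for the edge that stays undirected.
    """
    m = len(abs_cov)
    # Case 1: two neighboring covariances coincide, so the shared node is
    # the root and every edge points away from it.
    for e in range(m - 1):
        if abs(abs_cov[e] - abs_cov[e + 1]) <= tol:
            root = e + 1
            return ['<-' if k < root else '->' for k in range(m)]
    # Case 2: the unique minimizer stays undirected; all other edges point
    # away from its two endpoints.
    i = min(range(m), key=lambda k: abs_cov[k])
    return ['<-' if k < i else ('->' if k > i else '?') for k in range(m)]

# Chain v_0 -> v_1 -> v_2 -> v_3 with all weights 2 and unit noise variance:
# consecutive covariances are 2/sqrt(5), 2*sqrt(5/21), 2*sqrt(21/85).
print(orient_chain([0.894, 0.976, 0.994]))
```

With estimated covariances, exact ties never occur, so a practical variant would need a statistical test in place of `tol`.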

Remark

From the proof of Lemma 5, it follows that if we are able to orient all the edges in the chain, then the root of the chain is the node joining the two edges with minimum absolute covariance. When we orient all but one edge $(v_{i},v_{i+1})$ , the root node of the chain is either $v_{i}$ or $v_{i+1}$ .

We can extend Lemma 5 to forest graphs. For this, we will make use of the first Meek rule (Meek,, 1995). The first Meek rule concerns an MEC $\smash{\tilde{\mathcal{G}}}$ , containing the undirected edges $(v_{i},v_{j}),(v_{j},v_{k})$ but not the edge $(v_{i},v_{k})$ . It states that, if one can orient $v_{i}\rightarrow v_{j}$ , we must have $v_{j}\rightarrow v_{k}$ .
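To make the first Meek rule concrete, the following sketch applies it repeatedly to a partially directed graph. The edge representation (ordered pairs for directed edges, two-element frozensets for undirected edges) and the function name are assumptions made for this illustration only.

```python
def apply_meek_rule_1(directed, undirected):
    """Repeatedly apply the first Meek rule.

    directed: set of pairs (a, b) meaning a -> b.
    undirected: set of frozenset({a, b}) meaning a - b.
    Whenever a -> b, b - c, and a, c are non-adjacent, orient b -> c.
    """
    directed = set(directed)
    undirected = set(undirected)

    def adjacent(a, b):
        return ((a, b) in directed or (b, a) in directed
                or frozenset((a, b)) in undirected)

    changed = True
    while changed:
        changed = False
        for a, b in list(directed):
            for edge in list(undirected):
                if b not in edge:
                    continue
                (c,) = edge - {b}
                if c != a and not adjacent(a, c):
                    undirected.discard(edge)
                    directed.add((b, c))
                    changed = True
    return directed, undirected

# Orienting 1 -> 2 propagates along the undirected path 2 - 3 - 4.
print(apply_meek_rule_1({(1, 2)}, {frozenset((2, 3)), frozenset((3, 4))}))
```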

See Theorem 3.

Proof.

The undirected parts of an MEC $\smash{\tilde{\mathcal{G}}}$ are disjoint undirected connected components. Orienting the edges in all these undirected connected components without introducing a v-structure produces a valid DAG $\mathcal{G}$ in $\smash{\tilde{\mathcal{G}}}$ (Andersson et al., 1997). Each undirected connected component represents a Markov equivalence class of its own (Andersson et al., 1997). Thus, to prove the theorem, we consider these undirected connected components independently with respect to the rest of the graph and show how to orient the edges in each undirected connected component. (Footnote 11: Orienting edges of an undirected connected component that touch a directed edge in $\smash{\tilde{\mathcal{G}}}$ never introduces an additional v-structure. If a directed edge pointed into the undirected connected component, the undirected edge downstream would have had to already be directed in $\smash{\tilde{\mathcal{G}}}$ by the first Meek rule. Hence, all directed edges bordering the undirected connected component must be oriented away from it, and none of the possible undirected edge orientations creates a new collider at the border node. This implies that all undirected connected components in $\smash{\tilde{\mathcal{G}}}$ are upstream of the colliders and directed subgraphs of $\smash{\tilde{\mathcal{G}}}$ .) In the following argument, we therefore consider $\smash{\tilde{\mathcal{G}}}$ to be a single undirected connected component, with no directed edges by definition, and show that we can orient all but one edge in $\smash{\tilde{\mathcal{G}}}$ . This argument then extends to all undirected connected components of the original MEC $\smash{\tilde{\mathcal{G}}}$ , implying the statement made in Theorem 3.

If $\smash{\tilde{\mathcal{G}}}$ is an undirected connected component with no directed edges, we only have to consider SCMs with a ground-truth DAG $\mathcal{G}$ that are members of this MEC $\smash{\tilde{\mathcal{G}}}$ to distinguish among possible edge orientations in $\smash{\tilde{\mathcal{G}}}$ . In the case of undirected trees, the ground-truth DAG $\mathcal{G}$ must be a tree with no colliders and the same skeleton as $\smash{\tilde{\mathcal{G}}}$ , since any other DAGs would belong to a different MEC.

Figure 8: Inductive step of the proof of Theorem 3. Ground-truth DAG $\mathcal{G}$ underlying an undirected connected component $\smash{\tilde{\mathcal{G}}}$ in some given MEC. The nodes $\mathcal{V}_{C}=\{v_{1},\dots,v_{k}\}$ are a longest chain in $\mathcal{G}$ . Using Lemma 5, we can orient all edges in $\smash{\tilde{\mathcal{G}}_{C}}$ except possibly $(v_{i},v_{i+1})$ (blue). Edges like $(v_{i-1},u)$ are oriented by the first Meek rule. After Lemma 5, we are left with either the single undirected tree of $v_{i}$ (left shaded tree) or the single undirected tree consisting of $(v_{i},v_{i+1})$ (blue) and both undirected trees of $v_{i}$ and $v_{i+1}$ (both shaded trees). Either $v_{i}$ or $v_{i+1}$ must be the root of $\smash{\mathcal{G}_{C}}$ . In this specific example, $v_{i}$ is the root of $\smash{\mathcal{G}_{C}}$ and is therefore the only node that can have a parent outside $\mathcal{G}_{C}$ . Any node in $\mathcal{G}$ may have directed, outgoing edges to children in an MEC that the undirected connected component $\smash{\tilde{\mathcal{G}}}$ may be a subgraph of.

We give a proof by strong induction on the number of vertices $\lvert\mathcal{V}\rvert$ in the MEC $\smash{\tilde{\mathcal{G}}}$ . The base case of the induction argument is an MEC with $\lvert\mathcal{V}\rvert=2$ nodes. This case holds trivially, since this MEC can contain at most one undirected edge. For the inductive step, we consider an undirected tree MEC $\smash{\tilde{\mathcal{G}}}$ with $\left\lvert\mathcal{V}\right\rvert=d$ and assume that we can orient all but one edge of undirected tree MECs with $\left\lvert\mathcal{V}\right\rvert<d$ .

Our argument will proceed by considering the longest chain of the undirected tree $\smash{\tilde{\mathcal{G}}}$ . We will use Lemma 5 to orient all but at most one edge in this chain and then apply the first Meek rule to possibly orient additional edges in $\smash{\tilde{\mathcal{G}}}$ outside the chain. After orienting these edges, we show that we reduced the original problem of orienting all but one edge in $\smash{\tilde{\mathcal{G}}}$ with $\left\lvert\mathcal{V}\right\rvert=d$ to orienting all but one edge in a single undirected connected component that has strictly fewer than $d$ nodes. This allows us to apply the inductive hypothesis and complete the proof (see Figure 8).

Consider a longest undirected chain $\tilde{\mathcal{G}}_{C}=(\mathcal{V}_{C},\tilde{\mathcal{E}}_{C})$ that is a subgraph of the undirected tree $\smash{\tilde{\mathcal{G}}}$ . Let $\smash{\mathcal{G}_{C}}$ refer to the directed subgraph of the DAG $\mathcal{G}$ induced by considering only the vertices $\mathcal{V}_{C}$ . We label the $k$ vertices in $\mathcal{V}_{C}$ as $v_{1},...,v_{k}$ , with undirected edges $(v_{i},v_{i+1})\in\smash{\tilde{\mathcal{E}}}$ for all $i\in\{1,\dots,k-1\}$ . The nodes $v_{1},v_{k}$ can have no undirected neighbours in $\smash{\tilde{\mathcal{G}}}$ outside the chain, because otherwise we could construct a longer chain in $\smash{\tilde{\mathcal{G}}}$ .

The only vertex in $\mathcal{V}_{C}$ that can have a parent in the DAG $\mathcal{G}$ outside the chain $\smash{\mathcal{G}_{C}}$ , that is, in $\mathcal{V}\smash{\setminus}\mathcal{V}_{C}$ , is the unique root of $\smash{\mathcal{G}_{C}}$ . To see this, we first note that all nodes $v_{i}$ have at most one parent in $\mathcal{G}$ , because any $v_{i}$ with $\left\lvert\text{pa}(v_{i})\right\rvert>1$ in $\mathcal{G}$ would be a collider, but $\mathcal{G}$ contains no colliders. Since non-root nodes in $\smash{\mathcal{G}_{C}}$ have an in-chain parent, they cannot have a parent outside of $\mathcal{V}_{C}$ . Therefore, apart from the root node of $\smash{\mathcal{G}_{C}}$ and its potential outside parent, no node of $\smash{\mathcal{G}_{C}}$ receives a causal influence from the rest of $\mathcal{G}$ . This implies that we may treat $\smash{\mathcal{G}_{C}}$ as a separate standardized SCM with undirected chain MEC, in which the potential parent of the root of $\smash{\mathcal{G}_{C}}$ is modeled as part of the exogenous noise of the root. This allows us to apply Lemma 5 to the variables of the subgraph $\smash{\mathcal{G}_{C}}$ .

By applying Lemma 5 to $\smash{\mathcal{G}_{C}}$ , we can orient all but at most one undirected edge in $\smash{\mathcal{\tilde{G}}_{C}}$ . We split the resulting analysis into the two cases of Lemma 5, which leave either $0$ or $1$ undirected edge. In the first case, we can orient all edges in $\smash{\mathcal{\tilde{G}}_{C}}$ with Lemma 5. In this case, we know that the root of $\smash{\mathcal{G}_{C}}$ is the node $v_{i}$ joining the two edges with minimum absolute covariance (see Remark of Lemma 5). By the first Meek rule, we can recursively orient all additional edges in $\smash{\tilde{\mathcal{G}}}$ outside of $\smash{\mathcal{\tilde{G}}_{C}}$ away from $v_{i}$ , except for the subtrees of $\smash{\tilde{\mathcal{G}}}$ connected to $v_{i}$ itself (Figure 8). This leaves at most a single connected undirected subtree containing $v_{i}$ and strictly fewer than $d$ vertices.

In the second case, we orient all but one edge $(v_{i},v_{i+1})$ in $\smash{\mathcal{\tilde{G}}_{C}}$ by applying Lemma 5. In this case, we know that the root of $\smash{\mathcal{G}_{C}}$ is either the node $v_{i}$ or $v_{i+1}$ (see Remark of Lemma 5). Similar to the first case, we can recursively use the first Meek rule to orient all additional edges in $\smash{\tilde{\mathcal{G}}}$ pointing away from $v_{i}$ and $v_{i+1}$ , except for the subtrees of $\smash{\tilde{\mathcal{G}}}$ connected to $v_{i}$ and $v_{i+1}$ themselves. Since $v_{i}$ and $v_{i+1}$ are connected by an undirected edge, we are left with a single connected subtree containing the undirected edge $(v_{i},v_{i+1})$ that is strictly smaller than before.

In both cases, we orient at least one undirected edge of $\smash{\tilde{\mathcal{G}}}$ , because the longest undirected chain in $\smash{\tilde{\mathcal{G}}}$ with $\left\lvert\mathcal{V}\right\rvert>2$ has length at least $2$ . We always obtain at most a single undirected connected tree component with strictly fewer than $d$ vertices, allowing us to apply the inductive hypothesis and complete the proof.

∎

iSCM

See Theorem 4.

Proof.

Because we consider linear iSCMs with Gaussian noise, the implied model is a linear SCM with additive Gaussian noise (see Section A.2). Hence, the observational distribution is a multivariate Gaussian with mean zero. In iSCMs, the marginal variance of an observed variable is always $1$ . Hence, we prove the statement if we show that for all $\widetilde{x}_{i},\widetilde{x}_{j}$ in the iSCM with graph $\mathcal{G}$ , and the corresponding $\widetilde{x}_{i}^{\prime},\widetilde{x}_{j}^{\prime}$ in the iSCM with graph $\mathcal{G}^{\prime}=(\mathcal{V},\mathcal{E}^{\prime})$ , $\operatorname{Cov}[\widetilde{x}_{i},\widetilde{x}_{j}]=\operatorname{Cov}[\widetilde{x}_{i}^{\prime},\widetilde{x}_{j}^{\prime}]$ .

Let $\widetilde{x}_{i}^{\prime}$ and $\smash{\widetilde{x}_{j}^{\prime}}$ be the random variables associated with the nodes $v_{i}$ and $v_{j}$ from $\mathcal{G}^{\prime}$ , respectively. We consider two cases. First, if there is no path between $v_{i}$ and $v_{j}$ in the skeleton of $\mathcal{G}^{\prime}$ then there is no path between $v_{i}$ and $v_{j}$ in the skeleton of $\mathcal{G}$ and hence $\operatorname{Cov}[\widetilde{x}_{i},\widetilde{x}_{j}]=\operatorname{Cov}[% \widetilde{x}_{i}^{\prime},\widetilde{x}_{j}^{\prime}]=0.$ In the second case, there is a path between $v_{i}$ and $v_{j}$ in the skeleton of $\mathcal{G}^{\prime}$ , so there also exists a path in the skeleton of $\mathcal{G}$ , as both graphs have the same skeleton. Due to the acyclicity of the skeleton in forests, this path is the only one connecting $v_{i}$ and $v_{j}$ in both $\mathcal{G}$ and $\mathcal{G}^{\prime}$ .

Figure 9: Proof subcases of Theorem 4. Panels: (a) first subcase; (b) second subcase (more than one parent in $\mathcal{G}^{\prime}$ ); (c) second subcase (a single parent in $\mathcal{G}^{\prime}$ ). (a) Path with a collider. In other words, a path blocked by the empty set. In the case of forests, this configuration implies that $v_{i}$ and $v_{j}$ are $d$ -separated. (b) Unblocked path connecting $v_{i}$ and $v_{j}$ with one of the path nodes having a parent both in the path and outside the path. The weight $w_{p,k}$ influences the weight $\widetilde{w}_{l,k}$ in the implied model of the iSCM. If this structure is present in a forest, it has to be present in other graphs in the same MEC. (c) Unblocked path connecting $v_{i}$ and $v_{j}$ with the only parent of $v_{k}$ being part of the considered path. The weight $\widetilde{w}_{l,k}$ depends only on $w_{l,k}$ , irrespective of the edge direction.

We further break this second case into two subcases. In the first subcase, this path contains a collider in $\mathcal{G}$ as shown in Figure 9(a). Because the skeleton cannot have undirected cycles under the forest assumption, this collider forms a $v$ -structure. Since $\mathcal{G}^{\prime}\in\smash{\tilde{\mathcal{G}}}$ , the same $v$ -structure must be present in $\mathcal{G}^{\prime}$ . Hence, $v_{i}$ and $v_{j}$ are $d$ -separated in both $\mathcal{G}$ and $\mathcal{G}^{\prime}$ . By the global Markov condition, this implies that $\widetilde{x}_{i}^{\prime}$ and $\widetilde{x}_{j}^{\prime}$ are independent, and that $\widetilde{x}_{i}$ and $\widetilde{x}_{j}$ are independent. Therefore, $\operatorname{Cov}[\widetilde{x}_{i}^{\prime},\widetilde{x}_{j}^{\prime}]=\operatorname{Cov}[\widetilde{x}_{i},\widetilde{x}_{j}]=0$ .

In the second subcase, there exists an unblocked path between $v_{i}$ and $v_{j}$ in both $\mathcal{G}$ and $\mathcal{G}^{\prime}$ . Here, we denote the weight matrix associated with both iSCMs by $W:=[w_{i,j}]$ , with $W$ being symmetric, so that $w_{i,j}=w_{j,i}$ is the linear weight of the edge $(i,j)$ regardless of its orientation in the graph.

We now derive the analogous weights $\widetilde{W},\widetilde{W}^{\prime}$ in the implied SCMs for $\mathcal{G},\mathcal{G}^{\prime}$ respectively. Ultimately, we will demonstrate that the implied SCMs have the same weights. Specifically, we will show that $\smash{\widetilde{w}_{k,l}}=\smash{\widetilde{w}_{k,l}^{\prime}}$ for every pair of adjacent nodes $v_{k},v_{l}$ on the path. Given this, Lemma 1 implies that both iSCMs have the same covariance matrix over the observed variables.

Without loss of generality, since the node labelling is arbitrary, let $v_{k}$ have at least as many incoming edges as $v_{l}$ in $\mathcal{G}^{\prime}$ . We divide the analysis into two cases: $v_{k}$ having only $1$ parent in $\mathcal{G}^{\prime}$ , and $v_{k}$ having more than $1$ parent. The node $v_{k}$ must have at least one parent, since at least one of $v_{k},v_{l}$ has an incoming edge in $\mathcal{G}^{\prime}$ , and we chose $v_{k}$ to have at least as many incoming edges as $v_{l}$ .

More than one parent in $\mathcal{G}^{\prime}$

We know that any collider in $\mathcal{G}^{\prime}$ will appear as part of a $v$ -structure in $\smash{\tilde{\mathcal{G}}}$ due to the forest assumption, and therefore will also be a collider in $\mathcal{G}$ . Therefore, if $v_{k}$ has more than one parent in $\mathcal{G}^{\prime}$ (see Figure 9(b)), all pairs of edges incoming to $v_{k}$ will form $v$ -structures, so $v_{k}$ must have exactly the same set of parents in $\mathcal{G}$ .

Moreover, any two parents of $v_{k}$ are d-separated in $\mathcal{G}$ and $\mathcal{G}^{\prime}$ by the forest assumption, since the blocked path going through $v_{k}$ is the only path connecting them. By the global Markov condition, the parents are pairwise independent. Hence, we can use Equation (11) to compute $\widetilde{w}_{k,l},\widetilde{w}_{k,l}^{\prime}$ . Since the parent sets are the same between the two graphs, and $W$ is shared between the two iSCMs, the weight associated with the edge $(l,k)$ in both graphs in the implied models is given by

\widetilde{w}_{l,k}=\widetilde{w}_{l,k}^{\prime}=\frac{w_{l,k}}{\sqrt{\sum_{u\in\mathrm{pa}(k)}w_{u,k}^{2}+\sigma^{2}}}\,.

(24)

A single parent in $\mathcal{G}^{\prime}$

Let $(l,k)$ be the only incoming edge to $v_{k}$ in $\mathcal{G}^{\prime}$ , as depicted in Figure 9(c). Then, the edge connecting $v_{l}$ and $v_{k}$ in $\mathcal{G}$ is either the only incoming edge to $v_{k}$ or the only incoming edge to $v_{l}$ . To see this, suppose that it was not the only incoming edge to $v_{k}$ or $v_{l}$ in $\mathcal{G}$ . This would make $v_{k}$ or $v_{l}$ a collider that would be common to both graphs, implying that $v_{k}$ or $v_{l}$ would have at least two parents in $\mathcal{G}^{\prime}$ . Since we assume that $v_{k}$ has at least as many parents as $v_{l}$ , this would imply that $v_{k}$ has more than one parent, contradicting the assumption of the case considered in this paragraph. Irrespective of the direction, the weight associated with the edge $(l,k)$ in the skeleton of both graphs in the implied model is, similar to Equation (21), given by

\widetilde{w}_{l,k}=\widetilde{w}_{l,k}^{\prime}=\frac{w_{l,k}}{\sqrt{w_{l,k}^{2}+\sigma^{2}}}\,.

(25)

Equations (24) and (25) show that, for the SCM form of each iSCM, the edges connecting the same nodes irrespective of their direction in $\mathcal{G}^{\prime}$ and $\mathcal{G}$ have the same weights. By Lemma 1, the covariance between any $\widetilde{x}_{i}$ and $\widetilde{x}_{j}$ can be expressed as a product of the weights in the implied SCM corresponding to the edges on the path between $v_{i},v_{j}$ . Hence, $\operatorname{Cov}[\widetilde{x}_{i},\widetilde{x}_{j}]=\operatorname{Cov}[\widetilde{x}_{i}^{\prime},\widetilde{x}_{j}^{\prime}]$ . ∎
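The statement can also be checked numerically at the population level. The sketch below computes the covariance matrix of a linear iSCM by processing nodes in a topological order and standardizing each variable by its exact standard deviation; the helper names, the example tree, and the weights are illustrative assumptions rather than part of the proof.

```python
import numpy as np

def iscm_covariance(A, W, sigma2=1.0):
    """Population covariance of the observed variables of a linear iSCM.

    A[i, j] is True iff i -> j; W is the symmetric weight matrix shared
    by all orientations of the skeleton.
    """
    d = A.shape[0]
    cov = np.zeros((d, d))
    done = []
    while len(done) < d:
        # next node whose parents have all been processed (topological order)
        j = next(v for v in range(d) if v not in done
                 and all(u in done for u in np.flatnonzero(A[:, v])))
        pa = np.flatnonzero(A[:, j])
        w = W[pa, j]
        # standard deviation of x_j before standardization
        sd = np.sqrt(w @ cov[np.ix_(pa, pa)] @ w + sigma2)
        for k in done:
            cov[k, j] = cov[j, k] = (w @ cov[pa, k]) / sd
        cov[j, j] = 1.0
        done.append(j)
    return cov

def adjacency(d, edges):
    A = np.zeros((d, d), dtype=bool)
    for i, j in edges:
        A[i, j] = True
    return A

# Two collider-free orientations of the same tree skeleton (same MEC).
W = np.zeros((4, 4))
for (i, j), w in {(0, 1): 1.5, (1, 2): -0.7, (1, 3): 2.0}.items():
    W[i, j] = W[j, i] = w
G = adjacency(4, [(0, 1), (1, 2), (1, 3)])
G_prime = adjacency(4, [(1, 0), (1, 2), (1, 3)])
assert np.allclose(iscm_covariance(G, W), iscm_covariance(G_prime, W))
```

Applying the same function to two orientations of a fully connected 3-node graph with unit weights yields different covariance matrices, in line with the forest assumption being necessary.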

Figure 10 shows an example for Theorem 4 for two trees from the same MEC.

Remark

In Figure 11, we empirically demonstrate that Theorem 4 no longer holds if we drop the forest assumption. For data generated from an iSCM and two graphs from the same $\smash{\tilde{\mathcal{G}}}$ with the same weights assigned to the skeleton edges, we observe that the estimated covariances differ. The two systems entail different observational distributions.

Appendix D Background on Related Work

D.1 Heuristics for Mitigating Variance Accumulation and $\operatorname{Var}$ -sortability in SCMs

Here, we review existing heuristics for avoiding the exploding variance in structure learning benchmarking with linear SCMs as defined in Equation (1). We also describe how these heuristics limit the causal dependencies that can be modeled, in terms of the correlations among the SCM variables or their cause-explained variance. Neither limitation occurs in linear iSCMs.

Scaling weights by the inverse weight norm

Mooij et al., (2020, Section 5.2) sample the edge weights in linear SCMs as $w_{i,j}\sim\operatorname{Unif}_{\pm}{[0.5,1.5]}$ . To achieve a comparable variance of each variable $x_{j}$ in the SCM, they propose re-scaling the sampled weights prior to the data-generating process as

\displaystyle w_{i,j}\leftarrow\frac{w_{i,j}}{\sqrt{1+\sum_{k\in\mathrm{pa}(j)}w_{k,j}^{2}}}\,.

If all parents of $x_{j}$ are i.i.d. Gaussian with variance $1$ , this adjustment ensures that the variance of $x_{j}$ is similar for all $x_{j}$ . However, this approximation does not take into account the covariances of the parents. Moreover, since $\smash{\operatorname{Var}[\varepsilon_{j}]}$ is unchanged, the scaling limits the strength of the causal effect that parents can have on $x_{j}$ . For example, when $x_{1}=\varepsilon_{1}$ and $x_{2}=wx_{1}+\varepsilon_{2}$ with $\smash{\operatorname{Var}[\varepsilon_{j}]=1}$ as for Mooij et al., (2020), the adjusted weight is $\smash{w^{\prime}=w/\sqrt{1+w^{2}}}<1$ . Thus, for any $w\neq 0$ , we have

\displaystyle\lvert\mathrm{Corr}[x_{1},x_{2}]\rvert=\frac{\lvert\operatorname{Cov}[\varepsilon_{1},\smash{w^{\prime}}\varepsilon_{1}+\varepsilon_{2}]\rvert}{\sqrt{\operatorname{Var}[\varepsilon_{1}]\operatorname{Var}[\smash{w^{\prime}}\varepsilon_{1}+\varepsilon_{2}]}}=\frac{\lvert\smash{w^{\prime}}\rvert}{\sqrt{\smash{w^{\prime}}^{2}+1}}<\frac{1}{\sqrt{2}}\approx 0.707\,.

This is the maximum correlation between neighbouring variables that any SCM can model under the proposed re-scaling when $\smash{\operatorname{Var}[\varepsilon_{j}]=1}$ , since additional parents decrease the parent-child correlations. By contrast, iSCMs can model any level of correlation by sampling arbitrary values of $w_{i,j}$ , while guaranteeing unit-variance observations $x_{j}$ . Intuitively, iSCMs achieve this by standardizing $x_{j}$ after the exogenous noise $\varepsilon_{j}$ is added to the endogenous contributions of the parents $\mathbf{x}_{\mathrm{pa}(j)}$ , while weight scaling is done before $\varepsilon_{j}$ is added to $x_{j}$ .
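The bound can be illustrated for the two-variable example with unit noise variance. The sketch below compares the parent-child correlation under inverse-norm re-scaling with that of an iSCM using the same raw weight $w$ ; the function names are hypothetical and the iSCM expression follows from standardizing $x_{2}=wx_{1}+\varepsilon_{2}$ with $\operatorname{Var}[x_{1}]=\operatorname{Var}[\varepsilon_{2}]=1$ .

```python
import math

def corr_rescaled_scm(w):
    """|Corr[x1, x2]| after inverse-norm re-scaling of a single edge weight."""
    w_adj = w / math.sqrt(1.0 + w**2)
    return abs(w_adj) / math.sqrt(w_adj**2 + 1.0)

def corr_iscm(w):
    """|Corr[x1, x2]| in a two-variable iSCM with unit noise variance."""
    return abs(w) / math.sqrt(w**2 + 1.0)

for w in (0.5, 1.0, 2.0, 10.0, 100.0):
    # the re-scaled SCM stays below 1/sqrt(2); the iSCM approaches 1
    print(w, round(corr_rescaled_scm(w), 4), round(corr_iscm(w), 4))
```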

Scaling weights by the incoming variance

Squires et al., (2022, Section 5.1) sample the weights of linear SCMs as $w_{i,j}\sim\operatorname{Unif}_{\pm}{[0.25,1.0]}$ . Given the initial edge weights, they propose adjusting the weights during the generative process by first estimating the variance $\smash{\hat{\sigma}_{j}^{2}}$ of $x_{j}$ from samples drawn under an initial level of additive noise with $\operatorname{Var}[\varepsilon_{j}]=1$ and then re-scaling the weights as

\displaystyle w_{i,j}\leftarrow\frac{w_{i,j}}{\sqrt{2\hat{\sigma}_{j}^{2}}}\,.

When using additive noise with $\operatorname{Var}[\varepsilon_{j}]=0.5$ to generate the actual samples, this scaling results in $\operatorname{Var}[x_{j}]=1$ with a constant fraction of cause-explained variance $\operatorname{CEV_{f}}[x_{j}]=0.5$ . In benchmarks, however, we may be interested in evaluating SCMs with arbitrary levels of cause-explained variance. iSCMs allow this by construction. Contrary to Squires et al., (2022), iSCMs scale the variables $x_{j}$ rather than the weights $w_{i,j}$ while leaving the exogenous noise $\varepsilon_{j}$ unchanged, which enables modeling arbitrarily small or large levels of unexplained variation.

D.2 Sortability Metrics

In this section, we describe the definition of a sortability metric as introduced by Reisach et al., (2024), which we use in Section 5. For a function $\tau$ , $\tau$ -sortability assigns a scalar in $[0,1]$ to the variables $\mathbf{x}$ and graph $\mathcal{G}$ (with weight matrix $W_{\mathcal{G}}$ ) as

\frac{\sum_{i=1}^{d}\sum_{p_{s\rightarrow t}\in W^{i}_{\mathcal{G}}}\text{incr}(\tau(\mathbf{x},s),\tau(\mathbf{x},t))}{\sum_{i=1}^{d}\sum_{p_{s\rightarrow t}\in W^{i}_{\mathcal{G}}}1}\quad\text{ where }\text{incr}(a,b)=\begin{cases}1&\text{if }a<b\\ \frac{1}{2}&\text{if }a=b\\ 0&\text{if }a>b\end{cases}

and $W^{i}_{\mathcal{G}}$ is the $i$ -th power of the adjacency matrix $W_{\mathcal{G}}$ and $p_{s\rightarrow t}\in W^{i}_{\mathcal{G}}$ if and only if at least one directed path from $v_{s}$ to $v_{t}$ of length $i$ exists in $\mathcal{G}$ . If $\tau(\mathbf{x},t)=\operatorname{Var}[x_{t}]$ , we obtain $\operatorname{Var}$ -sortability from Reisach et al., (2021). If

\tau(\mathbf{x},t)=R^{2}[x_{t}]=1-\frac{\operatorname{Var}[x_{t}-\mathds{E}[x_{t}|\mathbf{x}_{\{1,...,d\}\backslash\{t\}}]]}{\operatorname{Var}[x_{t}]}\,,

we obtain $\operatorname{R^{2}}$ -sortability. Estimating $R^{2}[x_{t}]$ requires performing regression of $x_{t}$ onto $\mathbf{x}_{\{1,...,d\}\backslash\{t\}}$ .
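A minimal sketch of the $\tau$ -sortability computation is given below, assuming the values $\tau(\mathbf{x},t)$ have already been computed for every node (for example, sample variances for $\operatorname{Var}$ -sortability). The function name is hypothetical, and a binary adjacency matrix is used so that path existence is not affected by cancelling weights.

```python
import numpy as np

def sortability(tau, adj):
    """tau-sortability of a DAG.

    tau: length-d array with the value tau(x, t) for every node t.
    adj: (d, d) array with adj[s, t] != 0 iff s -> t.
    """
    tau = np.asarray(tau, dtype=float)
    A = (np.asarray(adj) != 0).astype(int)
    d = A.shape[0]
    power = np.eye(d, dtype=int)
    num, den = 0.0, 0
    for _ in range(d):
        # power[s, t] == 1 iff a directed path of the current length exists
        power = ((power @ A) > 0).astype(int)
        src, dst = np.nonzero(power)
        num += np.sum(tau[src] < tau[dst]) + 0.5 * np.sum(tau[src] == tau[dst])
        den += len(src)
    return num / den if den else float("nan")

chain = np.array([[0, 1, 0], [0, 0, 1], [0, 0, 0]])
print(sortability([1.0, 5.0, 21.0], chain))  # variances increasing along the chain
```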

D.3 Structure Learning Algorithms

To complement the interpretation of the results in Section 5, we provide some background on the structure learning methods we evaluate.

Notears (Zheng et al.,, 2018)

Notears uses continuous optimization to minimize the regularized mean-squared error (MSE) between the variables modeled by a linear SCM and the observations, while enforcing a differentiable acyclicity constraint. The objective function of Notears is given by $F(\mathbf{W})=||\mathbf{X}-\mathbf{X}\mathbf{W}||^{2}_{F}/2n+\lambda||\mathbf{W}||_{1}$ , where $||\cdot||_{F}$ and $||\cdot||_{1}$ are the Frobenius and $\ell_{1}$ norms, respectively. After the objective is minimized, weights whose magnitude falls below a fixed threshold are set to zero.
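For concreteness, the objective $F(\mathbf{W})$ can be evaluated as in the following sketch. It covers only the regularized least-squares term stated above, not the acyclicity constraint or the optimizer, and it assumes that the $\ell_{1}$ norm is taken entrywise; the function name is hypothetical.

```python
import numpy as np

def notears_objective(X, W, lam):
    """F(W) = ||X - X W||_F^2 / (2 n) + lam * ||W||_1 (entrywise l1 norm)."""
    n = X.shape[0]
    residual = X - X @ W
    return np.sum(residual**2) / (2 * n) + lam * np.abs(W).sum()

X = np.array([[1.0, 2.0], [3.0, 4.0]])
W = np.array([[0.0, 2.0], [0.0, 0.0]])   # single edge 0 -> 1 with weight 2
print(notears_objective(X, W, lam=0.1))
```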

Avici (Lorch et al.,, 2022)

Avici is an amortized variational inference method that approximates the posterior distribution over causal structures given a dataset through a pretrained inference model. The variational approximation of Avici uses a fully-factored product of Bernoulli distributions for every possible graph edge. The inference model is a neural network that predicts the variational parameters of the Bernoulli distributions and is trained by minimizing the expected forward KL divergence between the true posterior and the approximation. To train the inference model, Avici can be optimized on any training distribution of (synthetic) dataset-graph pairs. Lorch et al., (2022) publish the pretrained parameters of inference models trained on standardized SCMs with linear and nonlinear mechanisms, which we evaluate in this work.

SortnRegress methods (Reisach et al.,, 2021, 2024)

The SortnRegress methods order the vertices by a chosen statistic and sparsely regress every node on all of its predecessors in the obtained order. They use Lasso regression with the Bayesian Information Criterion to learn the regression function for a given variable. $\operatorname{Var}$ -SortnRegress uses estimated marginal variances as the sorting criterion. $\operatorname{R^{2}}$ -SortnRegress uses the $\operatorname{R^{2}}$ coefficient of determination estimated after performing a regression of every variable onto all remaining variables. Rand-SortnRegress orders the vertices randomly.
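The sort-then-regress idea can be sketched in a few lines. Note that this illustration swaps in ordinary least squares followed by hard thresholding for the Lasso regression with BIC used by the actual methods, so it is not the reference implementation; the function name and the `threshold` parameter are assumptions made here.

```python
import numpy as np

def sort_n_regress(X, scores, threshold=0.3):
    """Order nodes by `scores` and regress each node on its predecessors.

    X: (n, d) data matrix; scores: length-d sorting statistic
    (e.g. X.var(axis=0) for a Var-SortnRegress-style ordering).
    Returns a (d, d) weight matrix W with W[i, j] != 0 iff i -> j is predicted.
    OLS plus thresholding stands in for the Lasso/BIC regression.
    """
    n, d = X.shape
    order = np.argsort(scores)
    Xc = X - X.mean(axis=0)
    W = np.zeros((d, d))
    for pos in range(1, d):
        j = order[pos]
        preds = order[:pos]
        coef, *_ = np.linalg.lstsq(Xc[:, preds], Xc[:, j], rcond=None)
        W[preds, j] = np.where(np.abs(coef) > threshold, coef, 0.0)
    return W

# Chain x0 -> x1 -> x2 with weights 2: variances grow along the causal order.
rng = np.random.default_rng(0)
x0 = rng.normal(size=20000)
x1 = 2.0 * x0 + rng.normal(size=20000)
x2 = 2.0 * x1 + rng.normal(size=20000)
X = np.column_stack([x0, x1, x2])
W_hat = sort_n_regress(X, X.var(axis=0))
```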

Appendix E Experimental Setup

E.1 Data

Causal mechanisms

We consider systems with additive noise, where

f_{i}(\mathbf{x},\varepsilon_{i})=h_{i}(\mathbf{x})+\varepsilon_{i},

for a chosen function $h_{i}$ . The Linear systems used in these experiments have causal mechanisms as defined in Equation 1. To model nonlinear systems, we use smooth nonlinear functional mechanisms as used by Lorch et al., (2022). Specifically, the function $h_{i}$ that models the relationship between $x_{i}$ and its parents is sampled from a Gaussian Process

h_{i}\sim\mathcal{GP}(0,k_{i})\,,

where $k_{i}$ is a squared exponential kernel $k_{i}(\mathbf{x},\mathbf{x}^{\prime})=c_{i}^{2}\exp\left(-||\mathbf{x}-\mathbf{x}^{\prime}||^{2}_{2}/2l_{i}^{2}\right)$ with output and length scales $c_{i}$ and $l_{i}$ respectively. We can approximately express the function sample $h_{i}$ analytically using random Fourier features (Rahimi and Recht, 2007) by sampling

h_{i}(\mathbf{x})=c_{i}\sqrt{\tfrac{2}{M}}\sum_{j=1}^{M}\alpha^{(i)}_{j}\cos\left(\tfrac{\omega^{(i)}_{j}\cdot\mathbf{x}}{l_{i}}+\delta^{(i)}_{j}\right)

where $\alpha^{(i)}_{j}\sim\mathcal{N}(0,1)$ , $\omega^{(i)}_{j}\sim\mathcal{N}(0,\mathbf{I})$ , and $\delta^{(i)}_{j}\sim\operatorname{Unif}{[0,2\pi]}$ are drawn independently for each $j\in\{1,\dots,M\}$ . In this work, we use $M=100$ .
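The random Fourier feature construction can be sketched as follows, reading the sum as running over $M$ independently drawn triples of amplitude, frequency, and phase. The function name and argument names are hypothetical, and the scales in the usage lines are arbitrary example values.

```python
import numpy as np

def sample_rff_mechanism(num_parents, length_scale, output_scale, M=100, rng=None):
    """Draw one approximate GP sample h_i via random Fourier features."""
    rng = np.random.default_rng() if rng is None else rng
    alpha = rng.normal(size=M)                      # amplitudes ~ N(0, 1)
    omega = rng.normal(size=(M, num_parents))       # frequencies ~ N(0, I)
    delta = rng.uniform(0.0, 2.0 * np.pi, size=M)   # phases ~ Unif[0, 2 pi]

    def h(x):
        x = np.atleast_2d(x)                              # (n, num_parents)
        phase = x @ omega.T / length_scale + delta        # (n, M)
        return output_scale * np.sqrt(2.0 / M) * np.cos(phase) @ alpha

    return h

h = sample_rff_mechanism(num_parents=3, length_scale=8.0, output_scale=15.0,
                         rng=np.random.default_rng(0))
values = h(np.random.default_rng(1).normal(size=(5, 3)))  # one value per row
```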

Generating a random model

Following prior work (Section 2), we sample random systems in any simulation performed in this work by first drawing a graph $\mathcal{G}$ from the specified random graph distribution. Given the graph $\mathcal{G}$ , we sample function parameters of the structural mechanisms over $\mathcal{G}$ . For linear systems, we sample $w_{i,j}\sim\operatorname{Unif}_{\pm}{}[a,b]$ , where $a,b$ are fixed, i.i.d. for every graph edge. Similarly, for nonlinear systems, for every graph vertex, we draw the length scales $l_{i}\sim\operatorname{Unif}[a_{1},b_{1}]$ and output scales $c_{i}\sim\operatorname{Unif}[a_{2},b_{2}]$ with predefined $a_{1},b_{1},a_{2},b_{2}$ .
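A random linear system of this kind could be drawn as in the sketch below. The way the Erdős-Rényi DAG is constructed here (thresholding an upper-triangular matrix and relabeling the nodes by a random permutation) is an assumption for illustration, as are the function and parameter names.

```python
import numpy as np

def sample_linear_model(d, edges_per_node, a, b, rng):
    """Sample a random DAG and edge weights w_ij ~ Unif_{+-}[a, b].

    Returns (adj, weights) where adj[i, j] is True iff i -> j.
    """
    # edge probability giving d * edges_per_node edges in expectation
    p = min(1.0, 2.0 * edges_per_node / (d - 1))
    upper = np.triu(rng.random((d, d)) < p, k=1)    # acyclic by construction
    perm = rng.permutation(d)
    adj = upper[np.ix_(perm, perm)]                 # hide the topological order
    magnitude = rng.uniform(a, b, size=(d, d))
    sign = rng.choice([-1.0, 1.0], size=(d, d))
    return adj, adj * magnitude * sign

adj, weights = sample_linear_model(d=20, edges_per_node=2, a=0.5, b=2.0,
                                   rng=np.random.default_rng(0))
```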

Sampling data from a model

Given a graph $\mathcal{G}$, noise distribution $\mathcal{P}_{\bm{\varepsilon}}$, and a set of functions $\{f_{1},\dots,f_{d}\}$, we sample $n$ datapoints from an SCM by traversing $\mathcal{G}$ in a topological ordering. For every vertex $v_{i}$, we draw a noise sample $\varepsilon_{i}\sim\mathcal{P}_{\varepsilon_{i}}^{n}$. The sample for $x_{i}$ is then deterministically computed by $f_{i}$ from the exogenous $\varepsilon_{i}$ and the parents of $x_{i}$. To sample from a standardized SCM, we draw a dataset from an SCM and standardize it. To sample from an iSCM, we use Algorithm 1.
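The three sampling schemes can be contrasted in a short NumPy sketch for linear mechanisms with standard normal noise. It is not a reproduction of Algorithm 1: it assumes that internal standardization is carried out with sample means and standard deviations immediately after each variable is computed, and the function names are hypothetical.

```python
import numpy as np

def sample_linear(W, n, rng, internal_standardization=False):
    """Sample n points from a linear SCM (or iSCM) with N(0, 1) noise.

    W[j, i] is the weight of edge j -> i. W is assumed strictly upper
    triangular, so the index order is a topological order.
    """
    d = W.shape[0]
    X = np.zeros((n, d))
    for i in range(d):
        eps = rng.normal(size=n)
        X[:, i] = X @ W[:, i] + eps
        if internal_standardization:
            # Assumption: standardize with sample statistics right away,
            # so that children already see the standardized parent.
            X[:, i] = (X[:, i] - X[:, i].mean()) / X[:, i].std()
    return X

def standardize(X):
    """Post-hoc standardization of a complete dataset."""
    return (X - X.mean(axis=0)) / X.std(axis=0)
```

Here `standardize(sample_linear(W, n, rng))` corresponds to a standardized SCM, while `sample_linear(W, n, rng, internal_standardization=True)` corresponds to an iSCM under the stated assumption.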

E.2 Experiment Configurations

Sortability

For Figures 4(a) and 14(a), we generate Erdős-Rényi graphs (Erdős and Rényi, 1959) with expected number of edges per vertex equal to $2$ and $4$, respectively. For Figures 4(b) and 14(b), we generate undirected scale-free graphs (Barabási and Albert, 1999) with $2$ and $4$ edges per node, respectively. Then, we orient the edges of these graphs according to random topological orderings. We do not sample ordered scale-free graphs directly to avoid high sortability by in-degree, which may confound the results. For all four figures, we generate Linear systems with weights sampled from three possible distributions, $w_{i,j}\sim\operatorname{Unif}_{\pm}{}[0.3,1.8]$, $w_{i,j}\sim\operatorname{Unif}_{\pm}{}[0.5,2.0]$, or $w_{i,j}\sim\operatorname{Unif}_{\pm}{}[1.3,3.0]$, and noise sampled from $\varepsilon_{i}\sim\mathcal{N}(0,1)$. For every model configuration, we sample $100$ systems and $n=1000$ data points each. We generate graphs of sizes $\{20,60,100,140,180,220\}$.

Structure Learning (Section 5.2)

For Figures 5(a) and 12, we sample Linear systems with weights $w_{i,j}\sim\operatorname{Unif}_{\pm}{}[0.5,2.0]$ . Following Lorch et al., (2022), Nonlinear mechanisms have length scales $l_{i}\sim\operatorname{Unif}[7.0,10.0]$ and output scales $c_{i}\sim\operatorname{Unif}[10.0,20.0]$ . Both mechanisms are defined in Appendix E.1. For Figures 13(a) and 13(b), we generate Linear systems with weights $w_{i,j}\sim\operatorname{Unif}_{\pm}{}[0.3,0.8]$ and $w_{i,j}\sim\operatorname{Unif}_{\pm}{}[1.3,3.0]$ . For all four figures, we sample random $\operatorname{ER}(20,2)$ and $\operatorname{ER}(100,2)$ graphs with noise $\varepsilon_{i}\sim\mathcal{N}(0,1)$ . For every model configuration, we sample $20$ systems and $n=1000$ data points each.

Noise Transfer

For Figure 5(b) (top), we sample SCMs, standardized SCMs, and iSCMs with exactly the same underlying graph and weights sampled from $w_{i,j}\sim\operatorname{Unif}_{\pm}{}[0.5,2.0]$ . The noise variables are drawn from $\varepsilon_{i}\sim\mathcal{N}(0,1)$ . Then, for every triple of SCM, standardized SCM, and iSCM that shares a graph and weights, we create two more SCMs with the same marginal variances as the SCM, but with the noise variances of the implied models of the standardized SCM and iSCM, respectively. Appendix E.5 provides a motivation and detailed explanation of this procedure. Figure 5(b) (top) shows the performance of Notears on the original SCMs and the two SCMs with transferred noise.

For Figure 5(b) (bottom), we sample multiple instances of standardized SCMs and iSCMs with weights drawn from $w_{i,j}\sim\operatorname{Unif}_{\pm}{}[0.5,2.0]$ and noise from $\varepsilon_{i}\sim\mathcal{N}(0,1)$. For every model instance, we approximate the density of the inverse of its implied noise variances using kernel density estimation. The figure shows the mean and standard deviation of the p.d.f. values over $100$ systems. For both figures, we use $\operatorname{ER}(100,2)$ graphs.

E.3 Methods

Notears (Zheng et al., 2018)

To run Notears, we use the original implementation provided by the authors of Zheng et al., (2018) (Apache-2.0 license). Before benchmarking Notears, we run a hyperparameter search to calibrate the weight penalty ( $\lambda$ ) and threshold on held-out instances of each data generation method. The hyperparameters can be found in Appendix E.4.

Avici (Lorch et al., 2022)

To evaluate Avici, we use the code and model checkpoints provided by the authors of the method (MIT license). Specifically, we use the model trained on linear data to benchmark the method on Linear systems and the model trained on nonlinear data to benchmark on Nonlinear systems. We score an edge as predicted if the probability prediction by Avici is greater than $0.5$ . Since the parameters are pretrained, the method has otherwise no tuneable hyperparameters.

Sortabilities and SortnRegress methods (Reisach et al., 2021, 2024)

To compute the sortability metrics and run the SortnRegress baselines, we use the CausalDisco library (BSD-3-Clause license) created by the authors of the method. The algorithms require no tuneable hyperparameters.

E.4 Hyperparameter Selection

To run Notears, we need to specify the regularization strength $\lambda$ and a weight threshold $\eta$ for thresholding the final weights for graph structure prediction. To select these hyperparameters, we run a parameter search with $\lambda\in\{0.0,0.05,0.1,0.15,0.2,0.25,0.3\}$ and three possible values of the weight threshold, $\eta\in\{0.1,0.2,0.3\}$. We perform the search on separate, held-out systems that follow the same configurations as the ones we present in our final experimental results. We run Notears $20$ times per configuration and choose the median $\operatorname{F1}$ score as the criterion for selecting the best hyperparameters. Table 1 presents all final hyperparameter configurations. For some hyperparameter configurations, $1$ in $20$ runs experienced numerical issues caused by the acyclicity constraint. However, this never occurs for the selected, optimal hyperparameters, neither when performing the hyperparameter search nor when running the reported experiments.
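The selection rule can be sketched as a plain grid search over $(\lambda,\eta)$ that maximizes the median F1 score. The callable `run_f1` below is a hypothetical stand-in that samples one held-out system, runs the structure learner with the given hyperparameters, and returns the F1 score; it is not part of the Notears code base.

```python
import itertools
import statistics

def select_hyperparameters(run_f1, lambdas, thresholds, num_runs=20):
    """Return the (lambda, eta) pair with the highest median F1 and that median.

    `run_f1(lam, eta, seed)` is a hypothetical callable returning the F1 score
    of one run on a held-out system.
    """
    best, best_score = None, float("-inf")
    for lam, eta in itertools.product(lambdas, thresholds):
        scores = [run_f1(lam, eta, seed) for seed in range(num_runs)]
        score = statistics.median(scores)
        if score > best_score:  # ties keep the first configuration found
            best, best_score = (lam, eta), score
    return best, best_score
```

Using the median rather than the mean makes the criterion insensitive to a single failed run, such as the occasional numerical issue mentioned above.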

Table 1: Notears hyperparameters for all experiments. Final settings for the regularization strength $\lambda$ and the weight threshold $\eta$ after hyperparameter tuning on the respective models and data-generating processes together with the F1 (median) validation scores achieved by Notears.

(a) $\operatorname{ER}(20,2)$ DAGs, Linear mechanisms

Weight Distribution	Model	$\lambda$	$\eta$	F1 (median)
$\operatorname{Unif}_{\pm}{[0.3,0.8]}$	SCM	0.05	0.20	0.97
$\operatorname{Unif}_{\pm}{[0.3,0.8]}$	Standardized SCM	0.15	0.10	0.59
$\operatorname{Unif}_{\pm}{[0.3,0.8]}$	iSCM	0.15	0.10	0.57
$\operatorname{Unif}_{\pm}{[0.5,2.0]}$	SCM	0.00	0.30	0.98
$\operatorname{Unif}_{\pm}{[0.5,2.0]}$	Standardized SCM	0.15	0.20	0.30
$\operatorname{Unif}_{\pm}{[0.5,2.0]}$	iSCM	0.15	0.10	0.50
$\operatorname{Unif}_{\pm}{[1.3,3.0]}$	SCM	0.05	0.30	0.98
$\operatorname{Unif}_{\pm}{[1.3,3.0]}$	Standardized SCM	0.25	0.10	0.24
$\operatorname{Unif}_{\pm}{[1.3,3.0]}$	iSCM	0.20	0.10	0.40

(b) $\operatorname{ER}(100,2)$ DAGs, Linear mechanisms

Weight Distribution	Model	$\lambda$	$\eta$	F1 (median)
$\operatorname{Unif}_{\pm}{[0.3,0.8]}$	SCM	0.10	0.10	0.99
$\operatorname{Unif}_{\pm}{[0.3,0.8]}$	Standardized SCM	0.10	0.10	0.83
$\operatorname{Unif}_{\pm}{[0.3,0.8]}$	iSCM	0.10	0.10	0.84
$\operatorname{Unif}_{\pm}{[0.5,2.0]}$	SCM	0.05	0.30	0.94
$\operatorname{Unif}_{\pm}{[0.5,2.0]}$	Standardized SCM	0.15	0.10	0.47
$\operatorname{Unif}_{\pm}{[0.5,2.0]}$	iSCM	0.15	0.10	0.76
$\operatorname{Unif}_{\pm}{[1.3,3.0]}$	SCM	0.10	0.30	0.82
$\operatorname{Unif}_{\pm}{[1.3,3.0]}$	Standardized SCM	0.20	0.10	0.30
$\operatorname{Unif}_{\pm}{[1.3,3.0]}$	iSCM	0.15	0.10	0.70

(c) $\operatorname{ER}(20,2)$ DAGs, Nonlinear mechanisms

Model	$\lambda$	$\eta$	F1 (median)
SCM	0.15	0.30	0.58
Standardized SCM	0.15	0.10	0.33
iSCM	0.15	0.20	0.42

(d) $\operatorname{ER}(100,2)$ DAGs, Nonlinear mechanisms

Model	$\lambda$	$\eta$	F1 (median)
SCM	0.30	0.30	0.50
Standardized SCM	0.15	0.10	0.43
iSCM	0.15	0.10	0.61

(e) Noise transfer experiment: $\operatorname{ER}(100,2)$ DAGs, Linear mechanisms, $w_{i,j}\sim\operatorname{Unif}_{\pm}{[0.5,2.0]}$

Model	$\lambda$	$\eta$	F1 (median)
Original	0.05	0.30	0.96
Noise from standardized SCM	0.10	0.30	0.72
Noise from iSCM	0.05	0.30	0.82

E.5 Transferring Noise Variances While Keeping $\operatorname{Var}$ -Sortability Unchanged

Reisach et al. (2021) show that post-hoc standardization of SCM data strongly impairs the performance of Notears. When comparing the performance of Notears between data sampled from iSCMs and standardized SCMs, there are at least two factors that can affect the performance of Notears: low $\operatorname{Var}$ -sortability and the violation of the equal noise variance assumption. Our experiments in Figure 5(b) of Section 5 aim at isolating the effect of the latter. Specifically, we investigate whether Notears performs better on $\operatorname{Var}$ -sortable datasets that have the noise scale patterns implied when assuming SCMs generated the data, when in fact the data was sampled from iSCMs or standardized SCMs. To achieve this, we ensure that the $\operatorname{Var}$ -sortability of the data sampled from the models is the same, here close to $1$.

Given two linear SCMs $S^{a}$ and $S^{b}$ with the same underlying graph $\mathcal{G}$, our goal is to construct a system $S^{t}$ with the same marginal variances as $S^{a}$ (condition 1) and the same noise variances as $S^{b}$ (condition 2). For this task to be well-defined, we assume that the noise variances of the root variables in $S^{a}$ and $S^{b}$ are the same. The first step in constructing $S^{t}$ is to copy the noise variances from $S^{b}$, so that for every $i\in\{1,\dots,d\}$,

{\sigma_{i}^{2}}^{t}:={\sigma_{i}^{2}}^{b}\,.

This satisfies condition 2. Given this, we define $x_{i}^{t}$ as

x_{i}^{t}:=\sqrt{\frac{\operatorname{Var}[x_{i}^{a}]-{\sigma_{i}^{2}}^{b}}{\operatorname{Var}[{\mathbf{w}^{a}_{i}}^{T}\mathbf{x}_{\mathrm{pa}(i)}^{t}]}}{\mathbf{w}^{a}_{i}}^{T}\mathbf{x}_{\mathrm{pa}(i)}^{t}+\varepsilon_{i}^{t}\,,

where $\varepsilon_{i}^{t}$ has variance ${\sigma_{i}^{2}}^{t}$ . By construction, the condition of $S^{t}$ sharing the noise variances with $S^{b}$ and the marginal variances with $S^{a}$ is fulfilled for the root variables. For all the remaining variables, it holds that

\begin{aligned}
\operatorname{Var}[x_{i}^{t}]&=\operatorname{Var}\left[\sqrt{\frac{\operatorname{Var}[x_{i}^{a}]-{\sigma_{i}^{2}}^{b}}{\operatorname{Var}[{\mathbf{w}^{a}_{i}}^{T}\mathbf{x}_{\mathrm{pa}(i)}^{t}]}}{\mathbf{w}^{a}_{i}}^{T}\mathbf{x}_{\mathrm{pa}(i)}^{t}+\varepsilon_{i}^{t}\right]\\
&=\frac{\operatorname{Var}[x_{i}^{a}]-{\sigma_{i}^{2}}^{b}}{\operatorname{Var}[{\mathbf{w}^{a}_{i}}^{T}\mathbf{x}_{\mathrm{pa}(i)}^{t}]}\operatorname{Var}[{\mathbf{w}^{a}_{i}}^{T}\mathbf{x}_{\mathrm{pa}(i)}^{t}]+{\sigma_{i}^{2}}^{b}\\
&=\operatorname{Var}[x_{i}^{a}]\,,
\end{aligned}

which satisfies condition 1. Since the systems $S^{t}$ and $S^{a}$ have the same marginal variances, they have the same $\operatorname{Var}$ -sortability. In the noise transfer experiment of Figure 5(b), we transfer the noise variances from the implied models of iSCMs and standardized SCMs. To obtain the noise variances in the implied models, we divide the original noise variances (equal to $1$) by the marginal variance of the corresponding variable before standardization, which we estimate from $n=1000$ datapoints. For iSCMs, this corresponds to an empirical estimate of Equation (7).
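The construction can be checked numerically with a small sketch that works with population covariances instead of samples. It assumes that the weight matrix of $S^{a}$ is strictly upper triangular, so that the index order is topological, and that $\operatorname{Var}[x_{i}^{a}]\geq{\sigma_{i}^{2}}^{b}$ for all $i$. The function names are hypothetical.

```python
import numpy as np

def linear_scm_covariance(W, noise_var):
    """Population covariance of x = W^T x + eps with Var[eps_i] = noise_var[i]."""
    d = W.shape[0]
    M = np.linalg.inv(np.eye(d) - W)
    return M.T @ np.diag(noise_var) @ M

def transfer_noise(W_a, noise_var_a, noise_var_b):
    """Build S^t with the marginal variances of S^a and the noise variances of S^b.

    Assumes W_a[j, i] is the weight of edge j -> i, W_a is strictly upper
    triangular, and roots have equal noise variances in S^a and S^b.
    Returns the weights and the population covariance of S^t.
    """
    d = W_a.shape[0]
    target_var = np.diag(linear_scm_covariance(W_a, noise_var_a))
    W_t = np.zeros((d, d))
    cov_t = np.zeros((d, d))
    for i in range(d):
        w = W_a[:i, i]
        parent_var = w @ cov_t[:i, :i] @ w  # Var[w_a^T x_pa^t]
        if parent_var > 0:  # non-root: rescale the parent contribution
            scale = np.sqrt((target_var[i] - noise_var_b[i]) / parent_var)
            W_t[:i, i] = scale * w
        # Covariance of x_i^t with earlier variables, then its own variance.
        cov_t[:i, i] = cov_t[i, :i] = cov_t[:i, :i] @ W_t[:i, i]
        cov_t[i, i] = W_t[:i, i] @ cov_t[:i, :i] @ W_t[:i, i] + noise_var_b[i]
    return W_t, cov_t
```

The diagonal of the returned covariance matches the marginal variances of $S^{a}$, while the noise variances are those of $S^{b}$, so both conditions hold by construction.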

E.6 Compute Resources

Our experiments were run on an internal cluster. All experiments in this work were computed using CPUs with $3$ GB of memory per CPU, with the exception of the Avici runs on graphs with $100$ vertices, which used $12$ GB per CPU. The data generation takes less than a few minutes on a single CPU, with the exception of the sortability results (Section 5.1). For the sortability results, it takes around $30$ minutes to generate the datasets for a single graph specification across all weight supports and graph sizes. This is due to the larger number of configurations and repetitions than in the other experiments. For a single graph specification and across all weight supports and graph sizes, it takes around $6$ hours to compute the sortability statistics on a single CPU. Running one execution of Notears (Avici) takes approximately $2$ min ($1$ min) for $d=20$ and $30$ min ($2$ min) for $d=100$. The SortnRegress baselines run in less than $1$ min.

Appendix F Additional Experimental Results

F.1 Structure Learning

Figure 12 summarizes the structural Hamming distance (SHD) between the predicted and true graphs for the same datasets and algorithms as in Figure 5(a).

In Figures 13(a) and 13(b), we present the F1 scores and SHD attained by the structure learning algorithms on data of Linear iSCMs, SCMs, and standardized SCMs, across different weight distribution supports and graph sizes. We find that the difference in performance of Notears on data sampled from iSCM and standardized SCMs is larger for larger weight magnitudes and for bigger graphs. For smaller weights, the difference in the mean F1 score of Notears between the two standardization approaches is smaller, which is in line with our proposed explanation about the shifts of the implied noise variance distribution in Section 5.2.

In Figure 13(a), we also find that when weight magnitudes are below $1$ , $\operatorname{R^{2}}$ -SortnRegress performs similarly for both standardized SCMs and iSCMs. We also observe this for Avici. Meanwhile, for larger weights with support extending above $1$ , these algorithms achieve significantly higher F1 scores on standardized SCMs. This suggests that our condition of $\left\lvert w_{i,j}\right\rvert>1$ for all edges $(v_{i},v_{j})$ in the statement of Theorem 3, concerning the identifiability of linear standardized SCMs, may have a more fundamental practical significance, rather than being merely an artifact of the analysis.

F.2 $\operatorname{R^{2}}$ -Sortability

Figure 14 reports the $\operatorname{R^{2}}$ -sortability statistics across varying graph sizes and weight distributions, but this time for the denser graphs ER($d$, $4$) and SF($d$, $4$). We again observe $\operatorname{R^{2}}$ -sortability very close to $0.5$ for datasets sampled from iSCMs and high degrees of $\operatorname{R^{2}}$ -sortability for data drawn from standardized SCMs. We omit standard SCMs from the plots, as the datasets coming from SCMs and their standardized versions have the same $\operatorname{R^{2}}$ -sortability due to the scale-invariance of the $\operatorname{R^{2}}$ coefficient.

F.3 Covariance Matrices for Figure 1

Figure 15 visualizes the full mean absolute covariance (correlation) matrices of the systems presented in Figure 1. The matrix shows that the pattern of increasing mean absolute covariance in standardized SCMs is not only a feature of neighboring nodes, but it also occurs for vertex pairs further apart, though less strongly. This is not the case for iSCMs, where any two pairs of equally spaced vertices have equal covariances in expectation over the weight sampling distribution.
