Standardizing Structural Causal Models
Abstract
Synthetic datasets generated by structural causal models (SCMs) are commonly used for benchmarking causal structure learning algorithms. However, the variances and pairwise correlations in SCM data tend to increase along the causal ordering. Several popular algorithms exploit these artifacts, possibly leading to conclusions that do not generalize to real-world settings. Existing metrics like $\operatorname{Var}$-sortability and $R^2$-sortability quantify these patterns, but they do not provide tools to remedy them. To address this, we propose internally-standardized structural causal models (iSCMs), a modification of SCMs that introduces a standardization operation at each variable during the generative process. By construction, iSCMs are not $\operatorname{Var}$-sortable, and as we show experimentally, not $R^2$-sortable either for commonly-used graph families. Moreover, contrary to the post-hoc standardization of data generated by standard SCMs, we prove that linear iSCMs are less identifiable from prior knowledge on the weights and do not collapse to deterministic relationships in large systems, which may make iSCMs a useful model in causal inference beyond the benchmarking problem studied here.
1 Introduction
Predicting the effects of interventions and policy decisions requires reasoning about causality. Consequently, scientific fields ranging from biology and earth sciences to economics and statistics are interested in modeling causal structure (Pearl,, 2009; Maathuis et al.,, 2010; Imbens and Rubin,, 2015; Runge et al.,, 2019). A wide array of causal discovery algorithms has been proposed with the goal of inferring causal structure from data (e.g., Squires and Uhler,, 2022; Vowels et al.,, 2022). However, benchmarking these algorithms is challenging, since real-world datasets with an agreed-upon, ground-truth causal structure are rare (e.g., Sachs et al.,, 2005; see Mooij et al.,, 2020). The community predominantly relies on synthetic data for evaluating structure learning algorithms, where observations are generated according to a predetermined causal structure and system mechanisms. The inferred causal structures can then be directly compared to the ground truth. To generate synthetic data, it is common practice to sample from structural causal models with additive noise (SCMs) (Reisach et al.,, 2021). Unless stated otherwise, this work considers SCMs in which the variance scale of the additive noise is the same for all variables, a typical simplification made in benchmarking.
Under common benchmarking practices, synthetic datasets generated by SCMs contain patterns that are directly exploitable to make structure discovery easier. We will refer to such patterns as artifacts. In SCMs, the pairwise correlations between variables tend to increase along the causal ordering, since variance builds up downstream and, as a result, the proportion of the variance driven by the additive noise vanishes (Figure 1(a)). Reisach et al. (2024) characterize this phenomenon through an increase of the coefficients of determination ($R^2$) of the variables regressed on all others. Crucially, this artifact occurs both in the raw data and when shifting and scaling (standardizing) the variables to have zero mean and unit variance. One of the implications is that downstream causal dependencies in SCMs become effectively deterministic, especially in large-scale systems. As Reisach et al. (2024) demonstrate, simple causal discovery baselines can perform competitively on benchmarks of this kind by directly exploiting this phenomenon. This makes SCMs in their general definition possibly unsuitable for the purpose of benchmarking and, as we will argue, to some degree suboptimal for inferring causality more broadly. Ultimately, benchmarking on synthetic data with these patterns could lead to conclusions that do not generalize to real-world scenarios.
In this work, we propose a simple modification of SCMs that stabilizes the data-generating process and thereby removes exploitable covariance artifacts. Our models, denoted internally-standardized SCMs (iSCMs), introduce a standardization operation at each variable during the generative process (Figure 1(b)). In Section 4, we provide a theoretical motivation for this idea by studying linear iSCMs. We prove that, contrary to SCMs, the causal dependencies of iSCMs under mild assumptions never collapse to deterministic mechanisms as the graph size becomes large. Moreover, we formalize the correlation artifact commonly observed in benchmarks by proving that linear SCM structures in a Markov equivalence class (MEC) are partially identifiable for certain graph classes, given weak prior knowledge on the weight distribution of the ground-truth SCM. Most importantly, we show that this is not the case for the corresponding iSCMs. In Section 5, we empirically demonstrate that the baselines proposed in Reisach et al., (2021, 2024) are unable to exploit covariance artifacts in iSCMs, while practical classes of causal discovery algorithms are still able to learn causal structures in both linear and nonlinear systems. Our findings also reveal that SCM artifacts affect structure learning both positively and negatively, suggesting that generating (standardized) data from standard SCMs may be particularly ill-suited for benchmarking common approaches in use today.
2 Background and Related Work
We begin by introducing structural causal models and the problem of causal structure learning, before discussing how synthetic data is often generated for evaluating structure learning algorithms. We then review existing works that study identifiability and patterns frequently present in synthetic data.
Structural causal models
A structural causal model (SCM) (Peters et al., 2017) of $d$ variables $\mathbf{x} = \{x_1, \dots, x_d\}$ consists of a collection of structural assignments, each given by
$$x_i := f_i(\mathbf{x}_{\mathrm{pa}(i)}, \varepsilon_i), \tag{SCM}$$
where $\mathbf{x}_{\mathrm{pa}(i)} \subseteq \mathbf{x} \setminus \{x_i\}$ are called the parents of $x_i$. Here, $f_i$ are arbitrary functions, and $\varepsilon_i$ are independent random variables that model exogenous noise (or unexplained variation). Together, they entail a joint probability distribution $p(\mathbf{x})$ over the variables $\mathbf{x}$. It is common to consider SCMs with additive noise, e.g., with linear functions $f_i$, as given by
$$x_i := \mathbf{w}_i^\top \mathbf{x}_{\mathrm{pa}(i)} + \varepsilon_i. \tag{1}$$
The structural assignments in (SCM) induce a causal graph $G$ over the variables $\mathbf{x}$, which is assumed to be acyclic. Specifically, the directed acyclic graph (DAG) $G$ has vertices $v_i$ for every $x_i$ and a directed edge $v_j \to v_i$ if $x_j \in \mathbf{x}_{\mathrm{pa}(i)}$. We will explicitly distinguish this DAG and its vertices from the variables $\mathbf{x}$. The skeleton of $G$ denotes $G$ with all edges undirected. If the skeleton of $G$ is acyclic, we call $G$ a forest.
Structure learning and benchmarking
Given a set of i.i.d. observations from the probability distribution induced by an unknown SCM, causal structure learning aims to infer the causal graph underlying the SCM. In this work, we focus on structure learning from observational, not interventional, data and only consider SCMs with no latent confounders.
Because it is difficult to obtain the ground-truth graph $G$ for many real-world datasets, it is common to evaluate structure learning algorithms on synthetic data where $G$ is known. A ubiquitous approach is to sample a DAG $G$, then SCM functions $f_i$ defined over $G$, and finally a dataset from this SCM, with the goal of later recovering $G$ from the data. It is common to consider noise $\varepsilon_i$ with mean $0$ and fixed variance $\sigma^2$ (often $\sigma^2 = 1$), and for linear systems, to sample each weight $w_{i,j}$ uniformly and i.i.d. with support bounded away from $0$ (Shimizu et al., 2011; Peters and Bühlmann, 2014; Zheng et al., 2018; Yu et al., 2019; Lachapelle et al., 2020; Zheng et al., 2020; Ng et al., 2020; Reisach et al., 2021; Lorch et al., 2022; Reisach et al., 2024). There exist alternative benchmarking strategies that involve sampling data from domain-specific simulators (Schaffter et al., 2011; Dibaeinia and Sinha, 2020).
Data standardization and artifacts of SCMs
Previous work shows that generating data as described above can lead to strong artifacts. Reisach et al. (2021) observe that the variance of variables tends to increase along the topological ordering of $G$. This leads to the $\operatorname{Var}$-SortnRegress baseline, which sorts variables based on their empirical variance and then performs sparse regression to infer $G$. Seng et al. (2024) show that structure learning algorithms minimizing an MSE-based loss (e.g., Zheng et al., 2018) can identify $G$ under similar conditions. Therefore, Reisach et al. (2021) propose using standardization (Figure 1(a)) to remove this variance artifact from benchmarks. Specifically, they first sample all $x_i$ according to a standard SCM and then post-hoc transform the variables as
$$x^s_i := \frac{x_i - \mathbb{E}[x_i]}{\sqrt{\operatorname{Var}[x_i]}}, \tag{Standardized SCM}$$
such that our observations correspond to samples from $p(\mathbf{x}^s)$. Standardization, however, only removes the variance artifact. Even in standardized SCMs, the fraction of a variable's variance that is explained by all others, measured by the coefficient of determination $R^2$, tends to increase along the topological ordering (Reisach et al., 2024). $R^2$-SortnRegress exploits this correlation artifact analogously to $\operatorname{Var}$-SortnRegress. Existing heuristics aiming to avoid the increasing correlations adjust the sampling process of the weights, but they ultimately limit the causal dependencies that can be modeled, e.g., to certain levels of correlations among the observed variables (Mooij et al., 2020) or a constant proportion of variance explained by the parents (Squires et al., 2022) (Appendix D.1). To our knowledge, there are currently no general methods for generating SCM data without strong correlation artifacts or significant limitations on the mechanisms and noise.
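The generation and post-hoc standardization steps described above can be sketched in a few lines. This is a minimal illustration, not the benchmark code of the cited works: the helper names `sample_linear_scm` and `standardize`, the chain graph, the weight `1.5`, and the sample size are all assumptions chosen for the example, and the empirical mean and variance stand in for the population statistics.

```python
import numpy as np

def sample_linear_scm(W, n, rng, noise_std=1.0):
    """Sample n observations from a linear additive-noise SCM.

    W[j, i] is the weight of the edge j -> i. W is assumed to be strictly
    upper triangular, so the indices 0..d-1 form a topological ordering.
    """
    d = W.shape[0]
    X = np.zeros((n, d))
    for i in range(d):
        eps = rng.normal(0.0, noise_std, size=n)
        X[:, i] = X @ W[:, i] + eps  # only parents (indices < i) contribute
    return X

def standardize(X):
    """Post-hoc standardization: zero mean and unit variance per variable."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

rng = np.random.default_rng(0)
d = 10
W = np.zeros((d, d))
W[np.arange(d - 1), np.arange(1, d)] = 1.5  # chain 0 -> 1 -> ... -> d-1

X = sample_linear_scm(W, n=20000, rng=rng)
Xs = standardize(X)

# Correlation between each variable and its child, before/after standardizing
# the value is the same because correlation is scale-invariant.
adjacent_corr = [np.corrcoef(Xs[:, i], Xs[:, i + 1])[0, 1] for i in range(d - 1)]
print(np.round(X.var(axis=0), 1))
print(np.round(adjacent_corr, 3))
```

In this sketch the raw variances grow along the chain, and the adjacent correlations of the standardized data still increase towards one, which is the pattern the variance and correlation artifacts refer to.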
Identifiability
Given a class of SCMs, there may be several SCMs with different causal graphs that entail the same distribution (Peters et al.,, 2017). Thus, even with infinite observations from , we may be unable to identify the causal graph that generated the observations. However, some identifiability results are known depending on the class of functions and noise distributions of the SCMs considered. For example, among all linear SCMs (1) with Gaussian noise , the graph can only be uniquely identified up to its MEC (Verma and Pearl,, 2013). However, if the noise scales are equal (Peters and Bühlmann,, 2014) or the noise is non-Gaussian (Shimizu et al.,, 2006), can be uniquely identified given .
It is fundamental to recognize that existing identifiability results only concern the unstandardized distributions of SCMs. When we standardize the data and observe instead, existing results no longer apply, because the implied SCM after standardization may violate the properties of the original SCM (e.g., its noise variances). In this work, we present, to our knowledge, the first (partial) identifiability result for standardized SCMs. Our result concerns a setting with prior knowledge on the magnitudes of w in Equation 1, an assumption underlying common benchmarking practices.
3 SCMs with Internal Standardization
3.1 Definition
We propose internally-standardized SCMs (iSCMs) as a modification to the standard data-generating process of SCMs. An iSCM consists of $d$ pairs of assignments, where for each $i \in \{1, \dots, d\}$,
$$x_i := f_i(\tilde{\mathbf{x}}_{\mathrm{pa}(i)}, \varepsilon_i) \quad \text{and} \quad \tilde{x}_i := \frac{x_i - \mathbb{E}[x_i]}{\sqrt{\operatorname{Var}[x_i]}}, \tag{iSCM}$$
with parents $\tilde{\mathbf{x}}_{\mathrm{pa}(i)}$ of $x_i$ in the underlying DAG $G$ and jointly-independent exogenous noise variables $\varepsilon_i$. The variables $\mathbf{x}$ are latent, and the variables $\tilde{\mathbf{x}}$ are observed. Figure 2 illustrates the generative process. Algorithm 1 summarizes how to sample from (iSCM). If computing the population expectations and variances of $x_i$ is intractable, the empirical statistics obtained from samples can be used for standardization at each loop iteration of Algorithm 1.
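A minimal sketch of this sampling loop for the linear case is given below. It is not a transcription of Algorithm 1: the function name `sample_linear_iscm`, the chain graph, and the weight `1.5` are assumptions for illustration, and the standardization uses empirical statistics of the sampled batch, as described above for the intractable case.

```python
import numpy as np

def sample_linear_iscm(W, n, rng, noise_std=1.0):
    """Sample n observations from a linear iSCM.

    W[j, i] is the weight of the edge j -> i, with W strictly upper
    triangular (indices are a topological ordering). Each latent variable
    is computed from the already standardized parents and then
    standardized itself using empirical statistics.
    """
    d = W.shape[0]
    X_obs = np.zeros((n, d))
    for i in range(d):
        eps = rng.normal(0.0, noise_std, size=n)
        latent = X_obs @ W[:, i] + eps                      # mechanism on standardized parents
        X_obs[:, i] = (latent - latent.mean()) / latent.std()  # internal standardization
    return X_obs

rng = np.random.default_rng(0)
d = 10
W = np.zeros((d, d))
W[np.arange(d - 1), np.arange(1, d)] = 1.5  # chain 0 -> 1 -> ... -> d-1

X_obs = sample_linear_iscm(W, n=20000, rng=rng)
adjacent_corr = [np.corrcoef(X_obs[:, i], X_obs[:, i + 1])[0, 1] for i in range(d - 1)]
print(np.round(adjacent_corr, 3))
```

For this chain, every adjacent correlation is close to $1.5 / \sqrt{1.5^2 + 1} \approx 0.83$, independent of the position in the ordering.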
Motivation
By construction, iSCMs model observed variables with zero mean and unit marginal variance. Through the standardization operation, iSCMs avoid the downstream accumulation of variance that can occur in standard SCMs (see Figure 1). Because each latent variable depends only on the standardized versions of its parents, the relative scales of the noise distribution and the causal mechanisms are the same everywhere in the system and do not change, for example, downstream in the causal ordering. The causal mechanisms of iSCMs are thus scale-free, in that the local interaction of mechanism and noise occurs at a scale independent of the position of the variable in the global ordering. This property makes iSCMs particularly useful for benchmarking, where random ground-truth models are commonly generated from a fixed distribution over functions and noise. Contrary to existing heuristics (Section 2), iSCMs model arbitrarily strong or weak causal dependencies and levels of cause-explained variance.
Interventions
Analogous to standard SCMs, interventions in iSCMs can be defined as modifications of the structural assignments in (iSCM) (Figure 2), while keeping the standardization operation based on the observational distribution. When the population statistics for standardization are intractable, we first sample observational data to obtain empirical statistics. Since we do not study interventions in this work, we defer a further discussion of interventions in iSCMs to Appendix B.
Units
When modeling a physical system, the functional mechanisms in standard SCMs have to account for the difference in units between the variables for the model to be unit-covariant (see Villar et al.,, 2023). A side-effect of internal standardization is that variables of iSCMs become unit-less, so iSCMs obey the passive symmetry of unit covariance by construction. Therefore, iSCMs naturally model both unit-less quantities and variables measured in different units, which can make them useful beyond benchmarking. Learned iSCMs would be invariant to the units chosen by the experimenter, similar to the physical world being independent of the mathematical models chosen to describe it.
3.2 Implied SCMs
It is natural to investigate whether SCMs can generate the same observations as standardized SCMs or iSCMs, given the same causal graph $G$ and exogenous variables $\varepsilon_i$. In other words, can standardized SCMs and iSCMs be written as SCMs? For both models, the answer is yes. Specifically, we can express the generative process of $x^s_i$ in (Standardized SCM) and $\tilde{x}_i$ in (iSCM) as
$$x^s_i = f^s_i(\mathbf{x}^s_{\mathrm{pa}(i)}) + \theta^s_i \varepsilon_i \quad \text{and} \quad \tilde{x}_i = \tilde{f}_i(\tilde{\mathbf{x}}_{\mathrm{pa}(i)}) + \tilde{\theta}_i \varepsilon_i, \tag{2}$$
respectively, by moving the standardization operations into the causal mechanisms of the observables but leaving the DAG $G$ and the noise variables $\varepsilon_i$ unchanged. Appendix A describes how to construct these implied causal mechanisms $f^s_i$ and $\tilde{f}_i$ and implied noise scales $\theta^s_i$ and $\tilde{\theta}_i$. We refer to the above SCM form of a standardized SCM or an iSCM with additive noise as their implied (SCM) model. Correspondingly, the implied SCMs have zero mean and unit variance. The notion of implied SCMs is powerful, because it enables us to analyze standardized SCMs and iSCMs as SCMs, and it sheds light on the performance of structure learning algorithms that assume unstandardized SCMs to underlie the generative process of the data (e.g., Shimizu et al., 2011; Zheng et al., 2018; Yu et al., 2019; Lachapelle et al., 2020; Zheng et al., 2020).
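For the linear case, the implied model of an iSCM can be written out directly. The following is a sketch under stated assumptions rather than the construction of Appendix A: $x_i$ denotes the latent and $\tilde{x}_i$ the observed variable of node $i$, the noise has zero mean and variance $\sigma^2$, and the symbol $s_i$ for the standard deviation of $x_i$ is introduced here only for illustration.

```latex
% Sketch: implied SCM of a linear iSCM with zero-mean noise.
% s_i is the standard deviation of the latent variable x_i (assumed notation).
\begin{align*}
x_i &= \mathbf{w}_i^\top \tilde{\mathbf{x}}_{\mathrm{pa}(i)} + \varepsilon_i,
\qquad
s_i^2 = \operatorname{Var}[x_i]
      = \operatorname{Var}\big[\mathbf{w}_i^\top \tilde{\mathbf{x}}_{\mathrm{pa}(i)}\big] + \sigma^2, \\
\tilde{x}_i &= \frac{x_i - \mathbb{E}[x_i]}{s_i}
      = \Big(\frac{\mathbf{w}_i}{s_i}\Big)^{\!\top} \tilde{\mathbf{x}}_{\mathrm{pa}(i)}
        + \frac{1}{s_i}\,\varepsilon_i .
\end{align*}
```

Here $\mathbb{E}[x_i] = 0$ because the standardized parents and the noise have zero mean, and the variance splits into two terms because $\varepsilon_i$ is independent of the parents. Under these assumptions, the implied model is again linear, with weights $\mathbf{w}_i / s_i$ and noise scale $1 / s_i$; for a root node, $s_i = \sigma$.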
To provide a first characterization of standardized SCMs and iSCMs, our theoretical analyses focus on systems where the $f_i$ are linear functions with additive, zero-mean noise as given by Equation (1). As a stepping stone for this analysis, we derive an analytical expression for the covariance of linear SCMs, whose variables have unit variance by construction, without any form of standardization:
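Lemma 1 (Covariance in linear SCMs with unit marginal variances).
Let $\mathbf{x}$ be modeled by a linear SCM defined by (1) with DAG $G$ that satisfies $\operatorname{Var}[x_i] = 1$ for all $i \in \{1, \dots, d\}$. Then, the covariance $\operatorname{Cov}[x_i, x_j]$ is the sum of products of the weights along all unblocked paths between the nodes of $x_i$ and $x_j$ in $G$. Specifically, for any $i, j$ such that $i \neq j$, it holds that
$$\operatorname{Cov}[x_i, x_j] = \sum_{p \in P_{i \leftrightarrow j}} \prod_{(k, l) \in p} w_{k,l}, \tag{3}$$
where $P_{i \leftrightarrow j}$ are all unblocked paths from $i$ to $j$ in $G$, and $(k, l) \in p$ indicates that the directed edge $k \to l$ is part of the path $p$.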
We give a proof in Appendix C.2. Since the implied SCMs of linear standardized SCMs and iSCMs are linear SCMs, the setting of Lemma 1 applies precisely to the SCM forms of both models. Thus, Lemma 1 enables us to study the covariances in standardized SCMs and iSCMs, and as we show next, derive conditions for the (non)identifiability of their DAGs from the observational distribution.
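The path-product statement of Lemma 1 can be checked numerically on a small example. The sketch below is an illustration only: the tree `0 -> 1 -> 2`, `0 -> 3` and its weights are assumptions chosen for the example, and the noise variances are set to `1 - w**2` so that every variable has unit marginal variance. The population covariance is computed in closed form from the linear system instead of from samples.

```python
import numpy as np

d = 4
a, b, c = 0.8, 0.5, -0.6          # illustrative edge weights
W = np.zeros((d, d))              # W[j, i] is the weight of the edge j -> i
W[0, 1], W[1, 2], W[0, 3] = a, b, c

# Noise variances chosen so that Var[x_i] = 1 for every node
# (each non-root node has a single parent with unit variance).
noise_var = np.array([1.0, 1 - a**2, 1 - b**2, 1 - c**2])

# x = W^T x + eps  =>  x = A eps  with  A = (I - W^T)^{-1}
A = np.linalg.inv(np.eye(d) - W.T)
Sigma = A @ np.diag(noise_var) @ A.T

# The only unblocked path between nodes 2 and 3 is 2 <- 1 <- 0 -> 3,
# so the covariance should equal the product of the weights along it.
print(Sigma[2, 3], a * b * c)
# Similarly, 0 -> 1 -> 2 gives Cov[x_0, x_2] = a * b.
print(Sigma[0, 2], a * b)
```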
4 Analysis
In this section, we give two theoretical results that support the suitability of iSCMs over standard SCMs for causal discovery benchmarking. First, we prove the general case of Figure 1. Contrary to standardized SCMs, iSCMs do not degenerate towards deterministic implied SCM mechanisms in deep graphs. Moreover, we prove that the DAGs of linear iSCMs cannot be identified beyond their MEC, assuming the DAG is a forest, even if the support of the weights is known. Crucially, we also show that this is not generally true for standardized SCMs. This suggests that algorithms can less easily game benchmarks based on linear iSCMs when knowing the data-generating process. For all results, we consider linear SCMs (1) with zero-mean additive noise and equal noise variances. All results are at the population level, so we assume that the observational distribution of the standardized SCM or the iSCM is known. Proofs are given in Appendix C.
4.1 Behavior with Increasing Graph Depth
Standardized SCMs tend towards increasing correlations between adjacent nodes down the topological ordering. This correlation artifact makes standardized SCMs problematic for benchmarking, because it may not be a property we expect to underlie real data. Reisach et al. (2024) show, under some assumptions on the weights, that the dependencies in standardized SCMs become deterministic with increasing graph depth. This implies that any exogenous variation vanishes lower down in the system. Unless prior domain knowledge leads us to assume this holds in applications of interest, it may not be desirable to implicitly bias structure learning benchmarks towards such systems. For example, if the causal ordering represents time (Pamfil et al., 2020), the mechanisms of standardized SCMs are unable to model or characterize time-invariant or stable processes. Moreover, if we expect causal mechanisms to be independent (Schölkopf, 2022), the qualitative behavior of a causal mechanism should not provide information about its position in the topological ordering relative to other mechanisms, as it would in SCMs. Reisach et al. (2024) show that baselines like $R^2$-SortnRegress can perform competitively on benchmarks by exploiting this artifact (Section 2).
iSCMs do not tend towards determinism with increasing graph depth (Figure 1(b)). In standardized SCMs, the correlations increase downstream, because the marginal variances of the underlying SCM increase with node depth, while the variance scale is fixed (Reisach et al.,, 2021). Thus, for large , the variance scale of becomes large relative to the scale of , and the correlation of and tends towards . Since and are just standardized versions of these variables, they maintain the same correlation. iSCMs avoid this by standardizing internally, which scales the variance of any parents in a mechanism to , modulating the relative variance of and . In the following, we formalize this result for general graphs by bounding the fraction of cause explained variance (CEV). The fraction of CEV for is the proportion of explained by its causal parents and given by
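$$\operatorname{CEV}_f[x_i] := 1 - \frac{\operatorname{Var}\big[x_i - \mathbb{E}[x_i \mid \mathbf{x}_{\mathrm{pa}(i)}]\big]}{\operatorname{Var}[x_i]}. \tag{4}$$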
The following result shows that we can bound the fraction of CEV for any variable in a linear iSCM:
Theorem 2 (Bound on in linear iSCMs).
Let be modeled by a linear iSCM (1) with DAG and additive noise of equal variances . Suppose any node in has at most parents and . Then, for any , the fraction of CEV for is bounded as
Since the fraction of CEV is bounded, iSCMs are guaranteed not to collapse to determinism in large systems, alleviating several of the concerns with (standardized) SCMs discussed above.
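The contrast between the two models can be made concrete for a linear chain, where the fraction of CEV has a simple closed form. The sketch below is an illustration under assumptions, not part of the theorem: the function names, the single shared weight `w`, and unit noise variance are chosen for the example. It uses the fact that, for additive independent noise, the fraction of CEV equals one minus the ratio of noise variance to the variance of the variable, and that this ratio is unchanged by standardizing the variable.

```python
import numpy as np

def cev_chain_scm(w, depth, noise_var=1.0):
    """Fraction of CEV per node in a chain SCM x_i := w * x_{i-1} + eps_i."""
    var = noise_var          # variance of the root
    cev = [0.0]              # the root has no parents
    for _ in range(1, depth):
        var = w**2 * var + noise_var     # variance accumulates downstream
        cev.append(1.0 - noise_var / var)
    return np.array(cev)

def cev_chain_iscm(w, depth, noise_var=1.0):
    """Fraction of CEV per node in a chain iSCM with the same weight."""
    cev = [0.0]
    for _ in range(1, depth):
        latent_var = w**2 * 1.0 + noise_var  # the parent always has unit variance
        cev.append(w**2 / latent_var)
    return np.array(cev)

cev_scm = cev_chain_scm(w=1.5, depth=10)
cev_iscm = cev_chain_iscm(w=1.5, depth=10)
print(np.round(cev_scm, 4))   # approaches 1 with depth
print(np.round(cev_iscm, 4))  # constant w^2 / (w^2 + 1) for all non-root nodes
```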
4.2 Identifiability
![Refer to caption](x1.png)
Figure 1(a) illustrates that the pairwise correlations in SCMs over chain graphs depend on the position in the topological ordering. This can allow algorithms like $R^2$-SortnRegress to infer the graph. By contrast, Figure 1(b) shows that iSCMs do not exhibit this pattern, with correlations between variables not increasing the identifiability of any part of the system.
In the following, we formalize this phenomenon for forests, that is, all DAGs with acyclic skeletons (Section 2). Specifically, we prove two results concerning the identifiability of the DAG from the observational distribution, for standardized SCMs and iSCMs. This makes our finding the first identifiability result for standardized SCMs. While not every DAG is a forest, DAGs have forests as subgraphs and resemble forests as sparsity increases, thus providing us with intuition for generally sparse systems (e.g., Alon and Spencer,, 2016, Chapter 11).
Our first result leverages the observation that, for standardized SCMs, many DAGs in an MEC are infeasible given when their edge directions are not consistent with the direction of increasing absolute covariance. To illustrate this idea, suppose our goal is to distinguish between the DAGs in the MEC in Figure 3(a). We overload notation and denote the weights of the edges and regardless of orientation. For standardized SCMs, we can apply Lemma 1 to the implied SCM of graph to obtain the covariances
See Appendix C.4.1. Together, both expressions imply that standardized SCMs with DAG satisfy
(5) |
If , then the right-hand side of Equation 5 is always true. In this case, the absolute covariance increases from to in all standardized SCMs with DAG . By symmetry, the covariance in SCMs with DAG increases from to when . Therefore, if both weights are greater than , the absolute covariance increases downstream in all SCMs of and . This implies that, among and , only the DAG whose edges align with the covariance ordering in can induce . Irrespectively, the DAG remains plausible. We can extend the intuition of this 3-variable example to identify almost all edges in any forest MEC:
Theorem 3 (Partial identifiability of standardized linear SCMs with forest DAGs).
Let be modeled by a standardized linear SCM (1) with forest DAG , additive noise of equal variances , and for all . Then, given and the partially directed graph representing the MEC of , we can identify all but at most one edge of the true DAG in each undirected connected component of the MEC .
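The 3-variable example preceding Theorem 3 can be worked out by hand. The sketch below assumes the chain $x_1 \to x_2 \to x_3$ with weights $a$ and $b$ (names chosen here for illustration) and equal noise variances $\sigma^2$. It computes the covariances of the standardized variables directly from the variances of the unstandardized SCM rather than through Lemma 1.

```latex
% Sketch: chain x_1 -> x_2 -> x_3 with weights a, b (assumed names)
% and equal noise variances sigma^2; x^s denotes standardized variables.
\begin{align*}
\operatorname{Var}[x_1] &= \sigma^2, \qquad
\operatorname{Var}[x_2] = (a^2 + 1)\,\sigma^2, \qquad
\operatorname{Var}[x_3] = \big(b^2 (a^2 + 1) + 1\big)\,\sigma^2, \\
\operatorname{Cov}[x^s_1, x^s_2]
  &= \frac{a \operatorname{Var}[x_1]}{\sqrt{\operatorname{Var}[x_1]\operatorname{Var}[x_2]}}
   = \frac{a}{\sqrt{a^2 + 1}}, \\
\operatorname{Cov}[x^s_2, x^s_3]
  &= \frac{b \operatorname{Var}[x_2]}{\sqrt{\operatorname{Var}[x_2]\operatorname{Var}[x_3]}}
   = \frac{b \sqrt{a^2 + 1}}{\sqrt{b^2 (a^2 + 1) + 1}}, \\
\big|\operatorname{Cov}[x^s_1, x^s_2]\big| < \big|\operatorname{Cov}[x^s_2, x^s_3]\big|
  &\iff \frac{a^2}{a^2 + 1} < \frac{b^2 (a^2 + 1)}{b^2 (a^2 + 1) + 1}
   \iff \frac{a^2}{a^2 + 1} < b^2 .
\end{align*}
```

Since $a^2 / (a^2 + 1) < 1$, the last condition holds for every $a$ whenever $|b| \geq 1$, so in that regime the absolute covariance grows downstream regardless of the first weight.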
Our proof of Theorem 3 considers each undirected component separately from the rest of the MEC . Hence, the identifiability result extends to undirected tree components of arbitrary, non-forest MECs as well. Theorem 3 shows that, when using standardized SCM data for benchmarking, algorithms can use pairwise correlations to orient additional edges correctly. The weights assumption of Theorem 3 is relevant to causal discovery benchmarking, because weights are often sampled i.i.d. from intervals bounded away from (Section 2). Hence, empirical evaluations may render standardized linear SCMs identifiable only through the design of their weights distribution. In the following, we show that, under similar conditions, iSCMs are more difficult to identify from their MEC. In the 3-variable example above, we can show that the observational distribution of iSCMs is the same for all DAGs , , and when the weights and are shared over the corresponding edges in the MEC (Figure 3(b); see Appendix C.4). This result generalizes to forests:
Theorem 4 (Nonidentifiability of linear Gaussian iSCMs with forest DAGs).
Let be modeled by a linear iSCM (1) with forest DAG and additive Gaussian noise of equal variances . Then, for every DAG in the MEC of , there exists a linear iSCM with DAG that has the same observational distribution as , the same noise variances, and the same weights on the corresponding edges in the MEC.
Our proof consists of showing that the covariance matrices of these systems are equal. For linear Gaussian iSCMs, this then implies that their observational distributions are identical. Theorem 4 thus shows that additional knowledge of the weight distribution in a benchmark does not allow identifying any additional edges beyond the MEC. By contrast, Theorem 3 shows that, for standardized SCMs, lower-bounding the weight magnitudes is sufficient for identifying most of the graph from its MEC. Without standardization, the DAG is fully identified from its observational distribution under even weaker assumptions (Peters and Bühlmann, 2014). Importantly, Theorem 4 does not generalize to arbitrary graphs beyond forests. Appendix C.4 provides a counterexample involving a 3-node skeleton with a cycle. As we study in the next section, this implies that nontrivial causal structure can still be learned from iSCM data. However, DAGs in benchmarks are often sparse, so we expect the implications of our identifiability results to capture relevant parts of empirical phenomena in benchmarking settings.
5 Experimental Results
Our previous analyses suggest that iSCMs address shortcomings of naive standardization, in particular, when sampling each mechanism and each noise variable from the same distribution, as commonly done in benchmarking. In this section, we provide evidence that iSCMs do not contain the covariance artifacts of SCMs. Moreover, we benchmark the SortnRegress baselines (Section 2) and two representative structure learning algorithms to gain insights into how their performance varies when benchmarked on standardized SCMs and iSCMs. Appendix E provides all details of the experimental setup.
5.1 -Sortability
Reisach et al. (2024) introduce the $R^2$-sortability metric to evaluate the correlation artifact underlying a dataset. $R^2$-sortability measures the correlation between the variables' causal ordering and the $R^2$ coefficients obtained from regressing each variable onto all others (Appendix D.2). The metric gives rise to the $R^2$-SortnRegress baseline described in Section 2. Reisach et al. (2024) show that $R^2$-sortability in SCMs is driven by an interplay of graph connectivity and the weight distribution.
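The per-variable $R^2$ values that this metric builds on can be computed from a covariance matrix alone. The sketch below is a hedged illustration and does not implement the $R^2$-sortability metric itself (its definition is in Appendix D.2): the helper names, the chain graph, and the weight `1.5` are assumptions. It relies on the standard identity that the $R^2$ of variable $i$ regressed on all others equals $1 - 1 / (\Sigma_{ii} \Theta_{ii})$, where $\Theta = \Sigma^{-1}$, and on the fact that $R^2$ is invariant to rescaling, so the unstandardized and standardized SCM give the same values. The iSCM is represented through a linear implied model with weights and noise variances rescaled by the latent variance.

```python
import numpy as np

def chain_covariance(weights, noise_vars):
    """Population covariance of a chain x_i := weights[i-1] * x_{i-1} + eps_i."""
    d = len(noise_vars)
    W = np.zeros((d, d))
    W[np.arange(d - 1), np.arange(1, d)] = weights
    A = np.linalg.inv(np.eye(d) - W.T)   # x = A @ eps
    return A @ np.diag(noise_vars) @ A.T

def r2_all_others(Sigma):
    """R^2 of each variable regressed on all other variables."""
    precision = np.linalg.inv(Sigma)
    return 1.0 - 1.0 / (np.diag(Sigma) * np.diag(precision))

d, w = 8, 1.5

# Chain SCM with unit noise variances (R^2 is unchanged by standardization).
Sigma_scm = chain_covariance(np.full(d - 1, w), np.ones(d))

# Implied model of the chain iSCM: for non-root nodes the latent variance is
# s2 = w^2 + 1, giving implied weight w / sqrt(s2) and noise variance 1 / s2.
s2 = w**2 + 1.0
Sigma_iscm = chain_covariance(
    np.full(d - 1, w / np.sqrt(s2)),
    np.concatenate(([1.0], np.full(d - 1, 1.0 / s2))),
)

r2_scm = r2_all_others(Sigma_scm)
r2_iscm = r2_all_others(Sigma_iscm)
print(np.round(r2_scm, 4))   # interior nodes: increasing along the chain
print(np.round(r2_iscm, 4))  # interior nodes: constant
```

In this example, the $R^2$ values of the interior nodes carry information about the causal ordering in the SCM but not in the iSCM; the first and last nodes differ only because they have a single neighbor.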
![Refer to caption](x2.png)
![Refer to caption](x3.png)
![Refer to caption](x4.png)
Figure 4 summarizes the -sortability statistics for linear SCM and iSCM data. We write and to denote Erdős-Rényi and scale-free graphs of size and (expected) degree , respectively. We find that iSCMs generate datasets that are not -sortable (-sortability ) and thus artifact-free while sampling over common graph structures (e.g., Zheng et al.,, 2018; Yu et al.,, 2019; Reisach et al.,, 2021). Conversely, standardized SCMs generate datasets that are strongly -sortable (). Since -sortability can be exploited for causal discovery, iSCM data serves as a test for evaluating whether algorithms utilize any data properties beyond the association between and the causal ordering in SCMs. Our results do not exclude the possibility of iSCM configurations that still produce -sortable datasets. However, we show that, for commonly-used , , and , iSCM datasets are not -sortable with high probability. Appendix F provides results for denser graphs.
5.2 Structure Learning
Under the same weight and noise distributions, standardized SCMs and iSCMs have different implied SCMs and generate qualitatively different datasets. Here, we study how this affects causal structure learning in practice. We evaluate $\operatorname{Var}$- and $R^2$-SortnRegress (SR) and a baseline using random orderings (Reisach et al., 2021, 2024). In addition, we evaluate representative algorithms from two orthogonal approaches to learning structure from (co)variance information. Notears by Zheng et al. (2018) leverages continuous optimization to minimize an MSE loss, which is affected by noise scaling (Loh and Bühlmann, 2014; Seng et al., 2024). Avici by Lorch et al. (2022) predicts graphs using a model pretrained on simulated data and is thus optimized to exploit any artifacts that improve predictive accuracy. To investigate its susceptibility to artifacts, we evaluate the public model checkpoints trained on standardized SCMs.
Figure 5(a) summarizes the results for linear and nonlinear systems. Here, the nonlinear mechanisms are samples from a Gaussian process with squared exponential kernel. As expected, $\operatorname{Var}$-SortnRegress performs best when SCMs are not standardized. Likewise, $R^2$-SortnRegress performs better on SCMs and standardized SCMs, as iSCMs have $R^2$-sortability close to $0.5$ (Section 5.1). Avici shows the same trend, suggesting it may indeed be exploiting the correlation artifacts present in its training distribution. Like Reisach et al. (2021), we find that Notears performs best on unstandardized data. However, and more interestingly, Notears also performs better on iSCMs than on standardized SCMs, especially in linear and larger systems. As we investigate next, this gap may be explained by the fact that the implied models of standardized SCMs violate the assumptions of Notears more strongly than iSCMs. Overall, performance differences are more pronounced for linear systems, where the downstream variance accumulation in SCMs is unbounded. Appendix F reports the results in terms of structural Hamming distance (SHD) and different weight ranges.
![Refer to caption](x5.png)
![Refer to caption](x10.png)
![Refer to caption](x11.png)
![Refer to caption](x12.png)
Properties of the implied SCM
When standardizing SCM data, the implied SCM corresponds to the SCM that could have generated the observations. Therefore, algorithms assuming that unstandardized SCMs generated the data will be susceptible to any assumption violations of the implied SCM, such as assumptions about the exogenous noise. Figure 5(b) (bottom) shows the distribution of inverse implied noise scales for the variables of the implied models (see Equation 2). Since in our experiments, these inverse squared noise scales are equal to the inverse variances of the full additive noise terms. We find that standardized SCMs induce inverse noise scales that are orders of magnitude greater than those of iSCMs. This distribution is essentially the footprint of the determinism in the depth limit discussed in Section 4.1. The modes at and at in the iSCM plot correspond to root and non-root nodes, respectively.
Figure 5(b) (top) shows the performance of Notears when isolating the noise properties of the implied models from the fact that standardized SCMs and iSCMs are not -sortable. For this, we construct SCMs that have the marginal variances (and -sortability, here on average) of unstandardized SCMs but the noise variances of the implied models by correcting their weights (see Appendix E.5). Notears performs better in such systems, suggesting that (i) the noise statistics may indeed explain the performance difference on iSCM data, and (ii) -sortability may not be the only reason why Notears performs significantly worse on standardized data (Reisach et al.,, 2021). This sheds light on existing benchmarking results, where MSE-based algorithms perform below expectations despite perhaps not intending to evaluate the algorithms under model mismatch (e.g., Reisach et al.,, 2021; Kaiser and Sipos,, 2021). For the MSE loss, Loh and Bühlmann, (2014) and Seng et al., (2024) show that smaller ratios of noise variances increase the magnitude of weights required for the true DAG to be the unique minimizer. The MSE loss ultimately does not account for the inverse variance factor in the Gaussian noise likelihood. Overall, the statistics of the implied models of standardized SCMs are empirically further from SCMs with equal noise variances than their iSCM counterparts.
6 Conclusions
We describe the iSCM, a one-line modification of the SCM that modulates the scale of interaction between the causal mechanism and noise at each variable. Through several theoretical and experimental results, we study its properties in relation to standard SCMs as well as the models they imply after standardization. To conclude, we highlight the following key takeaways:
Standardizing during the generative process removes sortability artifacts.
When the functions and the noise are, for example, sampled i.i.d. for each variable, SCMs exhibit artifacts that are not removed when shifting and scaling the generated data. Our results in Section 5 show that iSCMs are effective at removing $\operatorname{Var}$- and $R^2$-sortability. This makes iSCMs a useful complement to structure learning benchmarks with SCMs, enabling a specific evaluation of the ability of algorithms to transfer to real-world settings that do not exhibit artifacts. Despite the removed sortability artifacts, causal discovery algorithms are able to infer nontrivial structure from iSCM data (Figure 5).
Standardizing post-hoc can lead to partial identifiability and degenerate implied SCMs.
Scaling the units of SCM data is not innocuous. Theorem 3 shows that mild knowledge on the distribution of the weights can identify edges in standardized SCMs that are typically not identifiable from observational data. To our knowledge, our result is the first concerning identifiability from the standardized observational distribution of SCMs. This may make benchmarks, where similar assumptions on the weights often hold, trivial under standardized SCMs. Moreover, Figure 5(b) shows that standard SCMs can collapse to modeling near-zero exogenous noise. Theorems 2 and 4 demonstrate that neither property appears in the analogous iSCMs. Ultimately, (non)identifiability may be either a feature or a bug, depending on whether assumptions are verifiable in practice or a priori known during evaluation.
iSCMs are stable and scale-free, making them useful beyond benchmarking.
Beyond data generation, the stable generative process of iSCMs can make them useful for modeling, e.g., large or temporal systems (e.g., Kilian, 2013; Pamfil et al., 2020). In iSCMs, the scale of a causal mechanism and its unexplained variation are both unit-less and independent from its position in the causal ordering (Section 3). Since each iSCM implies a standard SCM, iSCMs can be viewed as a reparameterization of SCMs that enables modeling and learning the functions on the same scale, e.g., under a shared prior or level of regularization. Conceptually, iSCMs are related to batch normalization (Ioffe and Szegedy, 2015), a technique used to stabilize the optimization of neural networks, which compose sequences of functions like SCMs, by adding internal standardization. Overall, these properties may make the iSCM a useful structural equation model beyond the benchmarking problem studied here.
Acknowledgments and Disclosure of Funding
This research was supported by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program grant agreement no. 815943 and the Swiss National Science Foundation under NCCR Automation, grant agreement 51NF40 180545. This work was also supported by the German Federal Ministry of Education and Research (BMBF): Tübingen AI Center, FKZ: 01IS18039B, and by the Machine Learning Cluster of Excellence, EXC number 2064/1, project number 390727645.
References
- Alon and Spencer, (2016) Alon, N. and Spencer, J. H. (2016). The probabilistic method. John Wiley & Sons.
- Andersson et al., (1997) Andersson, S. A., Madigan, D., and Perlman, M. D. (1997). A characterization of Markov equivalence classes for acyclic digraphs. The Annals of Statistics, 25(2):505–541.
- Barabási and Albert, (1999) Barabási, A.-L. and Albert, R. (1999). Emergence of scaling in random networks. Science, 286(5439):509–512.
- Dibaeinia and Sinha, (2020) Dibaeinia, P. and Sinha, S. (2020). SERGIO: a single-cell expression simulator guided by gene regulatory networks. Cell systems, 11(3):252–271.
- Erdős and Rényi, (1959) Erdős, P. and Rényi, A. (1959). On random graphs. Publicationes Mathematicae, 6:290–297.
- Imbens and Rubin, (2015) Imbens, G. W. and Rubin, D. B. (2015). Causal inference in statistics, social, and biomedical sciences. Cambridge University Press.
- Ioffe and Szegedy, (2015) Ioffe, S. and Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456. PMLR.
- Kaiser and Sipos, (2021) Kaiser, M. and Sipos, M. (2021). Unsuitability of NOTEARS for causal graph discovery. arXiv preprint arXiv:2104.05441.
- Kilian, (2013) Kilian, L. (2013). Structural vector autoregressions. In Handbook of research methods and applications in empirical macroeconomics, pages 515–554. Edward Elgar Publishing.
- Lachapelle et al., (2020) Lachapelle, S., Brouillard, P., Deleu, T., and Lacoste-Julien, S. (2020). Gradient-based neural DAG learning. In International Conference on Learning Representations.
- Loh and Bühlmann, (2014) Loh, P.-L. and Bühlmann, P. (2014). High-dimensional learning of linear causal networks via inverse covariance estimation. The Journal of Machine Learning Research, 15(1):3065–3105.
- Lorch et al., (2022) Lorch, L., Sussex, S., Rothfuss, J., Krause, A., and Schölkopf, B. (2022). Amortized inference for causal structure learning. Advances in Neural Information Processing Systems, 35:13104–13118.
- Maathuis et al., (2010) Maathuis, M. H., Colombo, D., Kalisch, M., and Bühlmann, P. (2010). Predicting causal effects in large-scale systems from observational data. Nature methods, 7(4):247–248.
- Meek, (1995) Meek, C. (1995). Causal inference and causal explanation with background knowledge. In Proceedings of the Eleventh conference on Uncertainty in artificial intelligence, pages 403–410.
- Mooij et al., (2020) Mooij, J. M., Magliacane, S., and Claassen, T. (2020). Joint causal inference from multiple contexts. The Journal of Machine Learning Research, 21(1):3919–4026.
- Ng et al., (2020) Ng, I., Ghassami, A., and Zhang, K. (2020). On the role of sparsity and DAG constraints for learning linear DAGs. Advances in Neural Information Processing Systems, 33:17943–17954.
- Pamfil et al., (2020) Pamfil, R., Sriwattanaworachai, N., Desai, S., Pilgerstorfer, P., Georgatzis, K., Beaumont, P., and Aragam, B. (2020). DYNOTEARS: Structure learning from time-series data. In International Conference on Artificial Intelligence and Statistics, pages 1595–1605. PMLR.
- Pearl, (2009) Pearl, J. (2009). Causality. Cambridge University Press.
- Peters and Bühlmann, (2014) Peters, J. and Bühlmann, P. (2014). Identifiability of Gaussian structural equation models with equal error variances. Biometrika, 101(1):219–228.
- Peters et al., (2017) Peters, J., Janzing, D., and Schölkopf, B. (2017). Elements of causal inference: foundations and learning algorithms. The MIT Press.
- Rahimi and Recht, (2007) Rahimi, A. and Recht, B. (2007). Random features for large-scale kernel machines. Advances in neural information processing systems, 20.
- Reisach et al., (2021) Reisach, A., Seiler, C., and Weichwald, S. (2021). Beware of the simulated DAG! Causal discovery benchmarks may be easy to game. Advances in Neural Information Processing Systems, 34:27772–27784.
- Reisach et al., (2024) Reisach, A., Tami, M., Seiler, C., Chambaz, A., and Weichwald, S. (2024). A scale-invariant sorting criterion to find a causal order in additive noise models. Advances in Neural Information Processing Systems, 36.
- Runge et al., (2019) Runge, J., Bathiany, S., Bollt, E., Camps-Valls, G., Coumou, D., Deyle, E., Glymour, C., Kretschmer, M., Mahecha, M. D., Muñoz-Marí, J., et al. (2019). Inferring causation from time series in earth system sciences. Nature communications, 10(1):2553.
- Sachs et al., (2005) Sachs, K., Perez, O., Pe’er, D., Lauffenburger, D. A., and Nolan, G. P. (2005). Causal protein-signaling networks derived from multiparameter single-cell data. Science, 308(5721):523–529.
- Schaffter et al., (2011) Schaffter, T., Marbach, D., and Floreano, D. (2011). GeneNetWeaver: in silico benchmark generation and performance profiling of network inference methods. Bioinformatics, 27(16):2263–2270.
- Schölkopf, (2022) Schölkopf, B. (2022). Causality for machine learning. In Probabilistic and causal inference: The works of Judea Pearl, pages 765–804.
- Seng et al., (2024) Seng, J., Zečević, M., Dhami, D. S., and Kersting, K. (2024). Learning large DAGs is harder than you think: Many losses are minimal for the wrong DAG. In The Twelfth International Conference on Learning Representations.
- Shimizu et al., (2006) Shimizu, S., Hoyer, P. O., Hyvärinen, A., Kerminen, A., and Jordan, M. (2006). A linear non-Gaussian acyclic model for causal discovery. Journal of Machine Learning Research, 7(10).
- Shimizu et al., (2011) Shimizu, S., Inazumi, T., Sogawa, Y., Hyvärinen, A., Kawahara, Y., Washio, T., Hoyer, P. O., Bollen, K., and Hoyer, P. (2011). DirectLiNGAM: A direct method for learning a linear non-Gaussian structural equation model. Journal of Machine Learning Research, 12(Apr):1225–1248.
- Squires and Uhler, (2022) Squires, C. and Uhler, C. (2022). Causal structure learning: A combinatorial perspective. Foundations of Computational Mathematics, 23(5):1781–1815.
- Squires et al., (2022) Squires, C., Yun, A., Nichani, E., Agrawal, R., and Uhler, C. (2022). Causal structure discovery between clusters of nodes induced by latent factors. In Conference on Causal Learning and Reasoning, pages 669–687. PMLR.
- Verma and Pearl, (2013) Verma, T. S. and Pearl, J. (2013). On the equivalence of causal models.
- Villar et al., (2023) Villar, S., Hogg, D. W., Yao, W., Kevrekidis, G. A., and Schölkopf, B. (2023). Towards fully covariant machine learning. arXiv preprint arXiv:2301.13724.
- Vowels et al., (2022) Vowels, M. J., Camgoz, N. C., and Bowden, R. (2022). D’ya like DAGs? A survey on structure learning and causal discovery. ACM Computing Surveys, 55(4):1–36.
- Wienöbst et al., (2023) Wienöbst, M., Luttermann, M., Bannach, M., and Liskiewicz, M. (2023). Efficient enumeration of Markov equivalent DAGs. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 12313–12320.
- Yu et al., (2019) Yu, Y., Chen, J., Gao, T., and Yu, M. (2019). DAG-GNN: DAG structure learning with graph neural networks. In International Conference on Machine Learning, pages 7154–7163. PMLR.
- Zheng et al., (2018) Zheng, X., Aragam, B., Ravikumar, P. K., and Xing, E. P. (2018). DAGs with NO TEARS: Continuous optimization for structure learning. Advances in Neural Information Processing Systems, 31.
- Zheng et al., (2020) Zheng, X., Dan, C., Aragam, B., Ravikumar, P., and Xing, E. (2020). Learning sparse nonparametric DAGs. In International Conference on Artificial Intelligence and Statistics, pages 3414–3425. PMLR.
Appendix A Implied Models
In this section, we describe how to express the assignments of the observed variables of standardized SCMs and iSCMs with a general additive noise mechanism
(6) |
in the form of (SCM), while sharing the same causal graph and exogenous noise variables . We obtain the SCM form by moving the standardization steps into the causal mechanisms by linearly rescaling and , such that each observed variable is only a function of observed variables and the noise . Throughout this work, the implied (SCM) model denotes the specific construction given in the following two subsections. For this, we assume that we can express the first two moments of the system in closed form. Similar to the main text, we overload notation for both standardized SCMs and iSCMs and write
We also derive analytic expressions for the weights of the implied models of linear iSCMs defined by Equation (1), which we later use in our proofs.
A.1 Implied Model of a Standardized SCM
Let be modeled by (Standardized SCM) with causal mechanisms defined by Equation (6). We recall that are the observations obtained after standardizing . Thus, we can rearrange as
and substitute every unstandardized variable by a function of its standardized parents as
where denotes elementwise multiplication, and and are the vectors of the parent means and standard deviations before standardization. Thus, the assignments of in a standardized SCM can be written as the SCM given by
with implied noise scales and implied causal mechanisms
A.2 Implied Model of an iSCM
Let be modeled by (iSCM) with causal mechanisms defined by Equation (6). In an iSCM, are the observed variables and are the latent variables. We can express every observation in terms of its observed parents as
Thus, the assignments of in an iSCM can be written as the SCM given by
with implied noise scales and implied causal mechanisms
A.3 Weights of the Implied Model of a Linear iSCM
Here, we derive the analytical form for the mechanisms of the implied model of a linear iSCM with zero-centered, additive noise . This iSCM is given by
where satisfies and . We can write the above as
It follows that the implied SCM of a linear iSCM is also linear, with weights and noise variances given by
(7) |
In the above, we can write the variance of explicitly as
(8) | ||||
where follows from Bienaymé’s identity and from covariance being bilinear. Substituting the variance into the expressions for the weights and noise variances, we obtain
(9) | ||||
(10) |
Finally, by construction, the variables of an iSCM have unit marginal variances. Thus, when the parents of are pairwise independent, Equation 10 simplifies to
(11) |
This independence condition always holds when the DAG is a forest.
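The derivation in this subsection can be summarized in a few lines. The following is a sketch in assumed notation rather than a verbatim restatement of Equations (7) to (11): $\tilde{x}_i$ denotes the latent variable before standardization, $\mathbf{w}_i$ the weights on the parents $\mathbf{x}_{\mathrm{pa}(i)}$, $\sigma_i^2$ the noise variance, and hats mark the quantities of the implied SCM. Because the noise is zero-centered and the parents are standardized, $\tilde{x}_i$ has zero mean and standardization reduces to a rescaling.

```latex
\begin{align*}
\tilde{x}_i &= \mathbf{w}_i^\top \mathbf{x}_{\mathrm{pa}(i)} + \varepsilon_i,
\qquad
x_i = \frac{\tilde{x}_i}{\sqrt{\operatorname{Var}[\tilde{x}_i]}}
    = \hat{\mathbf{w}}_i^\top \mathbf{x}_{\mathrm{pa}(i)} + \hat{\varepsilon}_i, \\
\hat{\mathbf{w}}_i &= \frac{\mathbf{w}_i}{\sqrt{\operatorname{Var}[\tilde{x}_i]}},
\qquad
\hat{\sigma}_i^2 = \frac{\sigma_i^2}{\operatorname{Var}[\tilde{x}_i]}, \\
\operatorname{Var}[\tilde{x}_i]
&= \operatorname{Var}\!\big[\mathbf{w}_i^\top \mathbf{x}_{\mathrm{pa}(i)}\big] + \sigma_i^2
 = \mathbf{w}_i^\top \operatorname{Cov}\!\big[\mathbf{x}_{\mathrm{pa}(i)}\big]\, \mathbf{w}_i + \sigma_i^2, \\
\operatorname{Cov}\!\big[\mathbf{x}_{\mathrm{pa}(i)}\big] = \mathbf{I}
\;&\Longrightarrow\;
\hat{\mathbf{w}}_i = \frac{\mathbf{w}_i}{\sqrt{\lVert \mathbf{w}_i \rVert_2^2 + \sigma_i^2}},
\qquad
\hat{\sigma}_i^2 = \frac{\sigma_i^2}{\lVert \mathbf{w}_i \rVert_2^2 + \sigma_i^2}.
\end{align*}
```

The first equality in the variance line uses the independence of the noise from the parents (Bienaymé's identity), the second uses the bilinearity of the covariance, and the last line is the special case of pairwise independent parents with unit marginal variances. As a consistency check, $\hat{\mathbf{w}}_i^\top \operatorname{Cov}[\mathbf{x}_{\mathrm{pa}(i)}]\, \hat{\mathbf{w}}_i + \hat{\sigma}_i^2 = 1$, as required for a standardized variable.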
Efficient computation
We can efficiently compute the implied model weights using a bottom-up dynamic programming approach. This allows sampling data directly from the exact implied model of an iSCM without resorting to empirical standardization statistics. Algorithm 2 describes the procedure. We iteratively compute the weights and noise variances of the implied model following Equations (9) and (10). At each iteration, we update the covariance matrix according to Lemma 1. The algorithm processes the nodes in topological order, mirroring the proof by induction of Lemma 1.
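The bottom-up computation described above can be sketched in plain Python. This is an illustrative sketch, not Algorithm 2 itself: the function name `implied_linear_iscm`, the convention that `W[j][i]` is the weight of edge j → i, and the assumption that nodes are already indexed in topological order are all choices made here. For the covariance update, the sketch uses the standard recursion Cov(x_i, x_k) = Σ_j ŵ_ji Cov(x_j, x_k) for earlier nodes k, which should agree with the path-sum expression of Lemma 1 for systems with unit marginal variances.

```python
import math


def implied_linear_iscm(W, noise_var):
    """Implied linear SCM of a linear iSCM with zero-centered noise.

    Assumptions of this sketch: W[j][i] is the weight of edge j -> i, and
    nodes are indexed in topological order (W[j][i] == 0 whenever j >= i).
    Returns the implied weights, implied noise variances, and the covariance
    matrix of the standardized variables.
    """
    d = len(W)
    # Every iSCM variable has unit marginal variance by construction.
    cov = [[1.0 if a == b else 0.0 for b in range(d)] for a in range(d)]
    W_hat = [[0.0] * d for _ in range(d)]
    var_hat = [0.0] * d
    for i in range(d):
        # Variance of the latent variable: w^T Cov[x_pa] w + sigma^2.
        latent_var = noise_var[i] + sum(
            W[j][i] * cov[j][k] * W[k][i] for j in range(i) for k in range(i)
        )
        scale = math.sqrt(latent_var)
        for j in range(i):
            W_hat[j][i] = W[j][i] / scale
        var_hat[i] = noise_var[i] / latent_var
        # Covariance of x_i with each earlier variable x_k.
        for k in range(i):
            c = sum(W_hat[j][i] * cov[j][k] for j in range(i))
            cov[k][i] = cov[i][k] = c
    return W_hat, var_hat, cov
```

For a node with independent parents, for example two root parents with weights 1.5 and -2 and unit noise variance, the implied noise variance returned by this sketch is 1 / (1.5² + 2² + 1), matching the simplified expression for pairwise independent parents.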
Appendix B Interventions in iSCMs
For an iSCM , we can formalize interventions as changes to its causal mechanisms , analogous to the common definition for SCMs (Peters et al., 2017). Specifically, let and be the mean and standard deviation of the latent variable . We define an intervention as replacing one (or several) of the assignments to the latent variables as
for some function . Importantly, the statistics and used for the standardization operation
remain unchanged. Thus, if we intervene on mechanisms of iSCMs, the variables may no longer have zero mean and unit variance, and the perturbations of propagate downstream through the causal mechanisms. We note that, under the above definition, intervening on an iSCM through a new mechanism is equivalent to intervening on the implied SCM of an iSCM with the mechanism
Appendix A.2 provides details on the implied models of iSCMs.
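To make the role of the fixed standardization statistics concrete, the following sketch samples from a linear iSCM and then applies an intervention while reusing the observational means and standard deviations. It is a minimal illustration under several assumptions made here: the function `sample_linear_iscm` and its arguments are hypothetical, the noise is standard Gaussian, nodes are topologically ordered with `W[j][i]` the weight of edge j → i, and the statistics are estimated empirically from the observational sample rather than computed in closed form.

```python
import random
import statistics


def sample_linear_iscm(W, n, seed, stats=None, do=None):
    """Sample n observations from a linear iSCM with standard Gaussian noise.

    stats: optional list of (mean, std) per node; when given, these fixed
           statistics are used for standardization instead of re-estimating.
    do:    optional dict mapping a node index to a function g(row, rng) that
           replaces the assignment of that node's latent variable.
    """
    rng = random.Random(seed)
    d = len(W)
    X = [[0.0] * d for _ in range(n)]
    used_stats = []
    for i in range(d):
        if do is not None and i in do:
            latent = [do[i](row, rng) for row in X]
        else:
            latent = [
                sum(W[j][i] * row[j] for j in range(i)) + rng.gauss(0.0, 1.0)
                for row in X
            ]
        if stats is None:
            mu, sd = statistics.fmean(latent), statistics.pstdev(latent)
        else:
            mu, sd = stats[i]  # statistics stay unchanged under intervention
        used_stats.append((mu, sd))
        for row, z in zip(X, latent):
            row[i] = (z - mu) / sd
    return X, used_stats


# Observational data, then an intervention that fixes the latent of node 0.
W = [[0.0, 1.0], [0.0, 0.0]]
X_obs, obs_stats = sample_linear_iscm(W, 2000, seed=0)
X_int, _ = sample_linear_iscm(
    W, 2000, seed=1, stats=obs_stats, do={0: lambda row, rng: 3.0}
)
```

In this example the observational columns have zero mean and unit variance, whereas after the intervention node 0 is constant and the mean of node 1 is shifted away from zero, which illustrates how perturbations propagate downstream when the standardization statistics are held fixed.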
Appendix C Proofs
C.1 Definitions
We define the key concepts used throughout our analysis. A path between and is a set of directed edges that allows reaching from (and vice versa), not taking into account edge directionality, and that joins unique vertices. We call a node a collider in a path if the node has two ingoing directed edges in the path. We say that a path between and is unblocked if and only if there is no node that is a collider in the path (see Figure 9(a)). Finally, we use the term undirected connected component to refer to any maximal subgraph of in which any two nodes are connected by a path containing only undirected edges (Wienöbst et al., 2023).
C.2 Explicit Covariance in Linear SCMs with Unit Marginal Variances
See Lemma 1.
Proof.
We will give a proof by induction on the number of vertices in the DAG . Without loss of generality, we assume that the indices of the nodes are ordered according to some fixed topological ordering , so if . By the unit marginal variance assumption,
(12) |
From now on and without loss of generality, we consider two arbitrary indices . The covariance between and is symmetric.
Base case ()
If is not an ancestor of in graph , they both must be root nodes, because the edge is the only possible edge when . Since and are root nodes, they are independent and . Since a path of one edge cannot contain a collider, there are no unblocked paths between and , so the RHS of Equation (3) is also .
Conversely, if is an ancestor of in graph , is the only parent and ancestor of . This implies that
where the last equality follows from Equation (12). This is exactly Equation (3) for a two-node graph.
Induction step ()
Let us assume that Equation (3) holds for all graphs of size , and let have nodes. We will apply the inductive hypothesis to the subgraph of the first nodes in and show that the full DAG including the -th vertex still satisfies Equation (3). First, we note that, since the -th vertex is last in the topological ordering, it has no outgoing edges. Because the node has no outgoing edges, it is not visited on any unblocked paths between and for , as must be a collider in any path. Second, adding the node to a subsystem containing results in no change to the joint distribution of . Therefore, it has no effect on the covariance between . Hence, both sides of Equation 3 are unchanged by the presence of a node for all and the equation still holds for all .
We want to show that Equation 3 also holds for and any . For this, we first construct all unblocked paths from to . First, we note that any unblocked path must go through the parents , because in the topological ordering (see Figure 6). Moreover, for any , appending to an unblocked path between and , creates a new unblocked path between and . Hence, for and any , it holds that
For step , consider two cases. If , then and the equality trivially holds. If , then it holds by pulling the term for out of the sum in the previous line. In , we apply the inductive hypothesis to express the covariances in terms of a sum of products of weights. In , we rearrange terms to pull the term into the sum over parents. In , we use the fact that the set of unblocked paths from to corresponds to all paths from to any parent of , which is here, with an extra edge appended, and a possible single-edge path directly connecting with (if ).
This completes the induction step and the proof. ∎
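The statement proved above, that in a linear SCM with unit marginal variances the covariance between two variables equals a sum over unblocked paths of the products of the edge weights along each path, can be checked numerically on a small example. The sketch below is an illustration under assumptions made here: the helper names are hypothetical, `W[j][i]` is the weight of edge j → i, nodes are topologically ordered, and the noise variances are implicitly chosen so that every variable has unit variance. The reference covariance is computed with the usual recursion over the topological order, independently of the path enumeration.

```python
def unblocked_path_sum(W, i, j):
    """Sum over unblocked paths between i and j of the product of weights.

    A path visits each vertex at most once and contains no collider, so once
    an edge has been traversed along its direction, no later edge may be
    traversed against its direction.
    """
    d = len(W)

    def walk(node, visited, product, went_forward):
        if node == j:
            return product
        total = 0.0
        for nxt in range(d):
            if nxt in visited:
                continue
            if W[node][nxt] != 0.0:  # edge node -> nxt, traversed forward
                total += walk(nxt, visited | {nxt}, product * W[node][nxt], True)
            if W[nxt][node] != 0.0 and not went_forward:
                # edge nxt -> node, traversed backward (allowed only before
                # the first forward step, otherwise `node` would be a collider)
                total += walk(nxt, visited | {nxt}, product * W[nxt][node], False)
        return total

    return walk(i, {i}, 1.0, False)


def covariance_unit_variance_scm(W):
    """Covariance of a linear SCM whose noise variances give unit marginals."""
    d = len(W)
    cov = [[1.0 if a == b else 0.0 for b in range(d)] for a in range(d)]
    for i in range(d):
        explained = sum(
            W[j][i] * cov[j][k] * W[k][i] for j in range(i) for k in range(i)
        )
        if explained >= 1.0:
            raise ValueError("weights too large for unit marginal variance")
        for k in range(i):
            cov[k][i] = cov[i][k] = sum(W[j][i] * cov[j][k] for j in range(i))
    return cov
```

For instance, in a four-node DAG with edges 0 → 1, 0 → 2, 1 → 2, 1 → 3, and 2 → 3, the unblocked paths between nodes 1 and 3 are 1 → 3, 1 → 2 → 3, and 1 ← 0 → 2 → 3, while 1 → 3 is never reached through a collider.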
C.3 Bound on the Fraction of CEV
See Theorem 2.
Proof.
We begin by bounding the variance of the latent variables in iSCMs. Starting from Equation (8), we can bound the covariances with a product of unit variances as
where uses since and , and applies the Cauchy-Schwarz inequality. Since we obtain from just by shifting and scaling the latter, we observe that . Using the upper bound on the variance of and the definition of the fraction of cause-explained variance in Equation (4), we get
∎
C.4 Identifiability
In this section, we prove Theorems 3 and 4. We begin by deriving the covariances for the 3-node example in Section 4.2 and then give the general proofs for forests. The proofs of both theorems share the same underlying argument. We first derive the SCM forms of the original models, i.e., standardized SCMs in Theorem 3 and iSCMs in Theorem 4. By showing that the standardized SCMs and iSCMs are SCMs with the same causal graphs and observational distributions , we can leverage Lemma 1 to obtain the covariances between the observed variables in both model classes. Ultimately, these covariances allow us to derive (non)identifiability conditions for the DAGs in an MEC underlying the original models.
Theorems 3 and 4 assume that the exogenous noise is sampled from a zero-centered distribution with equal variance across variables. Since the results are based on the analysis of covariances, they also hold with the assumption that , but the zero-mean assumption simplifies notation. To derive the results for iSCMs, we additionally assume that the noise is Gaussian (see Theorem 4). When referring to an undirected edge between nodes , for example, in an MEC, we still denote the edge with , but the ordering of the nodes is arbitrary.
C.4.1 3-Node Case
We begin by studying the 3-node example of Figure 3 in Section 4.2. Let be linear function weights, and consider the following three causal graphs belonging to the same MEC, along with their corresponding SCMs and iSCMs.
SCM
iSCM
(13) | ||||
(14) | ||||
(15) | ||||
(16) | ||||
In the following subsections, we derive the covariance matrices of each of the three systems, respectively. This leads us to the equivalence presented in Equation (5) for standardized SCMs. Moreover, we show that, for iSCMs, all three systems induce exactly the same observational distribution if and only if and . These are the 3-node special cases of Theorems 3 and 4.
Standardized SCM
To obtain the covariances between the observed variables in the standardized SCMs of Equations (13), (15), and (17), we first show that the assignments to the observed variables in standardized SCMs can be written in the form of linear SCMs over the same causal graph, which allows us to use Lemma 1. In all three systems, every vertex has at most one parent. When the node is the only parent of , under our assumptions on the noise, we have , so the assignment of can be written in the form of an SCM over as
(19) |
To use Equation (19), we first need to compute the marginal variances of the unstandardized observations . For the standardized SCMs, these marginal variances are, respectively:
Given Equation (19) and the marginal variances, we know the weights of all three implied SCMs explicitly. Since all implied SCMs are linear, have unit marginal variances, and share the same causal graph, we can apply Lemma 1 and obtain the covariances of the observational distributions in the original models:
In the standardized SCM (13), the causal graph is . Hence, the edge directions of the DAG are consistent with the direction of increasing absolute covariance if and only if
(20) |
In the above equivalences, we always multiply or divide by quantities greater than , so the direction of the inequality does not change, and transformations are equivalent. For the standardized SCM (17) with causal graph , we get an analogous condition for the edges to be aligned with the order of increasing absolute covariance when following the same algebraic manipulations:
We make use of both of these conditions in Section 4. Since for any , the right-hand sides of both conditions are true if all weights are greater than . In this case, the absolute covariance increases downstream in all SCMs of Equations (13) and (17). Hence, among these two systems, only the DAG whose edges align with the covariance ordering in the observed can induce , and we can conclude that the other DAG is not the true causal graph.
iSCM
To derive the observational distributions of the iSCMs in Equations (14), (16), and (18), we proceed in the same way as we did for standardized SCMs. We first show that the iSCM is an SCM with a specific set of mechanisms and then apply Lemma 1 to obtain the covariances between the observed variables. To see this, we write the assignment of as
(21) |
As before, using Equation (21) requires first computing the marginal variances of the latent variables . For the iSCMs defined by Equations (14), (16), and (18), they are given by
for Equation (14): | for Equation (16): | for Equation (18): |
---|---|---|
Given Equation (21) and the marginal variances, we obtain an explicit form for the weights of all three implied SCMs. Since the implied SCMs are linear, have unit marginal variances, and share the same causal graph, we can apply Lemma 1 and obtain the covariances of the observational distributions in the original models. It turns out that the observational distribution of all three ground-truth systems in Equations (14), (16), and (18) is a multivariate Gaussian with the same covariance matrix, with the diagonal elements equal to 1 and the off-diagonal elements given by
(22) |
Since the observational distribution of all three SCMs is a zero-centered multivariate Gaussian, the distributions are equal if and only if their covariance matrices are identical. The covariances are equal if and only if and , because the function appearing in and of Equation (22) is injective for any , which means that distinct weights are mapped to distinct covariances. Therefore, the three-node linear iSCMs in the above MEC share the same observational distribution if and only if they also share the same weights for each edge, regardless of edge orientation.
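The contrast between the two model classes in the 3-node case can be reproduced numerically. The sketch below makes several assumptions that are not spelled out above: the three graphs are taken to be the chain x1 → x2 → x3, the fork x1 ← x2 → x3, and the reversed chain x1 ← x2 ← x3 (indexed 0, 1, 2 in the code), all noise variances equal 1, and the helper `correlations` is a hypothetical name. It computes the covariance of the observed variables either after post-hoc standardization of the SCM or under internal standardization.

```python
import math


def correlations(W, order, internal):
    """Covariance of the observed variables of a linear model with unit noise.

    W[j][i] is the weight of edge j -> i and `order` is a topological order.
    internal=True standardizes each variable during generation (iSCM);
    internal=False standardizes the SCM data post hoc.
    """
    d = len(W)
    cov = [[0.0] * d for _ in range(d)]
    for pos, i in enumerate(order):
        earlier = order[:pos]
        w = {j: W[j][i] for j in earlier if W[j][i] != 0.0}
        var_i = 1.0 + sum(w[j] * cov[j][k] * w[k] for j in w for k in w)
        scale = math.sqrt(var_i) if internal else 1.0
        for k in earlier:
            cov[i][k] = cov[k][i] = sum(w[j] * cov[j][k] for j in w) / scale
        cov[i][i] = var_i / scale**2
    # Post-hoc standardization (a no-op when variances are already 1).
    return [
        [cov[a][b] / math.sqrt(cov[a][a] * cov[b][b]) for b in range(d)]
        for a in range(d)
    ]


a, b = 1.5, 1.5  # weight on the edge x1 - x2 and on the edge x2 - x3
graphs = {
    "chain": ([[0, a, 0], [0, 0, b], [0, 0, 0]], [0, 1, 2]),
    "fork": ([[0, 0, 0], [a, 0, b], [0, 0, 0]], [1, 0, 2]),
    "reversed": ([[0, 0, 0], [a, 0, 0], [0, b, 0]], [2, 1, 0]),
}
standardized = {k: correlations(W, o, False) for k, (W, o) in graphs.items()}
iscm = {k: correlations(W, o, True) for k, (W, o) in graphs.items()}
```

With weights of magnitude above 1, the standardized chain has a smaller absolute covariance on its upstream edge than on its downstream edge, and the reversed chain shows the opposite ordering, so the two orientations are distinguishable. Under internal standardization, all three graphs yield the same covariance matrix, with each edge covariance equal to w / sqrt(w² + 1) for that edge's weight w.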
C.4.2 Forests
In this section, we generalize the above partial identifiability result for standardized SCMs to arbitrary forest DAGs (Theorem 3). After that, we similarly generalize the nonidentifiability of iSCMs to forests (Theorem 4). Our results concern the identification of edge directions in an MEC represented by its partially directed graph , where contains both directed and undirected edges.
Standardized SCM
Before proving the main theorem, we extend the 3-node example to chains of arbitrary length. We show that all but at most one edge in the MEC can be correctly oriented from observational data using the assumption on the support of the weights. Analogous to the 3-node case, we then use this to prove a similar result for forest graphs.
Lemma 5 (Orientation of edges in undirected chains of standardized SCMs).
Let be modeled by a standardized linear SCM (1) with chain DAG , where for non-root nodes and for all . Additionally, suppose contains no colliders. Then, given and the partially directed graph representing the MEC of , we can identify all but at most one edge of the true DAG in each undirected connected component of the MEC . The possible undirected edge has the smallest absolute covariance of all variables connected by edges in the MEC, satisfying for all .
Proof.
Throughout the proof, we label the nodes such that and are its neighbors for . We start with the analysis of three arbitrary, consecutive vertices in a chain graph. The three possible subgraphs are depicted in Figure 7. We can always find such that the variance of the latent root of this directed subgraph is . This relaxed assumption on specifically the root node allows for the root of the subgraph to have potential parents outside the subgraph, or to be the root of the whole chain, when later using this lemma to prove the main theorem.
We will follow similar derivations as in Section C.4.1. Specifically, we first write the observed variables of the standardized SCM in SCM form, and then invoke Lemma 1 to obtain the covariances of the observed variables. To use Equation 19, we again need to compute the marginal variances of the variables before standardization. For the subsystems in Figures 7(a) and 7(b), these are, respectively:
for Figure 7(a): | for Figure 7(b): |
---|---|
By substituting the expressions for the marginal variances into Equation 19, we obtain the weights of the implied models of the standardized SCM. Using Lemma 1, we obtain the covariances between the observed variables . By construction, the marginal variances of the observed variables are equal to . We treat each subsystem separately:
Subsystem 1 (Figure 7(a))
Given the marginal variances and Lemma 1, the covariances are
Following the same algebraic manipulations as in Equation 20, substituting and in the derivation, we obtain
(23) |
The left-hand side of the right-hand inequality in Equation 23 is upper-bounded by , similar to the 3-node case. Therefore, if we assume that , it must hold that for any choice of .
Subsystem 2 (Figure 7(b))
Given the marginal variances and Lemma 1, the covariances are
The ordering of the covariances in this case depends on the specific choice of the weights.
Subsystem 3 (Figure 7(c))
Following steps analogous to the symmetric subsystem 1, we conclude that, if , it must hold that for any .
Given the above, we can now study the relationship between the underlying DAG and the absolute covariance magnitudes under the assumption that . We will use the fact that, if the chain does not contain a collider, then there can be at most one node contained in edges pointing in opposite directions.
First, we treat the case where there exists a vertex such that , that is, where some neighboring covariances are equal. If this occurs in a 3-node subsystem, only subsystem 2 can describe the true graph. To be consistent with the assumption that there are no colliders in the graph (see Lemma 5), all other edges must be oriented in a direction away from , which completely identifies the graph in the MEC.
In the second case, holds for all nodes that have two neighbors in the path. Let be the unique pair of consecutive variables in the chain that minimizes . We can show that this pair is the unique minimizer using a proof by contradiction. Suppose there exist two pairs and such that is the minimum covariance. Without loss of generality, let . Then, the triple is consistent with only subsystems 2 or 3 based on their relative covariances, which implies that we must have . Using the fact that we have no colliders, we can then orient all edges for . Thus, we can find a subsystem containing , which has been already oriented as subsystem 3, meaning , a contradiction.
Given is the unique pair of consecutive variables that minimizes , we now show that we can orient all edges except . We will do this in two parts. First, we show that one can orient all edges with , and then we show that we can do the same for all edges with . If , consider the subsystem . Since , only subsystems 2 and 3 are possible for this subgraph. We can therefore orient . Similarly, if , by a symmetric argument on , we can orient . Since the graph cannot contain colliders, all other edges must be oriented as for , and for . In other words, all edges except point away from the two vertices , and one of the two variables must be the root of the chain. Therefore, if holds for all vertices that have two neighbors, then there exists a unique covariance minimizing pair , and all edges except are oriented.
The two cases above are exhaustive, and in the worst case at most one edge is left unoriented in the chain. This edge always corresponds to the minimizer of . This completes the proof. ∎
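The case analysis of the proof translates into a short orientation procedure for a collider-free chain. The sketch below is an illustration under the assumptions of Lemma 5 and is not taken from the paper: the function name `orient_chain` is hypothetical, the input is the list of absolute covariances between consecutive nodes of the standardized data, and equality of neighboring covariances is tested up to a numerical tolerance `tol` chosen here.

```python
def orient_chain(abs_cov, tol=1e-9):
    """Orient the edges of a collider-free chain from absolute covariances.

    abs_cov[k] is |Cov(x_k, x_{k+1})| for the edge joining nodes k and k+1.
    Returns (directed, undirected), where `directed` is a list of
    (parent, child) pairs and `undirected` is the single edge left
    unoriented, or None if every edge could be oriented.
    """
    m = len(abs_cov)  # number of edges in the chain
    # Case 1: two neighboring covariances are equal, so their shared node is
    # the root and all edges point away from it.
    for i in range(1, m):
        if abs(abs_cov[i - 1] - abs_cov[i]) <= tol:
            directed = [(k + 1, k) for k in range(i)]
            directed += [(k, k + 1) for k in range(i, m)]
            return directed, None
    # Case 2: the pair with the smallest absolute covariance stays
    # undirected and all remaining edges point away from it.
    k_min = min(range(m), key=abs_cov.__getitem__)
    directed = [(k + 1, k) for k in range(k_min)]
    directed += [(k, k + 1) for k in range(k_min + 1, m)]
    return directed, (k_min, k_min + 1)
```

As a usage example, for the chain 0 → 1 → 2 → 3 with all weights equal to 1.5 and unit noise variances, the absolute covariances after standardization are approximately 0.832, 0.938, and 0.974, so `orient_chain([0.832, 0.938, 0.974])` orients 1 → 2 and 2 → 3 and leaves the edge between nodes 0 and 1 undirected.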
Remark
From the proof of Lemma 5, it follows that if we are able to orient all the edges in the chain, then the root of the chain is the node joining the two edges with minimum absolute covariance. When we orient all but one edge , the root node of the chain is either or .
We can extend Lemma 5 to forest graphs. For this, we will make use of the first Meek rule (Meek, 1995). The first Meek rule concerns an MEC , containing the undirected edges but not the edge . It states that, if one can orient , we must have .
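For concreteness, the first Meek rule in its standard form says: if a → b is directed, b and c are joined by an undirected edge, and a and c are not adjacent, then the edge must be oriented as b → c, since the opposite orientation would create a new collider at b. The following sketch applies this rule repeatedly until nothing changes. The function name `apply_first_meek_rule` and the edge representation (pairs for directed edges, unordered pairs for undirected ones) are choices made here for illustration.

```python
def apply_first_meek_rule(directed, undirected):
    """Repeatedly apply the first Meek rule to a partially directed graph.

    directed:   iterable of (parent, child) pairs.
    undirected: iterable of 2-element collections of nodes.
    Returns the updated (directed, undirected) edge sets.
    """
    directed = set(directed)
    undirected = {frozenset(edge) for edge in undirected}

    def adjacent(a, c):
        return (
            (a, c) in directed
            or (c, a) in directed
            or frozenset((a, c)) in undirected
        )

    changed = True
    while changed:
        changed = False
        for a, b in list(directed):
            for edge in list(undirected):
                if b not in edge:
                    continue
                (c,) = edge - {b}
                # a -> b, b - c, and a, c non-adjacent imply b -> c.
                if c != a and not adjacent(a, c):
                    undirected.discard(edge)
                    directed.add((b, c))
                    changed = True
    return directed, undirected
```

In a tree, a single directed edge therefore propagates: starting from 0 → 1 with undirected edges 1 – 2, 2 – 3, and 1 – 4, the rule orients every remaining edge away from node 1.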
See Theorem 3.
Proof.
The undirected parts of an MEC are disjoint undirected connected components. Orienting the edges in all these undirected connected components without introducing a v-structure produces a valid DAG in (Andersson et al., 1997). Each undirected connected component represents a Markov equivalence class of its own (Andersson et al., 1997). Thus, to prove the theorem, we consider these undirected connected components independently with respect to the rest of the graph and show how to orient the edges in each undirected connected component. (Footnote 1: Orienting edges of an undirected connected component that touch a directed edge in never introduces an additional v-structure. If a directed edge pointed into the undirected connected component, the undirected edge downstream would have had to already be directed in by the first Meek rule. Hence, all directed edges bordering the undirected connected component must be oriented away from it, and none of the possible undirected edge orientations creates a new collider at the border node. This implies that all undirected connected components in are upstream of the colliders and directed subgraphs of .) In the following argument, we therefore consider to be a single undirected connected component, with no directed edges by definition, and show that we can orient all but one edge in . This argument then extends to all undirected connected components of the original MEC , implying the statement made in Theorem 3.
If is an undirected connected component with no directed edges, we only have to consider SCMs with a ground-truth DAG that are members of this MEC to distinguish among possible edge orientations in . In the case of undirected trees, the ground-truth DAG must be a tree with no colliders and the same skeleton as , since any other DAGs would belong to a different MEC.
We give a proof by strong induction on the number of vertices in the MEC . The base case of the induction argument is an MEC with nodes. This case holds trivially, since this MEC can contain at most one undirected edge. For the inductive step, we consider an undirected tree MEC with and assume that we can orient all but one edge of undirected tree MECs with .
Our argument will proceed by considering the longest chain of the undirected tree . We will use Lemma 5 to orient all but at most one edge in this chain and then apply the first Meek rule to possibly orient additional edges in outside the chain. After orienting these edges, we show that we reduced the original problem of orienting all but one edge in with to orienting all but one edge in a single undirected connected component that has strictly fewer than nodes. This allows us to apply the inductive hypothesis and complete the proof (see Figure 8).
Consider a longest undirected chain that is a subgraph of the undirected tree . Let refer to the directed subgraph of the DAG induced by considering only the vertices . We label the vertices in as , with undirected edges for all . The nodes can have no undirected neighbours in outside the chain, because otherwise we could construct a longer chain in .
The only vertex in that can have a parent in the DAG outside the chain , that is, in , is the unique root of . To see this, we first note that all nodes have at most one parent in , because any with in would be a collider, but contains no colliders. Since non-root nodes in have an in-chain parent, they cannot have a parent outside of . Therefore, besides the root node of via its potential outside parent, is a completely disconnected subgraph from the rest of . This implies that we may treat as a separate standardized SCM with undirected chain MEC, in which the potential parent of the root of is modeled as part of the exogenous noise of the root. This allows us to apply Lemma 5 to the variables of the subgraph .
By applying Lemma 5 to , we can orient all but at most one undirected edge in . We split the resulting analysis into the two cases of Lemma 5, which leave either zero or one undirected edge. In the first case, we can orient all edges in with Lemma 5. In this case, we know that the root of is the node (see Remark of Lemma 5). By the first Meek rule, we can recursively orient all additional edges in outside of away from , except for the subtrees of connected to itself (Figure 8). This leaves at most a single connected undirected subtree containing and strictly fewer than vertices.
In the second case, we orient all but one edge in by applying Lemma 5. In this case, we know that the root of is either the node or (see Remark of Lemma 5). Similar to the first case, we can recursively use the first Meek rule to orient all additional edges in pointing away from and , except for the subtrees of connected to and itself. Since and are connected by an undirected edge, we are left with a single connected subtree containing the undirected edge that is strictly smaller than before.
In both cases, we orient at least one undirected edge of , because the longest undirected chain in with has at least length . We always obtain at most a single undirected connected tree component with strictly fewer than vertices, allowing us to apply the inductive hypothesis and complete the proof.
∎
iSCM
See Theorem 4.
Proof.
Because we consider linear iSCMs with Gaussian noise, the implied model is a linear SCM with additive Gaussian noise (see Section A.2). Hence, the observational distribution is a multivariate Gaussian with mean zero. In iSCMs, the marginal variance of an observed variable is always . Hence, we prove the statement if we show that for all in the iSCM with graph , and the corresponding in the iSCM with graph , .
Let and be the random variables associated with the nodes and from , respectively. We consider two cases. First, if there is no path between and in the skeleton of , then there is no path between and in the skeleton of , and hence both covariances are zero. In the second case, there is a path between and in the skeleton of , so there also exists a path in the skeleton of , as both graphs have the same skeleton. Due to the acyclicity of the skeleton in forests, this path is the only one connecting and in both and .
We further break this second case into two subcases. In the first subcase, this path contains a collider in as shown in Figure 9(a). Because the skeleton cannot have undirected cycles under the forest assumption, this collider forms a v-structure. Membership in the same MEC implies that the same v-structure must be present in . Hence, and are d-separated in both and . By the global Markov condition, this implies that and are independent, and that and are independent. This implies that both covariances are zero.
In the second subcase, there exists an unblocked path between and in both and . Here, we denote the weight matrix associated with both iSCMs by , with being symmetric, so that is the linear weight of the edge regardless of its orientation in the graph.
We now derive the analogous weights in the implied SCMs for respectively. Ultimately, we will demonstrate that the implied SCMs have the same weights. Specifically, we will show that . Given this, Lemma 1 implies that both iSCMs have the same covariance matrix over the observed variables.
Without loss of generality, since the node labelling is arbitrary, let have at least as many incoming edges as in . We divide the analysis into two cases: having only one parent in , and having more than one parent. The node must have at least one parent, since at least one of has an incoming edge in , and we chose to have at least as many incoming edges as .
More than one parent in
We know that any collider in will appear as part of a v-structure in due to the forest assumption, and therefore will also be a collider in . Therefore, if has more than one parent in (see Figure 9(b)), all pairs of edges incoming to will form v-structures, so must have exactly the same set of parents in .
Moreover, any two parents of are d-separated in and by the forest assumption, since the blocked path going through is the only path connecting them. By the global Markov condition, the parents are pairwise independent. Hence, we can use Equation (11) to compute . Since the parent sets are the same between the two graphs, and is shared between the two iSCMs, the weight associated with the edge in both graphs in the implied models is given by
(24) |
A single parent in
Let be the only incoming edge to in , as depicted in Figure 9(c). Then, the edge connecting and in is either the only incoming edge to or the only incoming edge to . To see this, suppose that it was not the only incoming edge to or in . This would make or a collider that would be common to both graphs, implying that or would have at least two parents in . We operate under the assumption that has at least as many parents as , so it would imply that has more than one parent, contradicting the assumption we made for the case we consider in this paragraph. Irrespective of the direction, the weight associated with the edge in the skeleton of both graphs in the implied model is, similar to Equation (21), given by
(25) |
Equations (24) and (25) show that, for the SCM form of each iSCM, the edges connecting the same nodes irrespective of their direction in and have the same weights. By Lemma 1, the covariance between any and can be expressed as a product of the weights in the implied SCM corresponding to the edges on the path between . Hence, . ∎
![Refer to caption](x13.png)
Remark
In Figure 11, we empirically demonstrate that Theorem 4 no longer holds if we drop the forest assumption. For data generated from an iSCM and two graphs from the same with the same weights assigned to the skeleton edges, we observe that the estimated covariances differ. The two systems entail different observational distributions.
![Refer to caption](x14.png)
![Refer to caption](x15.png)
Appendix D Background on Related Work
D.1 Heuristics for Mitigating Variance Accumulation and -sortability in SCMs
Here, we review existing heuristics for avoiding the exploding variance in structure learning benchmarking with linear SCMs as defined in Equation (1). We also describe how these heuristics limit the causal dependencies that can be modeled, in terms of the correlations among the SCM variables or their cause-explained variance. Neither limitation arises in linear iSCMs.
Scaling weights by the inverse weight norm
Mooij et al., (2020, Section 5.2) sample the edge weights in linear SCMs as . To achieve a comparable variance of each variable in the SCM, they propose re-scaling the sampled weights prior to the data-generating process as
If all parents of are i.i.d. Gaussian with variance , this adjustment ensures that the variance of is similar for all . However, this approximation does not take into account the covariances of the parents. Moreover, since is unchanged, the scaling limits the strength of the causal effect that parents can have on . For example, when and with as for Mooij et al., (2020), the adjusted weight is . Thus, for any , we have
This is the maximum correlation between neighbouring variables that any SCM can model under the proposed re-scaling when , since additional parents decrease the parent-child correlations. By contrast, iSCMs can model any level of correlation by sampling arbitrary values of , while guaranteeing unit-variance observations . Intuitively, iSCMs achieve this by standardizing after the exogenous noise is added to the endogenous contributions of the parents , while weight scaling is done before is added to .
Scaling weights by the incoming variance
Squires et al., (2022, Section 5.1) sample the weights of linear SCMs as . Given the initial edge weights, they propose adjusting the weights during the generative process by first estimating the variance of from samples drawn under an initial level of additive noise with and then re-scaling the weights as
When using additive noise with to generate the actual samples, this scaling results in with a constant fraction of cause-explained variance . In benchmarks, however, we may be interested in evaluating SCMs with arbitrary levels of cause-explained variance. iSCMs allow this by construction. Contrary to Squires et al., (2022), iSCMs scale the variables rather than the weights while leaving the exogenous noise unchanged, which enables modeling arbitrarily small or large levels of unexplained variation.
D.2 Sortability Metrics
In this section, we describe the definition of a sortability metric as introduced by Reisach et al., (2024), which we use in Section 5. For a function , -sortability assigns a scalar in to the variables and graph (with weight matrix ) as
and is the -th power of the adjacency matrix and if and only if at least one directed path from to of length exists in . If , we obtain -sortability from Reisach et al., (2021). If
we obtain -sortability. Estimating requires performing regression of onto .
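The following is a minimal sketch of how such a sortability score could be computed, assuming the path-counting form of the metric from Reisach et al. (2021, 2024): for every pair of nodes joined by a directed path of length k, count whether the chosen statistic increases along the path, with ties counted as one half. The function names `tau_sortability` and `r2_scores` are hypothetical, and `r2_scores` assumes ordinary linear regression, computed from the inverse sample covariance matrix. It is not the implementation used for the reported experiments.

```python
import numpy as np

def tau_sortability(scores, A):
    """Fraction of directed paths along which `scores` increases (ties count 1/2).

    scores: length-d array, one statistic per variable (e.g. marginal variance).
    A: (d, d) array with A[i, j] != 0 iff the graph has an edge i -> j.
    """
    scores = np.asarray(scores, dtype=float)
    E = (np.asarray(A) != 0).astype(int)
    d = E.shape[0]
    diff = scores[None, :] - scores[:, None]  # diff[i, j] = scores[j] - scores[i]
    reach = E.copy()  # indicator of directed paths of length k, starting at k = 1
    num, den = 0.0, 0
    for _ in range(d - 1):
        mask = reach > 0
        num += np.sum(mask & (diff > 0)) + 0.5 * np.sum(mask & (diff == 0))
        den += np.sum(mask)
        # Indicator of paths of length k + 1 (kept binary to avoid overflow).
        reach = ((reach @ E) > 0).astype(int)
    return num / den if den > 0 else float("nan")

def r2_scores(X):
    """R^2 of linearly regressing each column of X on all other columns."""
    C = np.cov(X, rowvar=False)
    P = np.linalg.inv(C)
    return 1.0 - 1.0 / (np.diag(C) * np.diag(P))

# Usage on a three-node chain 0 -> 1 -> 2 with hypothetical statistics.
A = np.array([[0, 1, 0], [0, 0, 1], [0, 0, 0]])
score = tau_sortability([1.0, 2.0, 3.0], A)
```

Passing `np.var(X, axis=0)` as `scores` would give a variance-based score, and passing `r2_scores(X)` a score based on the coefficient of determination.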
D.3 Structure Learning Algorithms
To complement the interpretation of the results in Section 5, we provide some background on the structure learning methods we evaluate.
Notears (Zheng et al.,, 2018)
Notears uses continuous optimization to minimize the regularized mean-squared error (MSE) between the variables modeled by a linear SCM and the observations, while enforcing a differentiable acyclicity constraint. The objective function of Notears is given by , where and are the Frobenius and $\ell_1$ norms, respectively. After the objective is minimized, weights whose magnitude falls below a fixed threshold are set to zero.
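As an illustration, the sketch below writes out the ingredients of Notears as described by Zheng et al. (2018): a least-squares loss with an $\ell_1$ penalty, the acyclicity function $h(W) = \operatorname{tr}(e^{W \circ W}) - d$, and the final thresholding step. The function names are hypothetical. To stay NumPy-only, `acyclicity` truncates the exponential series after d terms, which is exact for DAGs and a positive lower bound of $h$ for cyclic graphs. It is a sketch of the building blocks, not the optimizer from the original implementation.

```python
import numpy as np

def notears_loss(W, X, lam):
    """Least-squares fit of X by X @ W plus an l1 penalty on the weights."""
    n = X.shape[0]
    resid = X - X @ W
    return 0.5 / n * np.sum(resid ** 2) + lam * np.abs(W).sum()

def acyclicity(W):
    """Truncated series for tr(exp(W * W)) - d; zero iff W encodes a DAG."""
    d = W.shape[0]
    M = W * W  # elementwise square, so all entries are non-negative
    P = np.eye(d)
    h, factorial = 0.0, 1.0
    for k in range(1, d + 1):
        P = P @ M
        factorial *= k
        h += np.trace(P) / factorial  # tr(M^k) > 0 iff a cycle of length k exists
    return h

def threshold_weights(W, t):
    """Set weights with magnitude below t to zero."""
    return np.where(np.abs(W) >= t, W, 0.0)

# Usage: a two-node DAG satisfies the constraint, a two-cycle does not.
W_dag = np.array([[0.0, 1.5], [0.0, 0.0]])
W_cyc = np.array([[0.0, 1.0], [1.0, 0.0]])
h_dag, h_cyc = acyclicity(W_dag), acyclicity(W_cyc)
```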
Avici (Lorch et al.,, 2022)
Avici is an amortized variational inference method that approximates the posterior distribution over causal structures given a dataset through a pretrained inference model. The variational approximation of Avici uses a fully-factored product of Bernoulli distributions for every possible graph edge. The inference model is a neural network that predicts the variational parameters of the Bernoulli distributions by minimizing the expected forward KL divergence between the true posterior and the approximation. To train the inference model, Avici can be optimized on any training distribution of (synthetic) dataset-graph pairs. Lorch et al., (2022) publish the pretrained parameters of inference models trained on standardized SCMs with linear and nonlinear mechanisms, which we evaluate in this work.
SortnRegress methods (Reisach et al.,, 2021, 2024)
The SortnRegress methods order the vertices by a chosen statistic and sparsely regress every node on all of its predecessors in the obtained order. They use Lasso regression with the Bayesian Information Criterion to learn the regression function for a given variable. Var-SortnRegress uses estimated marginal variances as the sorting criterion. $R^2$-SortnRegress uses the coefficient of determination estimated after performing a regression of every variable onto all remaining variables. Rand-SortnRegress orders the vertices randomly.
Appendix E Experimental Setup
E.1 Data
Causal mechanisms
We consider systems with additive noise, where
for a chosen function . The Linear systems used in these experiments have causal mechanisms as defined in Equation 1. To model nonlinear systems, we use smooth nonlinear functional mechanisms as used by Lorch et al., (2022). Specifically, the function that models the relationship between and its parents is sampled from a Gaussian Process
where is a squared exponential kernel with output and length scales and respectively. We can approximately express the function sample analytically using random Fourier features (Rahimi and Recht,, 2007) by sampling
where , , and . In this work, we use .
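A minimal sketch of drawing one such function with random Fourier features is given below, assuming the standard construction for a squared exponential kernel: Gaussian frequencies, uniform phases on $[0, 2\pi)$, and standard normal output weights. The name `sample_rff_function` and the parameter values in the usage lines are hypothetical and are not the settings used in the experiments.

```python
import numpy as np

def sample_rff_function(num_parents, length_scale, output_scale, num_features, rng):
    """Draw an approximate GP sample f: R^num_parents -> R via random Fourier features."""
    omega = rng.normal(size=(num_features, num_parents))      # random frequencies
    delta = rng.uniform(0.0, 2.0 * np.pi, size=num_features)  # random phases
    alpha = rng.normal(size=num_features)                      # random output weights

    def f(x_parents):
        # x_parents has shape (n, num_parents); the result has shape (n,).
        phase = x_parents @ omega.T / length_scale + delta
        return output_scale * np.sqrt(2.0 / num_features) * (np.cos(phase) @ alpha)

    return f

# Usage with illustrative (hypothetical) scales and feature count.
rng = np.random.default_rng(0)
f = sample_rff_function(num_parents=2, length_scale=7.0, output_scale=2.0,
                        num_features=50, rng=rng)
y = f(rng.normal(size=(10, 2)))
```

Under this construction, the variance of the sampled function value at any fixed input equals the squared output scale, which matches the prior variance of the Gaussian process.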
Generating a random model
Following prior work (Section 2), we sample random systems in any simulation performed in this work by first drawing a graph from the specified random graph distribution. Given the graph , we sample function parameters of the structural mechanisms over . For linear systems, we sample , where are fixed, i.i.d. for every graph edge. Similarly, for nonlinear systems, for every graph vertex, we draw the length scales and output scales with predefined .
Sampling data from a model
Given a graph , noise distribution , and a set of functions , we sample datapoints from an SCM by traversing in a topological ordering. For every vertex , we draw a noise sample . The sample for is then deterministically computed by from the exogenous and the parents of . To sample from a Standardized SCM, we draw a dataset from an SCM and standardize it. To sample from an iSCM, we use Algorithm 1.
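The three sampling procedures can be sketched for the linear case as follows. This is a minimal sketch, not a reproduction of Algorithm 1: it assumes Gaussian noise, a weight matrix whose index order is already a topological ordering, and that the iSCM standardizes each variable with its sample mean and standard deviation right after it is computed. The function name `sample_linear` and the chain example are hypothetical.

```python
import numpy as np

def sample_linear(W, n, rng, mode="scm", noise_std=1.0):
    """Sample n points from a linear additive-noise model.

    W[j, i] != 0 encodes an edge j -> i. W must be strictly upper triangular,
    so that 0, 1, ..., d - 1 is a topological ordering (assumption of this sketch).
    mode: "scm" (no standardization), "standardized" (post-hoc standardization),
          or "iscm" (standardize every variable during generation).
    """
    d = W.shape[0]
    X = np.zeros((n, d))
    for i in range(d):  # traverse the graph in topological order
        eps = rng.normal(0.0, noise_std, size=n)
        x = X @ W[:, i] + eps  # parents of i were already filled in
        if mode == "iscm":
            x = (x - x.mean()) / x.std()  # internal standardization
        X[:, i] = x
    if mode == "standardized":
        X = (X - X.mean(axis=0)) / X.std(axis=0)  # post-hoc standardization
    return X

# Usage: a five-node chain with all edge weights equal to 2.
rng = np.random.default_rng(0)
W = np.diag(np.full(4, 2.0), k=1)
X_scm = sample_linear(W, 2000, rng, mode="scm")
X_std = sample_linear(W, 2000, rng, mode="standardized")
X_iscm = sample_linear(W, 2000, rng, mode="iscm")
```

In this example, the marginal variances of `X_scm` grow along the chain, while both `X_std` and `X_iscm` have unit sample variance. The two differ in their correlations, because post-hoc standardization leaves the correlations of the underlying SCM unchanged.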
E.2 Experiment Configurations
Sortability
For Figures 4(a) and 14(a), we generate Erdős-Rényi graphs (Erdős and Rényi,, 1959) with expected number of edges per vertex equal to and , respectively. For Figures 4(b) and 14(b), we generate undirected scale-free graphs (Barabási and Albert,, 1999) with and edges per node respectively. Then, we order the graphs according to random topological orderings. We do not sample ordered scale-free graphs directly to avoid high sortability by in-degree, which may confound the results. For all four figures, we generate Linear systems with weights sampled from three possible distributions , or and noise sampled from . For every model configuration, we sample systems and data points each. We generate graphs of sizes .
Structure Learning (Section 5.2)
For Figures 5(a) and 12, we sample Linear systems with weights . Following Lorch et al., (2022), Nonlinear mechanisms have length scales and output scales . Both mechanisms are defined in Appendix E.1. For Figures 13(a) and 13(b), we generate Linear systems with weights and . For all four figures, we sample random and graphs with noise . For every model configuration, we sample systems and data points each.
Noise Transfer
For Figure 5(b) (top), we sample SCMs, standardized SCMs, and iSCMs with exactly the same underlying graph and weights sampled from . The noise variables are drawn from . Then, for every triple of SCM, standardized SCM, and iSCM that shares a graph and weights, we create two more SCMs with the same marginal variances as the SCM, but with the noise variances of the implied models of the standardized SCM and iSCM, respectively. Appendix E.5 provides a motivation and detailed explanation of this procedure. Figure 5(b) (top) shows the performance of Notears on the original SCMs and the two SCMs with transferred noise.
For Figure 5(b) (bottom), we sample multiple instances of standardized SCMs and iSCMs with weights drawn from and noise from . For every model instance, we approximate the density of the inverse of their implied noise variances using kernel density estimation. The figure shows the mean and standard deviation of the p.d.f. values over systems. For both figures, we use graphs.
E.3 Methods
Notears (Zheng et al.,, 2018)
To run Notears, we use the original implementation provided by the authors of Zheng et al., (2018) (Apache-2.0 license). Before benchmarking Notears, we run a hyperparameter search to calibrate the weight penalty () and threshold on held-out instances of each data generation method. The hyperparameters can be found in Appendix E.4.
Avici (Lorch et al.,, 2022)
To evaluate Avici, we use the code and model checkpoints provided by the authors of the method (MIT license). Specifically, we use the model trained on linear data to benchmark the method on Linear systems and the model trained on nonlinear data to benchmark on Nonlinear systems. We score an edge as predicted if the probability prediction by Avici is greater than . Since the parameters are pretrained, the method has otherwise no tuneable hyperparameters.
Sortabilities and SortnRegress methods (Reisach et al.,, 2021, 2024)
To compute the sortability metrics and run the SortnRegress baselines, we use the CausalDisco library (BSD-3-Clause license) created by the authors of the method. The algorithms require no tuneable hyperparameters.
E.4 Hyperparameter Selection
To run Notears, we need to specify the regularisation strength and a weight threshold for thresholding the final weights for graph structure prediction. To select these hyperparameters, we run a parameter search with and three possible values of the weight threshold . We perform the search on separate, held-out systems that follow the same configurations as the ones we present in our final experimental results. We run Notears times per configuration and choose the median score as the criterion for selecting the best hyperparameters. Table 1 presents all final hyperparameter configurations. For some hyperparameter configurations, in runs experienced numerical issues caused by the acyclicity constraint. However, this never occurs for the selected, optimal hyperparameters, either when performing the hyperparameter search or when running the reported experiments.
| Weight Distribution | Model | Regularization strength | Weight threshold | F1 (median) |
|---|---|---|---|---|
| | SCM | 0.05 | 0.20 | 0.97 |
| | Standardized SCM | 0.15 | 0.10 | 0.59 |
| | iSCM | 0.15 | 0.10 | 0.57 |
| | SCM | 0.00 | 0.30 | 0.98 |
| | Standardized SCM | 0.15 | 0.20 | 0.30 |
| | iSCM | 0.15 | 0.10 | 0.50 |
| | SCM | 0.05 | 0.30 | 0.98 |
| | Standardized SCM | 0.25 | 0.10 | 0.24 |
| | iSCM | 0.20 | 0.10 | 0.40 |

| Weight Distribution | Model | Regularization strength | Weight threshold | F1 (median) |
|---|---|---|---|---|
| | SCM | 0.10 | 0.10 | 0.99 |
| | Standardized SCM | 0.10 | 0.10 | 0.83 |
| | iSCM | 0.10 | 0.10 | 0.84 |
| | SCM | 0.05 | 0.30 | 0.94 |
| | Standardized SCM | 0.15 | 0.10 | 0.47 |
| | iSCM | 0.15 | 0.10 | 0.76 |
| | SCM | 0.10 | 0.30 | 0.82 |
| | Standardized SCM | 0.20 | 0.10 | 0.30 |
| | iSCM | 0.15 | 0.10 | 0.70 |

| Model | Regularization strength | Weight threshold | F1 (median) |
|---|---|---|---|
| SCM | 0.15 | 0.30 | 0.58 |
| Standardized SCM | 0.15 | 0.10 | 0.33 |
| iSCM | 0.15 | 0.20 | 0.42 |

| Model | Regularization strength | Weight threshold | F1 (median) |
|---|---|---|---|
| SCM | 0.30 | 0.30 | 0.50 |
| Standardized SCM | 0.15 | 0.10 | 0.43 |
| iSCM | 0.15 | 0.10 | 0.61 |

| Model | Regularization strength | Weight threshold | F1 (median) |
|---|---|---|---|
| Original | 0.05 | 0.30 | 0.96 |
| Noise from standardized SCM | 0.10 | 0.30 | 0.72 |
| Noise from iSCM | 0.05 | 0.30 | 0.82 |
E.5 Transferring Noise Variances While Keeping -Sortability Unchanged
Reisach et al., (2021) show that post-hoc standardization of SCM data strongly impairs the performance of Notears. When comparing the performance of Notears between data sampled from iSCMs and standardized SCMs, there are at least two factors that can affect the performance of Notears: low -sortability and the violation of the equal noise variance assumption. Our experiments in Figure 5(b) of Section 5 aim at isolating the effect of the latter. Specifically, we investigate whether Notears performs better on -sortable datasets that have the noise scale patterns implied when assuming SCMs generated the data, when in fact the data were sampled from iSCMs or standardized SCMs. To achieve this, we ensure that the -sortability metrics of the data sampled from the models are the same, here close to .
Given two linear SCMs and with the same underlying graph , our goal is to construct a system with the same marginal variances as (condition 1) and the same noise variances as (condition 2). For this task to be well-defined, we assume that the noise variances of the root variables in and are the same. The first step in constructing is to copy the noise variances from , so that for every .
This satisfies condition 2. Given this, we define as
where has variance . By construction, the condition of sharing the noise variances with and the marginal variances with is fulfilled for the root variables. For all the remaining variables, it holds that
which satisfies condition 1. Since the systems and have the same marginal variances, they have the same -sortability. In the noise transfer experiment of Figure 5(b), we transfer the noise variances from the implied models of iSCMs and standardized SCMs. To obtain the noise variances in the implied models, we divide the original noise variances (equal to ) by the estimated marginal variances of the corresponding variable before standardization, which we estimate from datapoints. For iSCMs, this corresponds to an empirical estimate of Equation (7).
E.6 Compute Resources
Our experiments were run on an internal cluster. All experiments in this work were computed using CPUs with GB of memory per CPU, with the exception of the Avici runs on graphs with vertices, which used GB per CPU. The data generation takes at most a few minutes on a single CPU, with the exception of the sortability results (Section 5.1). For the sortability results, it takes around minutes to generate the datasets for a single graph specification across all weight supports and graph sizes. This is due to a larger number of configurations and repetitions than in the other experiments. For a single graph specification and across all weight supports and graph sizes, it takes around hours to compute the sortability statistics on a single CPU. Running one execution of Notears (Avici) takes approximately min (min) for and min (min) for , respectively. The SortnRegress baselines run in less than min.
Appendix F Additional Experimental Results
F.1 Structure Learning
Figure 12 summarizes the structural Hamming distance (SHD) between the predicted and true graphs for the same datasets and algorithms as in Figure 5(a).
In Figures 13(a) and 13(b), we present the F1 scores and SHD attained by the structure learning algorithms on data of Linear iSCMs, SCMs, and standardized SCMs, across different weight distribution supports and graph sizes. We find that the difference in performance of Notears on data sampled from iSCM and standardized SCMs is larger for larger weight magnitudes and for bigger graphs. For smaller weights, the difference in the mean F1 score of Notears between the two standardization approaches is smaller, which is in line with our proposed explanation about the shifts of the implied noise variance distribution in Section 5.2.
In Figure 13(a), we also find that when weight magnitudes are below , -SortnRegress performs similarly for both standardized SCMs and iSCMs. We also observe this for Avici. Meanwhile, for larger weights with support extending above , these algorithms achieve significantly higher F1 scores on standardized SCMs. This suggests that our condition of for all edges in the statement of Theorem 3, concerning the identifiability of linear standardized SCMs, may have a more fundamental practical significance, rather than being merely an artifact of the analysis.
![Refer to caption](x16.png)
F.2 -Sortability
Figure 14 reports the -sortability statistics across varying graph sizes and weight distributions, but this time for the denser graphs ER(, ) and SF(, ). We again observe -sortability very close to for datasets sampled from iSCMs and high degrees of -sortability for data drawn from standardized SCMs. We omit standard SCMs from the plots, as the datasets coming from SCMs and their standardized versions have the same -sortability due to scale-invariance of the coefficient.
![Refer to caption](x29.png)
![Refer to caption](x30.png)
![Refer to caption](x31.png)
F.3 Covariance Matrices for Figure 1
Figure 15 visualizes the full mean absolute covariance (correlation) matrices of the systems presented in Figure 1. The matrix shows that the pattern of increasing mean absolute covariance in standardized SCMs is not only a feature of neighboring nodes, but it also occurs for vertex pairs further apart, though less strongly. This is not the case for iSCMs, where any two pairs of equally spaced vertices have equal covariances in expectation over the weight sampling distribution.
![Refer to caption](x32.png)