Trust Your $\nabla$: Gradient-based Intervention Targeting for Causal Discovery

Mateusz Olko1,2, Michał Zając3,∗, Aleksandra Nowak2,3,4,∗,
Nino Scherrer5, Yashas Annadani6, Stefan Bauer6,
Łukasz Kuciński2,8, Piotr Miłoś2,8,7
1Warsaw University, 2IDEAS NCBR,
3Jagiellonian University, Faculty of Mathematics and Computer Science,
4Jagiellonian University, Doctoral School of Exact and Natural Sciences,
5ETH Zurich, 6Helmholtz, TU Munich, 7deepsense.ai,
8Institute of Mathematics, Polish Academy of Sciences
∗These authors contributed equally. Corresponding author: [email protected]
Abstract

Inferring causal structure from data is a challenging task of fundamental importance in science. Often, observational data alone is not enough to uniquely identify a system’s causal structure. Interventional data can address this issue; however, acquiring these samples typically demands a considerable investment of time and physical or financial resources. In this work, we are concerned with the acquisition of interventional data in a targeted manner to minimize the number of required experiments. We propose a novel Gradient-based Intervention Targeting method, abbreviated GIT, that ‘trusts’ the gradient estimator of a gradient-based causal discovery framework to provide signals for the intervention targeting function. We provide extensive experiments on simulated and real-world datasets and demonstrate that GIT performs on par with competitive baselines, surpassing them in the low-data regime.

1 Introduction

Estimating causal structure from data, commonly known as causal discovery, is central to the progress of science (Pearl, 2009). Real-world systems can often be explained as a composition of smaller parts connected by causal relationships. Understanding this underlying structure is essential for making accurate predictions about the system’s behavior after a perturbation or treatment has been applied (Peters et al., 2016). Causal discovery methods have been successfully deployed in various fields, such as biology (Sachs et al., 2005; Triantafillou et al., 2017; Glymour et al., 2019), medicine (Shen et al., 2020; Castro et al., 2020; Wu et al., 2022), earth system science (Ebert-Uphoff and Deng, 2012), or neuroscience (Sanchez-Romero et al., 2019). In machine learning, causal decomposition has been shown to enable sample-efficient learning and fast adaptation to distribution shifts by only updating a subset of parameters (Bengio et al., 2020; Scherrer et al., 2022).

Observational data, that is, data obtained directly from the unperturbed system, are in general insufficient to identify a system’s causal structure; they only allow the structure to be determined up to the so-called Markov Equivalence Class (Spirtes et al., 2000a; Peters et al., 2017). To overcome this limited identifiability, causal discovery algorithms commonly leverage interventional data (Hauser and Bühlmann, 2012; Brouillard et al., 2020; Ke et al., 2019), which are acquired by gathering data from an experiment perturbing a part of the system (Spirtes et al., 2000b; Pearl, 2009). Because such experiments often require a significant amount of time and physical or financial resources, the field of experimental design (Lindley, 1956; Murphy, 2001; Tong and Koller, 2001) is concerned with acquiring interventional data in a targeted manner to minimize the number of required experiments.

Figure 1: Overview of GIT’s usage in a gradient-based causal discovery framework. The framework infers a posterior distribution over graphs from observational and interventional data (denoted as $\mathcal{D}_{obs}$ and $\mathcal{D}_{int}$) through gradient-based optimization. The distribution over graphs and the gradient estimator $\nabla\mathcal{L}(\cdot)$ are then used by GIT in order to score the intervention targets based on the magnitude of the estimated gradients. The intervention target with the highest score is then selected, upon which the intervention is performed. New interventional data $\mathcal{D}_{int}^{new}$ are then collected and the procedure is repeated.

In this work, we introduce a simple and effective experimental design algorithm called Gradient-based Intervention Targeting, or GIT for short (see Figure 1). GIT can be easily combined with various gradient-based causal discovery frameworks to provide an efficient active selection of intervention targets. Our method, which is grounded in ideas from active and curriculum learning (Settles et al., 2007; Graves et al., 2017; Ash et al., 2020), collects interventional data that induce the largest gradients on the parameters of the causal structure. GIT leverages the gradient-based nature of the underlying causal discovery framework and achieves better performance than the contemporary baselines.

Our contributions include:

  • We introduce GIT, which is, to our knowledge, the first gradient-based intervention targeting method. Due to its plug-and-play nature, our method can be easily combined with various gradient-based causal discovery frameworks.

  • Our extensive experiments on synthetic and real-world graphs demonstrate that GIT effectively reduces the amount of interventional data needed to discover the causal structure, and performs well in the low-data regime. This makes GIT a compelling option when access to interventional data is limited.

  • We provide a theoretical justification of GIT and a suite of analyses introspecting its behavior and performance.

2 Related Work

Experimental Design / Intervention Design. There are two major classes of methods for selecting optimal interventions for causal discovery. One class of approaches is based on graph-theoretical properties. Typically, a completed partially directed acyclic graph (CPDAG), describing an equivalence class of DAGs, is first specified. Then, either substructures, such as cliques or trees, are investigated and used to inform decisions (He and Geng, 2008; Eberhardt, 2008; Squires et al., 2020; Greenewald et al., 2019), or edges of a proposed graph are iteratively refined until reaching a prescribed budget (Ghassami et al., 2018, 2019; Kocaoglu et al., 2017; Lindgren et al., 2018). One limitation of graph-theoretical approaches is that misspecification of the CPDAG at the beginning of the process can deteriorate the final solution. Another class of methods is based on Bayesian Optimal Experiment Design (Lindley, 1956), which aims to select interventions with the highest mutual information (MI) between the observations and model parameters. MI is approximated in different ways: AIT (Scherrer et al., 2021) uses an F-score-inspired metric to implicitly approximate MI; CBED (Tigas et al., 2022) incorporates a BALD-like estimator (Houlsby et al., 2011); ABCD (Agrawal et al., 2019) uses an estimator based on weighted importance sampling. Although theoretically principled, computing mutual information suffers from approximation errors and model mismatches. Therefore, in this work, we explore using scores based on different principles.

Gradient-based Causal Structure Learning. The appealing properties of neural networks have sparked a flurry of gradient-based causal structure learning methods. The most prevalent approaches are self-supervised formulations that optimize a data-dependent scoring metric (for instance, penalized log-likelihood) to find the best causal graph $G$. Existing self-supervised methods that incorporate interventional data (or can be extended to do so) can be categorized based on the underlying optimization formulation into: (i) frameworks with a joint optimization objective (Brouillard et al., 2020; Lorch et al., 2021; Cundy et al., 2021; Annadani et al., 2021; Geffner et al., 2022; Deleu et al., 2022) and (ii) frameworks with alternating phases of optimization (Bengio et al., 2020; Ke et al., 2019; Lippe et al., 2022). While structural and functional parameters are optimized under a joint objective in the former, the latter splits the optimization into two phases with separate objectives. All the aforementioned methods allow evaluation of the gradient with respect to the structural and functional parameters on a batch of (real or hypothesized) interventional samples and can serve as a base framework for our proposed gradient-based intervention acquisition strategy.

Gradients in Active and Curriculum Learning. Gradients have been successfully used in previous work as a criterion for selecting which data to process. Settles et al. (2007) introduce Expected Gradient Length (EGL), computed under the current belief, as a criterion for active learning. A batch active learning method introduced in Ash et al. (2020) also targets data points with high gradient magnitude, including uncertainty and diversity in the decision. In the area of curriculum learning, Graves et al. (2017) consider Gradient Prediction Gain (GPG), which is defined as the gradient’s magnitude and is meant to be a proxy for expected learning progress. We take inspiration from those approaches to propose a novel usage of the gradient criterion in the field of causal discovery.

3 Preliminaries

3.1 Structural Causal Models and Causal Structure Discovery

Causal relationships can be formalized using structural causal models (SCM) (Peters et al., 2017). Each of the endogenous variables $X=(X_1,\dots,X_n)$ is expressed as a function $X_i=f_i(PA_i,U_i)$ of its direct causes $PA_i\subseteq X$ and an external independent noise $U_i$. It is assumed that the assignments are acyclic and thus associated with a directed acyclic graph $G=(V,E)$. The nodes $V=\{1,\dots,n\}$ represent the random variables and the edges correspond to the direct causes, that is, $(i,j)\in E$ if and only if $X_i\in PA_j$. The joint distribution factorizes according to

$$\mathbb{P}(X_1,\dots,X_n)=\prod_{i=1}^{n}\mathbb{P}(X_i|PA_i). \tag{1}$$

Causal structure discovery aims to recover the ground truth graph $G$. The solution to this problem is not uniquely defined when having access only to observational data from the ground truth distribution $\mathbb{P}$. Formally, it can be determined solely up to a Markov Equivalence Class (MEC) (Spirtes et al., 2000b; Peters et al., 2017) without additional restrictive assumptions. To achieve identifiability, data from additional experiments, called interventions, need to be gathered.

A single-node intervention on $X_i$ replaces the conditional distribution $\mathbb{P}(X_i|PA_i)$ with a new distribution denoted as $\widetilde{\mathbb{P}}(X_i|PA_i)$, yielding a so-called interventional distribution:

$$\mathbb{P}_i(X)\triangleq\widetilde{\mathbb{P}}(X_i|PA_i)\prod_{j\neq i}\mathbb{P}(X_j|PA_j). \tag{2}$$

The node $i\in V$ is called the intervention target. An intervention that removes the dependency of a variable $X_i$ on its parents, yielding $\widetilde{\mathbb{P}}(X_i|PA_i)=\widetilde{\mathbb{P}}(X_i)$, is called hard. In this paper, we use data gathered by performing single-node interventions.
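To make the sampling semantics concrete, the following minimal sketch draws samples from a toy linear SCM by ancestral sampling and applies a hard single-node intervention. The graph, the weights, and the names `PARENTS`, `WEIGHTS`, `ORDER`, and `sample` are hypothetical and chosen only for illustration; they do not come from the experiments in this paper.

```python
import random

# Hypothetical 3-variable linear SCM: X1 -> X2, X1 -> X3, X2 -> X3.
# PARENTS[i] lists the direct causes PA_i; the weights are illustrative.
PARENTS = {1: [], 2: [1], 3: [1, 2]}
WEIGHTS = {(1, 2): 2.0, (1, 3): -1.0, (2, 3): 0.5}
ORDER = [1, 2, 3]  # a topological order of the DAG


def sample(rng, target=None, value_sampler=None):
    """Draw one joint sample by ancestral sampling.

    If `target` is given, X_target is drawn from `value_sampler(rng)` instead
    of its structural assignment, i.e. a hard intervention that ignores the
    parents of the target node.
    """
    x = {}
    for i in ORDER:
        if i == target:
            x[i] = value_sampler(rng)
        else:
            noise = rng.gauss(0.0, 1.0)  # independent noise U_i
            x[i] = sum(WEIGHTS[(p, i)] * x[p] for p in PARENTS[i]) + noise
    return x
```

Calling `sample(rng)` yields an observational sample, while `sample(rng, target=2, value_sampler=lambda r: 5.0)` yields a sample from the interventional distribution in which $X_2$ is fixed independently of $X_1$; the mechanisms of the other variables are left unchanged, as in Equation 2.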

3.2 Online Causal Discovery and Targeting Methods

Algorithm 1 Online Causal Discovery
Input: causal discovery algorithm $\mathcal{A}$ (e.g., ENCO, see Sec 4.1), intervention targeting method $\mathcal{M}$, number of data acquisition rounds $T$, observational dataset $\mathcal{D}_{obs}$
Output: final parameters of graph model: $\varphi_T$
1:  $\mathcal{D}_{int}\leftarrow\varnothing$
2:  Fit graph model $\varphi_0$ with algorithm $\mathcal{A}$ on $\mathcal{D}_{obs}$
3:  for round $i=1,2,\ldots,T$ do
4:     $I\leftarrow$ generate intervention targets using $\mathcal{M}$
5:     $D_{int}^{I}\leftarrow$ query for data from interventions $I$
6:     $\mathcal{D}_{int}\leftarrow\mathcal{D}_{int}\cup D_{int}^{I}$
7:     Fit $\varphi_i$ with algorithm $\mathcal{A}$ on $\mathcal{D}_{int}$ and $\mathcal{D}_{obs}$
8:  end for

In this work, we consider an online causal discovery procedure outlined in Algorithm 1. Given a causal discovery algorithm $\mathcal{A}$, the graph model $\varphi_0$ is fitted using observational data $\mathcal{D}_{obs}$. Following that, batches of interventional samples are acquired iteratively and are used by the algorithm to improve the belief about the causal structure (line 7). Intervention targets are chosen by the intervention targeting method $\mathcal{M}$ to optimize the overall performance, taking into account the current belief about the graph structure encoded in $\varphi_{i-1}$. Below we discuss two popular choices for the method $\mathcal{M}$ (with more details deferred to Appendix D).
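The control flow of Algorithm 1 can be sketched as a short loop. This is only a skeleton under stated assumptions: the callables `fit`, `select_targets`, and `query` are hypothetical placeholders standing in for the causal discovery algorithm, the targeting method, and the experiment that returns interventional samples, respectively.

```python
from typing import Callable, List


def online_causal_discovery(
    fit: Callable[[list, list], object],
    select_targets: Callable[[object], List[int]],
    query: Callable[[int], list],
    d_obs: list,
    rounds: int,
):
    """Skeleton of the online loop; all callables are hypothetical."""
    d_int: list = []
    model = fit(d_obs, d_int)  # initial graph model from observational data
    for _ in range(rounds):
        # The targeting method sees the belief from the previous round.
        targets = select_targets(model)
        for t in targets:
            d_int.extend(query(t))  # acquire new interventional samples
        model = fit(d_obs, d_int)  # refit on all data gathered so far
    return model, d_int
```

Any targeting method, including the ones discussed in this paper, would enter this loop only through `select_targets`, which is what makes the targeting component exchangeable.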

Active Intervention Targeting (AIT)

AIT selects the intervention target according to an $F$-test inspired criterion (Scherrer et al., 2021). It assumes that the causal discovery algorithm $\mathcal{A}$ maintains a posterior distribution over graphs (by design or using bootstrapping). To select an intervention target, a set of graphs is sampled from the posterior distribution, and interventional sample distributions are generated by intervening on each of the sampled graphs. Each potential intervention target is assigned a score by measuring the discrepancy across the corresponding interventional sample distributions.
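One way such a discrepancy could be measured is an unnormalized $F$-statistic-style ratio of between-graph to within-graph variance. The sketch below is an illustration under that assumption only; the function name `discrepancy_score` is hypothetical, and the exact AIT criterion is given in Scherrer et al. (2021) and Appendix D and may differ in its details.

```python
def discrepancy_score(samples_per_graph):
    """Ratio of between-graph to within-graph variation (illustrative).

    samples_per_graph: one entry per sampled graph; each entry is a list of
    samples, and each sample is a list of floats of equal length.
    """
    dim = len(samples_per_graph[0][0])

    def mean(rows):
        return [sum(r[d] for r in rows) / len(rows) for d in range(dim)]

    def sq_dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    group_means = [mean(rows) for rows in samples_per_graph]
    grand_mean = mean([r for rows in samples_per_graph for r in rows])
    # Spread of the per-graph means around the overall mean.
    between = sum(sq_dist(m, grand_mean) for m in group_means)
    # Spread of the samples around their own graph's mean.
    within = sum(
        sq_dist(r, m)
        for rows, m in zip(samples_per_graph, group_means)
        for r in rows
    )
    return between / within if within > 0 else float("inf")
```

A target for which the sampled graphs disagree strongly about the interventional outcome gets a high ratio, whereas a target on which all graphs predict the same distribution gets a ratio near zero.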

CBED targeting

Another approach to causal discovery is approximating the posterior distribution over the possible causal DAGs. This allows using the framework of Bayesian Optimal Experimental Design to select the most informative intervention (experiment). The score of a new experiment is given by the mutual information (MI) between the interventional data due to the experiment and the current belief about the graph structure. Hence, such an approach requires estimating MI. For instance, Causal Bayesian Experimental Design (CBED) (Tigas et al., 2022) uses a BALD-like estimator (Houlsby et al., 2011) to sample batches of interventional targets.
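For orientation, the mutual information between a hypothetical outcome $Y$ of an intervention on target $i$ and the graph $G$ can be written as a difference of entropies. The notation below ($Y$, $\mathcal{D}$, $p(G\mid\mathcal{D})$) is an illustrative shorthand and an assumption of this sketch; the particular estimator used in CBED may differ in its details:

$$I(Y;G\mid\mathcal{D},i)=H\Big[\mathbb{E}_{G\sim p(G\mid\mathcal{D})}\,p(Y\mid G,i)\Big]-\mathbb{E}_{G\sim p(G\mid\mathcal{D})}\,H\big[p(Y\mid G,i)\big].$$

The first term is the entropy of the model-averaged prediction and the second is the average per-graph entropy. BALD-style estimators approximate both terms with samples from the current belief over graphs, which is where approximation error can enter.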

4 GIT method

In this work, we present a new intervention targeting method, GIT. GIT chooses intervention targets that induce the largest update of the parameters modeling the causal structure. Inspired by the hallucinated gradients exploited by Ash et al. (2020), we calculate gradients on imaginary data generated by the causal model to score possible interventions for real data acquisition.

To formally introduce our method, we first describe the requirements that need to be fulfilled by a causal algorithm $\mathcal{A}$ in order to use it with GIT. We then explain how GIT works and follow up with a discussion about causal assumptions and theoretical justification of our approach. Finally, in Section 4.1, we present a practical implementation of our method with a causal discovery algorithm $\mathcal{A}$, using a popular ENCO algorithm as an example.

Requirements for causal discovery algorithm $\mathcal{A}$.

The intervention targeting method GIT can be coupled with any gradient-based causal discovery algorithm $\mathcal{A}$ (see Algorithm 1) that fulfills the following conditions:

  1. $\mathcal{A}$ models a distribution over the causal DAGs, denoted by a family of probability measures $\mathbb{P}_{\rho}(G)$ parameterized by $\rho$, that allows sampling.

  2. For each causal graph $G$, $\mathcal{A}$ maintains a corresponding family of conditional distributions, $\mathbb{P}_{G,\phi}(X_i|PA_{(i,G)})$, parametrized by $\phi$, which induces the joint distribution $\mathbb{P}_{G,\phi}$:

     $$\mathbb{P}_{G,\phi}(X)\triangleq\prod_{i}\mathbb{P}_{G,\phi}\left(X_i|PA_{(i,G)}\right). \tag{3}$$

     If $G$ corresponds to the ground truth graph, $\mathbb{P}_{G,\phi}$ approximates the ground truth distribution over $X$.

  3. $\mathcal{A}$ gives access to its loss function $\mathcal{L}$ and gradient of the loss function $\nabla_{\rho}\mathcal{L}$ with respect to $\rho$.

These requirements are mildly restrictive and they are fulfilled by many gradient-based discovery methods (for instance, ENCO (Lippe et al., 2022), SDI (Ke et al., 2019), DiBS (Lorch et al., 2021), DCDI (Brouillard et al., 2020) or DECI (Geffner et al., 2022)).

Method.

GIT scores each possible intervention target by calculating the expected magnitude of the gradient using imaginary interventional data generated by the causal model. Gradient magnitude serves as a proxy for the size of the update that can be induced on the parameters of the causal model. The method picks the intervention that has the highest score. Formally, for a given intervention $i\in V$ we define its score $s_i$ as follows:

$$s_i\triangleq\mathbb{E}_{X\sim\mathbb{P}_{\rho,\phi,i}}\|\nabla_{\rho}\mathcal{L}(X)\|. \tag{4}$$

Note that the expected value is computed with respect to the interventional distribution coming from the model, rather than the ground truth, defined as:

$$\mathbb{P}_{\rho,\phi,i}(X)\triangleq\sum_{G}\mathbb{P}_{\rho}(G)\,\mathbb{P}_{G,\phi,i}(X). \tag{5}$$

The summation in Equation 5 is taken over all DAGs and $\mathbb{P}_{G,\phi,i}$ corresponds to the joint distribution from the model for graph $G$:

$$\mathbb{P}_{G,\phi,i}(X)\triangleq\widetilde{\mathbb{P}}(X_i|PA_i)\prod_{j\neq i}\mathbb{P}_{G,\phi}(X_j|PA_{(j,G)}). \tag{6}$$

The computational procedure of GIT’s intervention target selection is listed in Algorithm 2. The expected value in $s_i$ is approximated using the Monte Carlo method, see line 4 of Algorithm 2. We also use a version of Algorithm 2 where real interventional data are used in line 3 (instead of the imaginary ones from the model) and call it GIT-privileged. GIT-privileged serves as a soft upper bound in our analysis.

Algorithm 2 GIT’s Intervention target selection
Input: parameters $\rho$ of distribution over graphs, functional parameters $\phi$, loss function $\mathcal{L}$, graph nodes $V$
Output: batch of intervention targets to execute: $I$
1:  $\mathcal{G}\leftarrow$ sample a set of DAGs according to $\mathbb{P}_{\rho}(G)$
2:  for intervention target $i\in V$ do
3:     $\mathcal{D}_{G,i}\leftarrow$ sample batch of data according to $\mathbb{P}_{G,\phi,i}$
4:     $s_i\leftarrow\frac{1}{|\mathcal{G}|}\sum_{G\in\mathcal{G}}\frac{1}{|\mathcal{D}_{G,i}|}\sum_{X\in\mathcal{D}_{G,i}}\lVert\nabla_{\rho}\mathcal{L}(X)\rVert$
5:  end for
6:  $I\leftarrow$ select a batch of targets with highest scores $s_i$
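
The Monte Carlo scoring step of Algorithm 2 can be sketched in a framework-agnostic way. This is a minimal sketch, not the implementation used in our experiments: the callables `sample_interventional` (draws imaginary data from the model for a graph and a target) and `grad_rho` (returns the per-sample gradient of the loss with respect to the structural parameters) are hypothetical interfaces that the underlying causal discovery framework would have to provide.

```python
import math
from typing import Callable, Dict, List, Sequence


def git_scores(
    graphs: Sequence[object],
    nodes: Sequence[int],
    sample_interventional: Callable[[object, int], List[object]],
    grad_rho: Callable[[object], Sequence[float]],
) -> Dict[int, float]:
    """Average per-sample gradient norm over sampled graphs, per target."""
    scores = {}
    for i in nodes:
        total = 0.0
        for g in graphs:
            batch = sample_interventional(g, i)  # imaginary data for (g, i)
            norms = [
                math.sqrt(sum(c * c for c in grad_rho(x))) for x in batch
            ]
            total += sum(norms) / len(batch)
        scores[i] = total / len(graphs)
    return scores


def select_targets(scores: Dict[int, float], batch_size: int = 1) -> List[int]:
    """Return the `batch_size` targets with the highest scores."""
    return sorted(scores, key=scores.get, reverse=True)[:batch_size]
```

Replacing `sample_interventional` with a function that returns real interventional data for target `i` would give the GIT-privileged variant under the same interface.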
Assumptions of GIT.

From the causal perspective, GIT relies exclusively on the Markov property assumption, which allows factorization of the joint distribution (see Equation 1). However, GIT, as a plug-and-play extension for causal discovery algorithms $\mathcal{A}$, inherits their assumptions. This may include, for instance, causal sufficiency or faithfulness. Our method does not require any additional assumptions on the variables $X_i$ and allows for both discrete and continuous setups.

Theoretical justification of GIT.

We show the convergence of GIT in two contexts. First, we prove that the main setup of this paper, i.e., GIT with ENCO (Lippe et al., 2022), described in Section 4.1, converges. The detailed result can be found in Appendix B, but the gist of the argument is that vertices for which the model structure is not aligned with the ground truth will have non-trivial gradients and hence will be queried by GIT, allowing the model to improve. Moreover, we show empirically that GIT gradients are well correlated with the principled GPG signal of GIT-privileged, see Appendix F.6. Second, we show that given any convergent causal discovery algorithm, GIT converges if we allow a uniform sampling of interventions with small probability $\epsilon>0$, see Appendix A. We call this approach $\epsilon$-greedy GIT. Importantly, on a finite sample with small enough $\epsilon$, GIT and $\epsilon$-greedy GIT are statistically indistinguishable.
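The $\epsilon$-greedy variant amounts to a small change in the final selection step. The sketch below assumes the scores have already been computed and uses a hypothetical function name, `epsilon_greedy_target`; it illustrates the selection rule only and is not tied to the proof in Appendix A.

```python
import random


def epsilon_greedy_target(scores, epsilon, rng):
    """Pick a uniformly random target with probability epsilon,
    otherwise the target with the highest score."""
    if rng.random() < epsilon:
        return rng.choice(sorted(scores))  # uniform over all targets
    return max(scores, key=scores.get)  # greedy choice
```

With `epsilon` set to 0 this reduces to the greedy rule, which is why the two variants behave alike on a finite number of rounds when `epsilon` is small.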

4.1 Applicability to ENCO

We choose to use ENCO as the gradient-based causal discovery framework $\mathcal{A}$ in our main experiments (recall Algorithm 1) due to its strong empirical results and good computational performance on GPUs. ENCO maintains a parameterized distribution over graph structures, with the so-called structural parameters $\{\rho_{i,j}\}_{i,j}$ representing the adjacency matrix and a set of parameters modeling the functional dependencies, $\phi$. The structural parameters, $\rho_{i,j}$, are factorized into an edge existence parameter, $\gamma_{i,j}$, and an edge orientation parameter, $\theta_{i,j}=-\theta_{j,i}$.
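As an illustration of this factorization, the sketch below turns the two sets of parameters into edge probabilities, assuming the parameterization from the ENCO paper as we understand it, in which the probability of the edge $i\to j$ is $\sigma(\gamma_{i,j})\,\sigma(\theta_{i,j})$ with $\sigma$ the logistic sigmoid. The function names are hypothetical and the details should be checked against Appendix C.1.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def edge_probabilities(gamma, theta):
    """Edge probabilities from existence (gamma) and orientation (theta).

    gamma, theta: n x n nested lists; the caller is responsible for
    theta[i][j] == -theta[j][i]. Self-loops get probability 0.
    """
    n = len(gamma)
    return [
        [
            0.0 if i == j else sigmoid(gamma[i][j]) * sigmoid(theta[i][j])
            for j in range(n)
        ]
        for i in range(n)
    ]
```

Because the orientation parameters are antisymmetric, a large positive $\theta_{i,j}$ pushes the probability of $i\to j$ up and that of $j\to i$ down at the same time, while $\gamma_{i,j}$ controls whether an edge between the two nodes exists at all.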

The parameters are updated by iteratively alternating between two optimization phases. The goal of the first phase is to learn functions $f_{\phi_i}\left(x_i|PA_{(i,G)}\right)$, which model the conditional density of $\mathbb{P}\left(X_i|PA_{(i,G)}\right)$. The training objective is the log-likelihood loss. The second phase aims to update the parametrized edge probabilities $\rho_{i,j}$. To this end, ENCO collects a data sample from a mixture of interventional distributions denoted by $\mathbb{P}_I$. The graph parameters are optimized by minimizing $\mathbb{E}_{X\sim\mathbb{P}_I}L_{\text{graph}}(X)$ where:

$$L_{\text{graph}}(X)\triangleq\mathbb{E}_{G\sim P_{\gamma,\theta}}\bigg[\sum_{i=1}^{n}L_G(X_i)\bigg],\quad L_G(x_i)\triangleq-\log f_{\phi_i}\left(x_i|PA_{(i,G)}\right). \tag{7}$$

For a detailed description of the method, distributions, and the estimators see Appendix C.1.
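To make Equation 7 concrete, the following is a minimal sketch of a Monte Carlo estimate of $L_{\text{graph}}$ for a single data point. It rests on several assumptions that are not part of the text above: edges are sampled independently with ENCO-style probabilities $\sigma(\gamma_{ij})\cdot\sigma(\theta_{ij})$ (which, unlike the sampler used with GIT, does not enforce acyclicity), and `neg_log_lik` is a hypothetical stand-in for $-\log f_{\phi_{i}}(x_{i}|PA_{(i,G)})$. All function names are illustrative.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sample_graph(gamma, theta, rng):
    """Sample an adjacency matrix; adj[i][j] == 1 means an edge i -> j.

    Assumption: independent edges with probability sigmoid(gamma) * sigmoid(theta).
    """
    n = len(gamma)
    return [
        [
            int(i != j and rng.random() < sigmoid(gamma[i][j]) * sigmoid(theta[i][j]))
            for j in range(n)
        ]
        for i in range(n)
    ]


def estimate_l_graph(x, gamma, theta, neg_log_lik, num_graphs, rng):
    """Average the summed per-variable losses L_G(x_i) over sampled graphs."""
    total = 0.0
    for _ in range(num_graphs):
        adj = sample_graph(gamma, theta, rng)
        for i in range(len(x)):
            parents = [j for j in range(len(x)) if adj[j][i] == 1]
            total += neg_log_lik(i, x, parents)
    return total / num_graphs


def uniform_neg_log_lik(i, x, parents, num_categories=10):
    """Hypothetical toy model: every variable is uniform over its categories."""
    return math.log(num_categories)


rng = random.Random(0)
zeros = [[0.0] * 3 for _ in range(3)]
estimate = estimate_l_graph([0, 1, 2], zeros, zeros, uniform_neg_log_lik, 20, rng)
print(estimate)  # about 6.908, i.e. 3 * log(10) for the toy uniform model
```

In a real setup the toy density would be replaced by the learned networks $f_{\phi_{i}}$, and the gradients with respect to $\gamma$ and $\theta$ would come from the estimators described in Appendix C.1.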

GIT with ENCO details.

The loss function $\mathcal{L}$ utilized by GIT is $L_{\text{graph}}$. We incorporate information from both structural parameters and use $\lVert\nabla_{\gamma}L_{\text{graph}}(X)\rVert^{2}+\lVert\nabla_{\theta}L_{\text{graph}}(X)\rVert^{2}$ to compute the score for the intervention $i$ in line 4 of Algorithm 2. To sample DAGs from the current graph distribution (line 1 of Algorithm 2), we use the two-phase sampling procedure proposed by Scherrer et al. (2021, Section 3.2), as it is scalable and guarantees DAG-ness by construction, as opposed to Gibbs sampling or rejection sampling techniques.
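The scoring rule above can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' implementation: the gradients are assumed to be precomputed (for example by automatic differentiation) and passed in as flat lists, one pair per sampled graph, the per-graph scores are averaged, and the target with the largest average score is selected. The names `git_score` and `select_target` are hypothetical.

```python
def squared_norm(vec):
    """Squared Euclidean norm of a flat list of gradient entries."""
    return sum(v * v for v in vec)


def git_score(grad_gamma, grad_theta):
    """||grad_gamma L_graph||^2 + ||grad_theta L_graph||^2 for one sampled batch."""
    return squared_norm(grad_gamma) + squared_norm(grad_theta)


def select_target(grads_per_target):
    """Pick the candidate target with the largest mean score.

    grads_per_target maps a target index to a list of
    (grad_gamma, grad_theta) pairs, one pair per sampled graph (assumption).
    """
    mean_scores = {
        target: sum(git_score(gg, gt) for gg, gt in pairs) / len(pairs)
        for target, pairs in grads_per_target.items()
    }
    best = max(mean_scores, key=mean_scores.get)
    return best, mean_scores


# Toy gradients for two candidate targets and two sampled graphs each.
toy = {
    0: [([0.1, 0.0], [0.2, 0.0]), ([0.0, 0.1], [0.0, 0.0])],
    1: [([1.0, 0.5], [0.0, 0.5]), ([0.5, 0.5], [0.5, 0.0])],
}
best, scores = select_target(toy)
print(best)  # 1
```

Summing the two squared norms treats the existence parameters $\gamma$ and the orientation parameters $\theta$ symmetrically, so a target scores highly if data from it would move either set of parameters substantially.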

5 Experiments

We compare GIT against the following baselines: AIT, CBED, Random, and GIT-privileged. AIT and CBED are competitive intervention acquisition methods for gradient-based causal discovery (which we discussed in Section 3.2). The Random method selects interventions uniformly in a round-robin fashion. (At every step, a target node is chosen uniformly at random from the set of not yet visited nodes. After every node has been selected, the visitation counts are reset to 0.) The last approach, GIT-privileged, is the oracle method described in Section 4.
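The round-robin selection used by the Random baseline can be sketched as follows. The class name `RoundRobinRandom` and its interface are hypothetical and chosen only for illustration; the logic follows the description above, drawing uniformly from the nodes not yet visited and resetting once all nodes have been used.

```python
import random


class RoundRobinRandom:
    """Uniformly random target selection without repeats within a cycle."""

    def __init__(self, num_nodes, seed=None):
        self.num_nodes = num_nodes
        self.rng = random.Random(seed)
        self.unvisited = list(range(num_nodes))

    def next_target(self):
        # Start a new cycle once every node has been selected.
        if not self.unvisited:
            self.unvisited = list(range(self.num_nodes))
        idx = self.rng.randrange(len(self.unvisited))
        return self.unvisited.pop(idx)


selector = RoundRobinRandom(num_nodes=5, seed=0)
targets = [selector.next_target() for _ in range(10)]
print(targets[:5], targets[5:])  # each half is a permutation of 0..4
```

Compared with fully independent uniform sampling, this scheme guarantees that every node is intervened on once per cycle, which makes it a stronger baseline when the number of rounds is small.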

Our main result is that GIT brings substantial improvement in the low data regime, being the best among benchmarked methods for all considered synthetic graph classes and half of the considered real graphs in terms of the AUSHD metric (see Equation 9). On the remaining real graphs, our approach performs similarly to the baseline methods. Notably, in most cases, GIT surpasses MI-based approaches: CBED and AIT. We present the summary in Table 1. This result is accompanied by an in-depth analysis of the relationships between different strategies and the distributions of the selected intervention targets. Additional results in the DiBS framework (Lorch et al., 2021) with continuous data are presented in Appendix F.1.

Table 1: We count the number of setups (out of 24) where a given method was best or comparable to the other methods (AIT, CBED, Random, and GIT; GIT-privileged was not compared against), based on 90% confidence intervals for SHD and AUSHD. Each entry shows the total count, broken down into two data regimes, $N=1056$ and $N=3200$, respectively, presented in parentheses.

| | AIT | CBED | Random | GIT (ours) | GIT-privileged |
|---|---|---|---|---|---|
| mean AUSHD | 6 (2 + 4) | 6 (4 + 2) | 12 (5 + 7) | 18 (11 + 7) | 24 (12 + 12) |
| mean SHD | 10 (4 + 6) | 7 (4 + 3) | 22 (12 + 10) | 17 (10 + 7) | 24 (12 + 12) |

5.1 Experimental Setup

We evaluate the different intervention targeting methods in online causal discovery, see Algorithm 1. We utilize an observational dataset of size $5000$. We use $T=100$ rounds, in each one acquiring an interventional batch of $32$ samples. We distinguish two regimes: regular, with all $100$ rounds ($N=3200$ interventional samples), and low, with $33$ rounds ($N=1056$ interventional samples). We use $|\mathcal{G}|=50$ graphs and $|\mathcal{D}_{G,i}|=128$ data samples from each graph for the Monte-Carlo approximation of the GIT score. We tested different sizes of the Monte-Carlo sample and found that it does not have a major impact on performance, see Appendix F.4. For all experiments in this section we assume, following the approach of Lippe et al. (2022), that all interventions are single-node, hard, and change the conditional distribution of the intervened node to uniform.
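The intervention model described above (single-node, hard, uniform) can be illustrated with ancestral sampling on a small categorical network. This is a sketch under assumptions: nodes are listed in a topological order, and `cond_probs` is a hypothetical list of callables that map parent values to category probabilities. It is not the data-generation code of Lippe et al. (2022).

```python
import random


def sample_with_intervention(parents, cond_probs, num_categories, target, rng):
    """Draw one sample by ancestral sampling.

    parents[i]: parent indices of node i (parents precede children).
    cond_probs[i]: callable mapping a tuple of parent values to a list of
        category probabilities for node i (hypothetical interface).
    target: index of the intervened node, or None for observational data.
    """
    sample = []
    for i in range(len(parents)):
        if i == target:
            # Hard intervention: ignore the parents and draw uniformly.
            value = rng.randrange(num_categories)
        else:
            probs = cond_probs[i](tuple(sample[p] for p in parents[i]))
            value = rng.choices(range(num_categories), weights=probs)[0]
        sample.append(value)
    return sample


# Toy chain 0 -> 1 with 3 categories: node 0 is always 0, node 1 copies node 0.
parents = [[], [0]]
cond_probs = [
    lambda pa: [1.0, 0.0, 0.0],
    lambda pa: [1.0 if k == pa[0] else 0.0 for k in range(3)],
]
rng = random.Random(0)
observational = [
    sample_with_intervention(parents, cond_probs, 3, None, rng) for _ in range(32)
]
interventional = [
    sample_with_intervention(parents, cond_probs, 3, 1, rng) for _ in range(32)
]
```

In the observational batch node 1 always equals node 0, whereas under the intervention on node 1 its value no longer depends on node 0. With $32$ samples per batch, $33$ such rounds give the $N=1056$ samples of the low regime and $100$ rounds give $N=3200$.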

Datasets

We use synthetic and real-world datasets. The synthetic dataset consists of bidiag, chain, collider, jungle, fulldag and random DAGs, each with $25$ nodes. The variable distributions are categorical, with $10$ categories. (We create the datasets using the code provided by Lippe et al. (2022). See Appendix E.1 for details.) The real-world dataset consists of alarm, asia, cancer, child, earthquake, and sachs graphs, taken from the BnLearn repository (Scutari, 2010). Both synthetic and real-world graphs are commonly used as benchmarking datasets (Ke et al., 2019; Lippe et al., 2022; Scherrer et al., 2021).

Metrics

We use the Structural Hamming Distance (SHD) (Tsamardinos et al., 2006) between the predicted and the ground truth graph as the main metric. SHD between two directed graphs is defined as the number of edges that need to be added, removed, or reversed in order to transform one graph into the other. More precisely, for two DAGs represented as adjacency matrices $c$ and $c^{\prime}$,

$$\text{SHD}(c,c^{\prime}):=\sum_{i>j}\mathbf{1}\left(c_{ij}+c_{ji}\neq c_{ij}^{\prime}+c_{ji}^{\prime}\text{ or }c_{ij}\neq c_{ij}^{\prime}\right). \tag{8}$$
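As an illustration of Equation 8, the following minimal sketch computes SHD for adjacency matrices given as nested lists, where `c[i][j] == 1` denotes an edge from node `i` to node `j` (this encoding is an assumption made for the example). Each unordered pair of nodes contributes at most one unit, so a reversed edge counts once rather than twice.

```python
def shd(c, c_prime):
    """Structural Hamming Distance between two DAG adjacency matrices."""
    n = len(c)
    distance = 0
    for i in range(n):
        for j in range(i):  # every unordered pair with i > j exactly once
            skeleton_differs = c[i][j] + c[j][i] != c_prime[i][j] + c_prime[j][i]
            direction_differs = c[i][j] != c_prime[i][j]
            if skeleton_differs or direction_differs:
                distance += 1
    return distance


# Ground truth: chain 0 -> 1 -> 2.
truth = [[0, 1, 0],
         [0, 0, 1],
         [0, 0, 0]]
# Prediction: edge 0 -> 1 reversed, 1 -> 2 correct, extra edge 0 -> 2.
pred = [[0, 0, 1],
        [1, 0, 1],
        [0, 0, 0]]
print(shd(truth, pred))  # 2: one reversal plus one extra edge
```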

In the experiments, we always compute SHD between the predicted and the ground truth graph. In order to aggregate SHD values over different data regimes, we introduce the area under the SHD curve (AUSHD):

$$\text{AUSHD}_{m}^{T}:=\frac{1}{T}\sum_{t=1}^{T}\text{SHD}_{m}^{t},\quad\text{SHD}_{m}^{t}:=\text{SHD}(c_{gt},c_{m,t}), \tag{9}$$

where $m$ is the method used, $T$ is the number of interventional data batches, $c_{gt}$ is the ground truth graph, and $c_{m,t}$ is the graph fitted by method $m$ using $t$ interventional data batches. Intuitively, for small to moderate values of $T$, AUSHD captures a method's speed of convergence: the faster the SHD converges to $0$, the smaller the area. For large values of $T$, AUSHD measures the asymptotic convergence. Smaller values indicate a better method. For visualizations, we use the surplus of AUSHD over the Random method (SAUSHD), which compares method $m$ to the Random baseline. Precisely,

$$\text{SAUSHD}^{T}_{m}:=\text{AUSHD}^{T}_{m}-\mathbb{E}\left[\text{AUSHD}^{T}_{Random}\right], \tag{10}$$

where the expectation averages all randomness sources (e.g. stemming from the initialization). Again, smaller values indicate a better method.
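Equations 9 and 10 translate directly into code. In the sketch below, an SHD curve is a list whose entry at position $t-1$ is the SHD after $t$ interventional batches, and the expectation over the Random baseline is approximated by an average over several runs. Both the representation and the toy numbers are assumptions made for illustration.

```python
def aushd(shd_curve):
    """Mean SHD over T rounds; shd_curve[t - 1] is the SHD after t batches."""
    return sum(shd_curve) / len(shd_curve)


def saushd(method_curve, random_curves):
    """AUSHD of a method minus the average AUSHD of the Random baseline."""
    expected_random = sum(aushd(c) for c in random_curves) / len(random_curves)
    return aushd(method_curve) - expected_random


# Toy SHD curves for T = 5 rounds.
method_curve = [10, 6, 3, 1, 0]
random_curves = [[10, 8, 6, 4, 2], [10, 9, 5, 3, 3]]  # two seeds of Random
print(aushd(method_curve))                  # 4.0
print(saushd(method_curve, random_curves))  # -2.0
```

A negative SAUSHD, as in this toy example, means that the method accumulates less SHD over the $T$ rounds than the Random baseline does on average.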

5.2 Main Result: GIT’s Empirical Performance

Figure 2: The distribution of SAUSHD (see Equation 10), calculated using 25 seeds, for synthetic graphs (lower is better). The intense color (left-hand side of each violin plot) indicates the low data regime ($N=1056$ samples). The faded color (right-hand side of each violin plot) represents a higher amount of data ($N=3200$ samples). Note that even though the solution quality is improved when more samples are available, typically, SAUSHD is smaller in the low data regime. This is because it measures relative improvement over the random baseline, which is most visible for the small number of samples in most methods.
GIT’s Overall Strong Performance

We evaluate GIT on 24 training setups: twelve graphs (synthetic and real-world, six in each category) and two data regimes. GIT is the best or comparable to the baseline methods (excluding GIT-privileged) in 18 cases according to mean AUSHD, and 17 cases according to mean SHD, see Table 1. Additionally, GIT is stable: the distribution of its AUSHD most frequently has the smallest variation among non-privileged methods ($11$ out of $24$ cases), see Table 8 and Table 9 in Appendix F.2.2. In terms of pairwise comparison with other methods, GIT is better in 45 cases and comparable in 35 cases, out of a total of 96 ($=24$ setups $\times\,4$ other methods), see Table 7 in Appendix F.2.1. Interestingly, GIT's performance for graphs with fewer nodes (cancer, earthquake) is less impressive. We hypothesize that this is because in these cases, the corresponding Markov Equivalence Class is a singleton (see Figure 4). Consequently, they require less interventional data to converge (see training curves in Appendix F.2.4), which diminishes the impact of different intervention strategies.

GIT is Especially Efficient for Low Data
Figure 3: The distribution of SAUSHD (see Equation 10), calculated using 25 seeds, for real-world graphs (lower is better). The intense color (left-hand side of each violin plot) indicates the low data regime ($N=1056$ samples). The faded color (right-hand side of each violin plot) represents a higher amount of data ($N=3200$ samples). Notice that the two plots have different scales.

In the low data regime ($N=1056$), GIT is better or comparable to all the other non-privileged methods for 11 out of 12 graphs, see Table 1. Pictorially, this phenomenon can be seen in Figure 2 and Figure 3, where the left-hand side of the GIT violin plot tends to display the most favorable behavior compared to AIT, CBED, and Random methods. This suggests that GIT could be a good choice when access to interventional data is limited or costly.

GIT Outperforms MI-based Approaches

We also notice that the performance of MI-based approaches (CBED and AIT) is worse than that of GIT, typically attaining significantly worse AUSHD (see Figure 2 and Figure 3) and SHD values (see Figure 7 and Figure 8 in Appendix F.2). This problem is further corroborated in Section 5.3, where we show that even in the case of large interventional batch size, these methods occasionally underperform Random, unlike GIT, which clearly wins in such a scenario. We hypothesize that the poor performance comes from approximation errors and model mismatches, which subvert the MI criterion; with exact mutual information computation, this criterion should lead to near-optimal decisions (Krause and Guestrin, 2005; Nemhauser et al., 1978; Tigas et al., 2022).

GIT Approximates GIT-privileged’s Decisions

GIT-privileged performs the best, as it is better or comparable with all other methods for each graph and data regime (see Table 1). This strong performance is also visible in Figure 2 and Figure 3, where the mass of the method consistently occupies the favorable regions of the SAUSHD metric. These results solidify the perception of GIT-privileged as a soft upper bound. Importantly, GIT follows it quite closely: the methods are equivalent in terms of performance in $10$ cases in the low data regime, and in 5 cases in the regular data regime. Furthermore, the choices of GIT and GIT-privileged correlate highly (Spearman correlation equal to $73$%), see Appendix F.6. These results provide additional evidence in favor of GIT's soundness and suggest that using data sampled from the model to compute GIT's scores does not lead to severe performance deterioration. The training curves and more detailed results can be found in Appendix F.2.

5.3 Performance under larger interventional batch size

ENCO is sensitive to errors in the estimation of properties of interventional data. In particular, small interventional batch size may cause errors in the estimation of conditional likelihood and disrupt the causal discovery process (Lippe et al., 2022, Appendix B.2.3). We hypothesize that those estimation errors are an important factor hindering the advantage of using our method over Random in the larger data regime. Acquiring data with small batches may result in a misaligned gradient for the model and, in consequence, in the poor assessment of the next interventional target scores.

Table 2: Average AUSHD values (from 5 seeds) for experiments with interventional batch size equal to 1024.

| | AIT | CBED | Random | GIT (ours) | GIT-privileged |
|---|---|---|---|---|---|
| bidiag | 22.6 | 17.6 | 20.4 | 16.8 | 15.4 |
| chain | 11.4 | 8.2 | 10.2 | 8.0 | 7.7 |
| collider | 11.2 | 11.4 | 9.9 | 5.0 | 4.8 |
| full | 120.1 | 116.0 | 101.1 | 100.9 | 93.2 |
| jungle | 22.3 | 16.7 | 19.9 | 11.4 | 10.6 |
| random | 38.4 | 36.2 | 32.4 | 29.9 | 28.3 |

We perform an additional experiment in a modified regime, where each intervention yields $1024$ data points instead of the previous $32$. Such a regime is relevant in scenarios where setting up an intervention with a new target is costly but obtaining the individual samples is relatively cheap. We run the experiment on synthetic graphs with $25$ nodes for $25$ acquisition rounds. We present the AUSHD values in Table 2 and full SHD curves in Appendix F.3. In this setting, GIT outperforms all the standard baselines and is on par with GIT-privileged. Importantly, GIT reaches the SHD value of $0$ for all graphs. Additionally, we found that GIT selects each intervention target exactly once, except for the chain graph, for which the discovery process converges after only 15 rounds.

5.4 Investigating GIT’s intervention target distributions

Figure 4: The interventional target distributions obtained by different strategies on real-world data. The probability is represented by the intensity of the node's color. The green color represents the edges for which there exists a graph in the Markov Equivalence Class that has the corresponding connection reversed. The number below each graph denotes the entropy of the distribution.
Figure 5: Histograms of intervention targets chosen by GIT. In this experiment, a node $v$ was chosen (denoted by a red color; $v$'s parents are indicated by green). Parameters were initialized so that the model is only unsure about the neighborhood of $v$. The solid lines denote known edges and dashed ones are to be discovered.

To gain a qualitative understanding of GIT's behavior, we analyze the node distributions generated by the respective methods on the BnLearn graphs in Figure 4. We observe that GIT often selects nodes with high out-degree, as visible in the sachs and child graphs. Intuitively, interventions on such nodes bring much information, as they affect multiple other nodes. In addition, the most frequently selected nodes in the sachs, child, and asia graphs are also adjacent to the edges for which there exists a graph in the MEC that has the corresponding connection reversed (as indicated by the green color in Figure 4). Note that in general, establishing the directionality of such an edge $(v,w)$ requires performing interventions on nodes $v,w$ (recall Section 3.1). (For example, in the ENCO framework the directionality parameter $\theta_{ij}$ can only be reliably detected from the data obtained by intervening either on variable $X_{j}$ or $X_{i}$ (Lippe et al., 2022).)

We further explore the interventional targets and verify that GIT is able to target the most uncertain regions of the graph. In the considered setup, we select a node $v$ in the graph. Let $E_{v}$ be the edges adjacent to $v$. We set the structural parameters corresponding to edges $e\notin E_{v}$ to the ground truth values and initialize the parameters for $e\in E_{v}$ in the standard way. Such a model is only unsure about the connectivity around $v$, while the rest of the solution is given. We then run the ENCO framework with GIT and report the intervention target distributions in Figure 5.

The interventions concentrate on $v$ (red color) and its parents (green color). This indicates the efficiency of our approach, as these are most relevant to discovering the graph structure. Indeed, to recover the solution, only the parameters for $e\in E_{v}$ need to be found. Intervening on $v$ changes the distributions of its descendants, providing information on the existence of edges between these variables.

6 Limitations and future work

  • The theoretical grounding of the method involves multiple assumptions. Further work that simplifies or relaxes the assumptions and identifies failure cases would benefit the community.

  • We provide proof that epsilon-greedy GIT converges with any causal discovery framework. As for pure GIT, we show its convergence only with the ENCO framework. The development of a more general theory that solidifies the approach is a promising future work direction.

  • Our method can be applied in the soft-intervention case, and providing appropriate experimental evaluation would be an interesting follow-up to this work.

  • Our method may need more interventions than the minimal number required to identify the causal structure. For example, GIT can be biased towards high-degree nodes, as interventions on them tend to affect a larger number of structural parameters and result in larger gradients, which might cause suboptimal choices.

  • Intervention acquisition methods (including GIT) seem to be less effective in a continuous setting. We believe investigating this area would benefit the community.

7 Conclusions

In this paper, we consider the problem of experimental design for causal discovery. We introduce a novel Gradient-based Intervention Targeting (GIT) method, which leverages the gradients of gradient-based causal discovery objectives to score intervention targets. We demonstrate that the method is particularly effective in the low-data regime, outperforming competitive baselines. We also provide a theoretical justification for the method and perform several analyses, confirming that GIT typically selects informative targets.

Acknowledgments and Disclosure of Funding

The work of Piotr Miłoś was supported by the Polish National Science Center grant UMO-2017/26/E/ST6/00622 and UMO-2019/35/O/ST6/03464. The work of Michał Zając was supported by the Polish National Science Center Grant No. 2021/43/B/ST6/01456. The research of Michał Zając and Aleksandra Nowak has been supported by a flagship project entitled “Artificial Intelligence Computing Center Core Facility” from the DigiWorld Priority Research Area under the Strategic Programme Excellence Initiative at Jagiellonian University. We gratefully acknowledge Polish high-performance computing infrastructure PLGrid (HPC Centers: ACK Cyfronet AGH) for providing computer facilities and support within computational grant no. PLG/2022/015443. Our experiments were managed using https://neptune.ai. We thank the Neptune team for providing us access to the team version and technical support. We thank Swedish National Supercomputing and the Berzelius Cluster for providing compute resources.

References

  • Agrawal et al. (2019) Raj Agrawal, Chandler Squires, Karren D. Yang, Karthikeyan Shanmugam, and Caroline Uhler. Abcd-strategy: Budgeted experimental design for targeted causal structure discovery. In Kamalika Chaudhuri and Masashi Sugiyama, editors, The 22nd International Conference on Artificial Intelligence and Statistics, AISTATS 2019, 16-18 April 2019, Naha, Okinawa, Japan, volume 89 of Proceedings of Machine Learning Research, pages 3400–3409. PMLR, 2019. URL http://proceedings.mlr.press/v89/agrawal19b.html.
  • Annadani et al. (2021) Yashas Annadani, Jonas Rothfuss, Alexandre Lacoste, Nino Scherrer, Anirudh Goyal, Yoshua Bengio, and Stefan Bauer. Variational causal networks: Approximate bayesian inference over causal structures. arXiv preprint arXiv:2106.07635, 2021.
  • Ash et al. (2020) Jordan T. Ash, Chicheng Zhang, Akshay Krishnamurthy, John Langford, and Alekh Agarwal. Deep batch active learning by diverse, uncertain gradient lower bounds. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net, 2020. URL https://openreview.net/forum?id=ryghZJBKPS.
  • Bengio et al. (2020) Yoshua Bengio, Tristan Deleu, Nasim Rahaman, Nan Rosemary Ke, Sébastien Lachapelle, Olexa Bilaniuk, Anirudh Goyal, and Christopher J. Pal. A meta-transfer objective for learning to disentangle causal mechanisms. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net, 2020. URL https://openreview.net/forum?id=ryxWIgBFPS.
  • Brouillard et al. (2020) Philippe Brouillard, Sébastien Lachapelle, Alexandre Lacoste, Simon Lacoste-Julien, and Alexandre Drouin. Differentiable causal discovery from interventional data. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 21865–21877. Curran Associates, Inc., 2020.
  • Castro et al. (2020) Daniel C Castro, Ian Walker, and Ben Glocker. Causality matters in medical imaging. Nature Communications, 11(1):1–10, 2020.
  • Cundy et al. (2021) Chris Cundy, Aditya Grover, and Stefano Ermon. Bcd nets: Scalable variational approaches for bayesian causal discovery. Advances in Neural Information Processing Systems, 34, 2021.
  • Deleu et al. (2022) Tristan Deleu, António Góis, Chris Emezue, Mansi Rankawat, Simon Lacoste-Julien, Stefan Bauer, and Yoshua Bengio. Bayesian structure learning with generative flow networks. In James Cussens and Kun Zhang, editors, Uncertainty in Artificial Intelligence, Proceedings of the Thirty-Eighth Conference on Uncertainty in Artificial Intelligence, UAI 2022, 1-5 August 2022, Eindhoven, The Netherlands, volume 180 of Proceedings of Machine Learning Research, pages 518–528. PMLR, 2022. URL https://proceedings.mlr.press/v180/deleu22a.html.
  • Eberhardt (2008) Frederick Eberhardt. Almost optimal intervention sets for causal discovery. In David A. McAllester and Petri Myllymäki, editors, UAI 2008, Proceedings of the 24th Conference in Uncertainty in Artificial Intelligence, Helsinki, Finland, July 9-12, 2008, pages 161–168. AUAI Press, 2008. URL https://dslpitt.org/uai/displayArticleDetails.jsp?mmnu=1&smnu=2&article_id=1948&proceeding_id=24.
  • Ebert-Uphoff and Deng (2012) Imme Ebert-Uphoff and Yi Deng. Causal discovery for climate research using graphical models. Journal of Climate, 25(17):5648–5665, 2012.
  • Geffner et al. (2022) Tomas Geffner, Javier Antoran, Adam Foster, Wenbo Gong, Chao Ma, Emre Kiciman, Amit Sharma, Angus Lamb, Martin Kukla, Nick Pawlowski, et al. Deep end-to-end causal inference. arXiv preprint arXiv:2202.02195, 2022.
  • Ghassami et al. (2018) AmirEmad Ghassami, Saber Salehkaleybar, Negar Kiyavash, and Elias Bareinboim. Budgeted experiment design for causal structure learning. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1724–1733. PMLR, 10–15 Jul 2018. URL https://proceedings.mlr.press/v80/ghassami18a.html.
  • Ghassami et al. (2019) AmirEmad Ghassami, Saber Salehkaleybar, and Negar Kiyavash. Interventional experiment design for causal structure learning. CoRR, abs/1910.05651, 2019. URL http://arxiv.org/abs/1910.05651.
  • Glymour et al. (2019) Clark Glymour, Kun Zhang, and Peter Spirtes. Review of causal discovery methods based on graphical models. Frontiers in genetics, 10:524, 2019.
  • Graves et al. (2017) Alex Graves, Marc G. Bellemare, Jacob Menick, Rémi Munos, and Koray Kavukcuoglu. Automated curriculum learning for neural networks. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, volume 70 of Proceedings of Machine Learning Research, pages 1311–1320. PMLR, 2017. URL http://proceedings.mlr.press/v70/graves17a.html.
  • Greenewald et al. (2019) Kristjan Greenewald, Dmitriy Katz, Karthikeyan Shanmugam, Sara Magliacane, Murat Kocaoglu, Enric Boix Adsera, and Guy Bresler. Sample efficient active learning of causal trees. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper/2019/file/5ee5605917626676f6a285fa4c10f7b0-Paper.pdf.
  • Hauser and Bühlmann (2012) Alain Hauser and Peter Bühlmann. Characterization and greedy learning of interventional markov equivalence classes of directed acyclic graphs. The Journal of Machine Learning Research, 13(1):2409–2464, 2012.
  • He and Geng (2008) Yang-Bo He and Zhi Geng. Active learning of causal networks with intervention experiments and optimal designs. Journal of Machine Learning Research, 9(84):2523–2547, 2008. URL http://jmlr.org/papers/v9/he08a.html.
  • Houlsby et al. (2011) Neil Houlsby, Ferenc Huszar, Zoubin Ghahramani, and Máté Lengyel. Bayesian active learning for classification and preference learning. CoRR, abs/1112.5745, 2011. URL http://arxiv.org/abs/1112.5745.
  • Ke et al. (2019) Nan Rosemary Ke, Olexa Bilaniuk, Anirudh Goyal, Stefan Bauer, Hugo Larochelle, Bernhard Schölkopf, Michael C. Mozer, Chris Pal, and Yoshua Bengio. Learning neural causal models from unknown interventions, 2019. URL https://arxiv.org/abs/1910.01075.
  • Kocaoglu et al. (2017) Murat Kocaoglu, Karthikeyan Shanmugam, and Elias Bareinboim. Experimental design for learning causal graphs with latent variables. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper/2017/file/291d43c696d8c3704cdbe0a72ade5f6c-Paper.pdf.
  • Krause and Guestrin (2005) Andreas Krause and Carlos Guestrin. Near-optimal nonmyopic value of information in graphical models. In Proceedings of the Twenty-First Conference on Uncertainty in Artificial Intelligence, UAI’05, page 324–331, Arlington, Virginia, USA, 2005. AUAI Press. ISBN 0974903914.
  • Lindgren et al. (2018) Erik M. Lindgren, Murat Kocaoglu, Alexandros G. Dimakis, and Sriram Vishwanath. Experimental design for cost-aware learning of causal graphs. In Samy Bengio, Hanna M. Wallach, Hugo Larochelle, Kristen Grauman, Nicolò Cesa-Bianchi, and Roman Garnett, editors, Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, pages 5284–5294, 2018. URL https://proceedings.neurips.cc/paper/2018/hash/ba3e9b6a519cfddc560b5d53210df1bd-Abstract.html.
  • Lindley (1956) David Lindley. On a measure of the information provided by an experiment. Annals of Mathematical Statistics, 27:986–1005, 1956.
  • Lippe et al. (2022) Phillip Lippe, Taco Cohen, and Efstratios Gavves. Efficient neural causal discovery without acyclicity constraints. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022. URL https://openreview.net/forum?id=eYciPrLuUhG.
  • Liu and Wang (2016) Qiang Liu and Dilin Wang. Stein variational gradient descent: A general purpose bayesian inference algorithm. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016. URL https://proceedings.neurips.cc/paper/2016/file/b3ba8f1bee1238a2f37603d90b58898d-Paper.pdf.
  • Lorch et al. (2021) Lars Lorch, Jonas Rothfuss, Bernhard Schölkopf, and Andreas Krause. Dibs: Differentiable bayesian structure learning. Advances in Neural Information Processing Systems, 34:24111–24123, 2021.
  • Murphy (2001) Kevin P Murphy. Active learning of causal bayes net structure. 2001.
  • Nemhauser et al. (1978) G. L. Nemhauser, L. A. Wolsey, and M. L. Fisher. An analysis of approximations for maximizing submodular set functions—i. Mathematical Programming, 14(1):265–294, Dec 1978. ISSN 1436-4646. doi: 10.1007/BF01588971. URL https://doi.org/10.1007/BF01588971.
  • Pearl (2009) Judea Pearl. Causality. Cambridge University Press, Cambridge, UK, 2 edition, 2009. ISBN 978-0-521-89560-6. doi: 10.1017/CBO9780511803161.
  • Peters et al. (2016) Jonas Peters, Peter Bühlmann, and Nicolai Meinshausen. Causal inference by using invariant prediction: identification and confidence intervals. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78(5):947–1012, 2016.
  • Peters et al. (2017) Jonas Peters, Dominik Janzing, and Bernhard Schölkopf. Elements of causal inference: foundations and learning algorithms. The MIT Press, 2017.
  • Sachs et al. (2005) Karen Sachs, Omar Perez, Dana Pe’er, Douglas A Lauffenburger, and Garry P Nolan. Causal protein-signaling networks derived from multiparameter single-cell data. Science, 308(5721):523–529, 2005.
  • Sanchez-Romero et al. (2019) Ruben Sanchez-Romero, Joseph D Ramsey, Kun Zhang, Madelyn RK Glymour, Biwei Huang, and Clark Glymour. Estimating feedforward and feedback effective connections from fmri time series: Assessments of statistical methods. Network Neuroscience, 3(2):274–306, 2019.
  • Scherrer et al. (2021) Nino Scherrer, Olexa Bilaniuk, Yashas Annadani, Anirudh Goyal, Patrick Schwab, Bernhard Schölkopf, Michael C Mozer, Yoshua Bengio, Stefan Bauer, and Nan Rosemary Ke. Learning neural causal models with active interventions. arXiv preprint arXiv:2109.02429, 2021.
  • Scherrer et al. (2022) Nino Scherrer, Anirudh Goyal, Stefan Bauer, Yoshua Bengio, and Nan Rosemary Ke. On the generalization and adaption performance of causal models. arXiv preprint arXiv:2206.04620, 2022.
  • Scutari (2010) Marco Scutari. Learning bayesian networks with the bnlearn r package. Journal of Statistical Software, 35(3):1–22, 2010. doi: 10.18637/jss.v035.i03. URL https://www.jstatsoft.org/index.php/jss/article/view/v035i03.
  • Settles et al. (2007) Burr Settles, Mark Craven, and Soumya Ray. Multiple-instance active learning. In J. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems, volume 20. Curran Associates, Inc., 2007. URL https://proceedings.neurips.cc/paper/2007/file/a1519de5b5d44b31a01de013b9b51a80-Paper.pdf.
  • Shen et al. (2020) Xinpeng Shen, Sisi Ma, Prashanthi Vemuri, and Gyorgy Simon. Challenges and opportunities with causal discovery algorithms: application to alzheimer’s pathophysiology. Scientific reports, 10(1):1–12, 2020.
  • Spirtes et al. (2000a) P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction, and Search. MIT press, 2nd edition, 2000a.
  • Spirtes et al. (2000b) Peter Spirtes, Clark N Glymour, Richard Scheines, David Heckerman, Christopher Meek, Gregory Cooper, and Thomas Richardson. Causation, prediction, and search. MIT press, 2000b.
  • Squires et al. (2020) Chandler Squires, Sara Magliacane, Kristjan H. Greenewald, Dmitriy Katz, Murat Kocaoglu, and Karthikeyan Shanmugam. Active structure learning of causal dags via directed clique trees. In Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020. URL https://proceedings.neurips.cc/paper/2020/hash/f57bd0a58e953e5c43cd4a4e5af46138-Abstract.html.
  • Tigas et al. (2022) Panagiotis Tigas, Yashas Annadani, Andrew Jesson, Bernhard Schölkopf, Yarin Gal, and Stefan Bauer. Interventions, where and how? experimental design for causal models at scale, 2022. URL https://arxiv.org/abs/2203.02016.
  • Tong and Koller (2001) Simon Tong and Daphne Koller. Active learning for structure in bayesian networks. In Bernhard Nebel, editor, Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence, IJCAI 2001, Seattle, Washington, USA, August 4-10, 2001, pages 863–869. Morgan Kaufmann, 2001.
  • Triantafillou et al. (2017) Sofia Triantafillou, Vincenzo Lagani, Christina Heinze-Deml, Angelika Schmidt, Jesper Tegner, and Ioannis Tsamardinos. Predicting causal relationships from biological data: Applying automated causal discovery on mass cytometry data of human immune cells. Scientific Reports, 7(1):12724, Oct 2017. ISSN 2045-2322. doi: 10.1038/s41598-017-08582-x. URL https://doi.org/10.1038/s41598-017-08582-x.
  • Tsamardinos et al. (2006) Ioannis Tsamardinos, Laura E. Brown, and Constantin F. Aliferis. The max-min hill-climbing bayesian network structure learning algorithm. Machine Learning, 65(1):31–78, Oct 2006. ISSN 1573-0565. doi: 10.1007/s10994-006-6889-7. URL https://doi.org/10.1007/s10994-006-6889-7.
  • Wu et al. (2022) Ji Q. Wu, Nanda Horeweg, Marco de Bruyn, Remi A. Nout, Ina M. Jürgenliemk-Schulz, Ludy C. H. W. Lutgens, Jan J. Jobsen, Elzbieta M. van der Steen-Banasik, Hans W. Nijman, Vincent T. H. B. M. Smit, Tjalling Bosse, Carien L. Creutzberg, and Viktor H. Koelzer. Automated causal inference in application to randomized controlled clinical trials. Nature Machine Intelligence, 4(5):436–444, May 2022. ISSN 2522-5839. doi: 10.1038/s42256-022-00470-y. URL https://doi.org/10.1038/s42256-022-00470-y.
  • Zheng et al. (2018) Xun Zheng, Bryon Aragam, Pradeep K Ravikumar, and Eric P Xing. Dags with no tears: Continuous optimization for structure learning. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URL https://proceedings.neurips.cc/paper/2018/file/e347c51419ffb23ca3fd5050202f9c3d-Paper.pdf.

Appendix

Appendix A Convergence of causal discovery with GIT

Suppose that we have some causal discovery algorithm $\mathcal{A}$ which is guaranteed to converge to the true graph in the limit of infinite data. Here we investigate if such convergence property still holds if we extend $\mathcal{A}$ with GIT.

Let us define $\epsilon$-greedy GIT as follows: every time we need to select an intervention target, we use GIT with probability $1-\epsilon$, and otherwise, we choose uniformly at random from all available targets.
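The selection rule defined above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `epsilon_greedy_target`, the `git_select` callable standing in for GIT's own choice, and the example scores are all hypothetical.

```python
import random
from typing import Callable, Optional, Sequence


def epsilon_greedy_target(
    targets: Sequence[int],
    git_select: Callable[[Sequence[int]], int],
    epsilon: float,
    rng: Optional[random.Random] = None,
) -> int:
    """Pick an intervention target with epsilon-greedy exploration."""
    if not 0.0 <= epsilon <= 1.0:
        raise ValueError("epsilon must lie in [0, 1]")
    rng = rng or random.Random()
    # With probability epsilon, explore: uniform choice over all targets.
    if rng.random() < epsilon:
        return rng.choice(list(targets))
    # Otherwise exploit: defer to the (hypothetical) GIT selection rule.
    return git_select(targets)


# Hypothetical stand-in for GIT: pick the target with the largest score.
# The scores are made up for illustration only.
scores = {0: 0.1, 1: 0.7, 2: 0.2}


def greedy(ts: Sequence[int]) -> int:
    return max(ts, key=scores.__getitem__)


rng = random.Random(0)
picks = [epsilon_greedy_target([0, 1, 2], greedy, 0.3, rng) for _ in range(2000)]
counts = {t: picks.count(t) for t in (0, 1, 2)}
print(counts)
```

With a positive `epsilon`, every target keeps being selected from time to time, while the target preferred by the stand-in rule still dominates; this is the behavior the convergence argument relies on.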

Proposition 1.

If the causal discovery algorithm $\mathcal{A}$ is guaranteed to converge given an infinite amount of samples from each possible intervention target, then $\mathcal{A}$ with $\epsilon$-greedy GIT is also guaranteed to converge.

Proof.

Since the $\epsilon$-exploration guarantees visiting every target infinitely many times in the limit, the proof follows from the asserted convergence of $\mathcal{A}$. ∎
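The exploration step used in the proof can be made explicit. The following is a sketch under assumptions not stated in the source: $N$ denotes the (finite) number of available targets, $E_{t,i}$ denotes the event that round $t$ is an exploration round that selects target $i$, and the exploration draws are independent across rounds.

$$\Pr(E_{t,i}) = \frac{\epsilon}{N} > 0, \qquad \sum_{t=1}^{\infty} \Pr(E_{t,i}) = \infty \;\Longrightarrow\; \Pr\big(E_{t,i} \text{ occurs for infinitely many } t\big) = 1.$$

The implication is the second Borel–Cantelli lemma. Because there are finitely many targets, the statement holds for all targets simultaneously with probability one.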

Remark 2.

ENCO with $\epsilon$-greedy GIT is guaranteed to converge to the true graph under the standard assumptions [Lippe et al., 2022, Appendix B.1].

Remark 3.

Proposition 1 is asymptotic and holds for arbitrary $\epsilon > 0$. However, in a finite setup, we can choose $\epsilon$ small enough that $\epsilon$-greedy GIT and GIT behave similarly. Our experiments show that GIT performs well (compared with other benchmarks) and is indistinguishable from an asymptotically convergent method.

Appendix B Convergence conditions of ENCO framework with GIT

B.1 Preliminaries (ENCO recap)

In this section, we recall results from Lippe et al. [2022] on the convergence of their causal discovery method ENCO. They formulate four theorems and a set of conditions that guarantee that the parameters of the structure converge to the true graph. For the full proofs and a detailed explanation, please refer to Appendix B in Lippe et al. [2022].

Remark 4.

ENCO identifies common conditions for correct convergence of directionality parameters. They go as follows (see theorems B.1, B.2, and appendix B.4 in ENCO):

  1. For all possible sets of parents of $X_j$ excluding $X_i$, adding $X_i$ improves the log-likelihood estimate of $X_j$ under the intervention on $X_i$, or leaves it unchanged.

    $$\forall \widehat{\text{pa}}(X_j) \subseteq X_{-i,j}: \mathbb{E}_{I_{X_i},\bm{X}}\left[\log p(X_j \mid \widehat{\text{pa}}(X_j), X_i) - \log p(X_j \mid \widehat{\text{pa}}(X_j))\right] \geq 0$$
  2. There exists a set of nodes $\widehat{\text{pa}}(X_j)$, for which the probability to be sampled as parents of $X_j$ is greater than 0, and the following condition holds:

    $$\exists \widehat{\text{pa}}(X_j) \subseteq X_{-i,j}: \mathbb{E}_{I_{X_i},\bm{X}}\left[\log p(X_j \mid \widehat{\text{pa}}(X_j), X_i) - \log p(X_j \mid \widehat{\text{pa}}(X_j))\right] > 0$$
  3. For all possible sets of parents of $X_i$ excluding $X_j$, adding $X_j$ does not improve the log-likelihood estimate of $X_i$ under the intervention on $X_j$, or leaves it unchanged.

    $$\forall \widehat{\text{pa}}(X_i) \subseteq X_{-i,j}: \mathbb{E}_{I_{X_j},\bm{X}}\left[\log p(X_i \mid \widehat{\text{pa}}(X_i), X_j) - \log p(X_i \mid \widehat{\text{pa}}(X_i))\right] \leq 0$$

    For at least one parent set $\widehat{\text{pa}}(X_i)$, which has a probability greater than zero to be sampled, the inequality is strict, i.e., the expectation is strictly smaller than zero.
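A toy calculation can make the sign pattern of these conditions concrete. The sketch below rests on assumptions that are not part of the source: a two-variable binary system with ground truth $X_1 \rightarrow X_2$, made-up probabilities, the empty set as the candidate parent set, conditionals that exactly match the observational distribution, and interventions that set the target uniformly at random.

```python
import math

# Assumed toy ground truth X1 -> X2 over binary variables (illustrative numbers).
p_x1 = {0: 0.5, 1: 0.5}
p_x2_given_x1 = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.1, 1: 0.9}}  # indexed [x1][x2]

# Observational joint, and the models that would be fit on observational data.
joint = {(a, b): p_x1[a] * p_x2_given_x1[a][b] for a in (0, 1) for b in (0, 1)}
p_x2 = {b: joint[(0, b)] + joint[(1, b)] for b in (0, 1)}
p_x1_given_x2 = {b: {a: joint[(a, b)] / p_x2[b] for a in (0, 1)} for b in (0, 1)}

uniform = {0: 0.5, 1: 0.5}  # assumed interventional distribution

# Intervention on X1: X1 is drawn uniformly, X2 still follows its mechanism.
# Expected gain in log-likelihood of X2 from adding X1 as a parent.
gain_x1_to_x2 = sum(
    uniform[a] * p_x2_given_x1[a][b]
    * (math.log(p_x2_given_x1[a][b]) - math.log(p_x2[b]))
    for a in (0, 1) for b in (0, 1)
)

# Intervention on X2: X2 is drawn uniformly, X1 keeps its marginal because
# X2 is not a cause of X1. Expected gain from adding X2 as a parent of X1.
gain_x2_to_x1 = sum(
    p_x1[a] * uniform[b]
    * (math.log(p_x1_given_x2[b][a]) - math.log(p_x1[a]))
    for a in (0, 1) for b in (0, 1)
)

print(f"gain for X1 -> X2 under do(X1): {gain_x1_to_x2:.3f}")  # about 0.368
print(f"gain for X2 -> X1 under do(X2): {gain_x2_to_x1:.3f}")  # about -0.511
```

In this example the causal direction yields a positive expected gain and the anti-causal direction a negative one, which matches the pattern the conditions ask for.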

Remark 5.

The following condition guarantees convergence of existence parameters (see theorem B.3 in ENCO):

$$\min_{\hat{\text{pa}} \subseteq \text{gpa}_i(X_j)} \mathbb{E}_{\hat{I} \sim p_{I_{-j}}(I)} \mathbb{E}_{\tilde{p}_{\hat{I}}(\bm{X})}\big[\log p(X_j \mid \hat{\text{pa}}, X_i) - \log p(X_j \mid \hat{\text{pa}})\big] > \lambda_{sparse}$$

where $\text{gpa}_i(X_j)$ is a set of nodes excluding $X_i$ which, according to the ground truth graph, could have an edge to $X_j$ without introducing a cycle, $p_{I_{-j}}(I)$ refers to the distribution over conducted interventions $p_I(I)$ excluding the intervention on variable $X_j$, and $\lambda_{sparse}$ is a positive constant.

Theorem 6.

(Theorem B.1 from Appendix B.4 in ENCO.) Consider the edge $X_i \rightarrow X_j$ in the true causal graph. The orientation parameter $\theta_{ij}$ converges to $\sigma(\theta_{ij}) = 1$ if the conditions from remark 4 are fulfilled.

Theorem 7.

(Theorem B.2 from Appendix B.4 in ENCO.) Consider a pair of variables $X_i$, $X_j$ for which $X_i$ is an ancestor of $X_j$ without a direct edge in the true causal graph. Assume all edges that appear in the true graph have converged according to theorem 6. The orientation parameter $\theta_{ij}$ converges to $\sigma(\theta_{ij}) = 1$ if the conditions from remark 4 are fulfilled.

By Appendix B.4 from ENCO, theorems 6 and 7 hold regardless of whether we collected interventional data from node $X_i$ or $X_j$.

Theorem 8.

Consider an edge $X_i \rightarrow X_j$ in the true causal graph. The parameter $\gamma_{ij}$ converges to $\sigma(\gamma_{ij}) = 1$ if the condition from remark 5 holds.

Theorem 9.

Assume for all edges $X_i \rightarrow X_j$ in the true causal graph, $\sigma(\theta_{ij})$ and $\sigma(\gamma_{ij})$ have converged to one. Then, the likelihood of all other edges, i.e. $\sigma(\theta_{lk}) \cdot \sigma(\gamma_{lk})$, will converge to zero under the condition that $\lambda_{sparse} > 0$.

B.2 GIT-privileged proof

We now prove that ENCO converges with the GIT-privileged acquisition method under the same set of conditions from remarks 4 and 5. We show that GIT-privileged collects interventional data for as long as is needed for the orientation parameters to converge according to theorems 6 and 7. Theorems 8 and 9 can then be applied to show that the algorithm has converged.

Assumption

We assume that in all local minima of our loss function, the existence parameters take extreme values: $\forall_{i,j}\ \sigma(\gamma_{ij}) \in \{0,1\}$, thus, when sufficient time for optimization is given, they stop contributing to the score. Hence in our analysis, we focus on describing only the behavior of orientation parameter gradients.

The proof is structured as follows:

  1. We show that following GIT-privileged score allows collecting enough interventional data to direct all edges that appear in the true graph correctly, see proposition 12.

  2. Then we show that, if required, additional interventional data that allows directing other edges according to theorem 7 will be collected, see proposition 13.

  3. Finally, theorems 8 and 9 can be applied to show that we learned the correct graph, see proposition 14.

Proposition 10.

Consider the edge $X_i \rightarrow X_j$ in the true causal graph. The parameter $\gamma_{ij}$ converges to $\sigma(\gamma_{ij}) = 1$ under any set of interventions $p_I(I)$ if

$$\min_{\hat{\text{pa}} \subseteq \mathcal{V}_{-i}} \mathbb{E}_{\hat{I} \sim p_{I_{-j}}(I)} \mathbb{E}_{\tilde{p}_{\hat{I}}(\bm{X})}\big[\log p(X_j \mid \hat{\text{pa}}, X_i) - \log p(X_j \mid \hat{\text{pa}})\big] > \lambda_{sparse}$$

where $\mathcal{V}_{-i}$ is the set of all nodes excluding $X_i$, and $p_{I_{-j}}(I)$ refers to the distribution over conducted interventions $p_I(I)$ excluding the intervention on variable $X_j$.

Proof.

The condition guarantees that the gradient of $\gamma_{ij}$ is positive. ∎

Proposition 11.

Consider the edge $X_i \rightarrow X_j$ in the true causal graph. When conditions from remark 4 are fulfilled, $\lVert \nabla_{\theta_{ij}} L_{\text{graph}}(X_{I_i}) \rVert = \lVert \nabla_{\theta_{ji}} L_{\text{graph}}(X_{I_i}) \rVert = \lVert \nabla_{\theta_{ij}} L_{\text{graph}}(X_{I_j}) \rVert = \lVert \nabla_{\theta_{ji}} L_{\text{graph}}(X_{I_j}) \rVert = 0$ and the edge converged to its true value if and only if we acquired interventional data from $X_i$ or $X_j$.

Proof.

First, recall that ENCO does not update orientation parameters unless the interventional data was acquired from a neighboring node. Therefore, the gradient can only be zero before the intervention if the existence parameters converge to zero. This situation is guaranteed not to happen by proposition 10.

Second, when interventional data is acquired, theorem 6 implies that the edge converges to its true value.

Proposition 12.

Choosing intervention targets with GIT-privileged score allows collecting enough data to direct all edges that appear in the true graph properly if the following is true:

  • For any pair of variables $X_i$, $X_j$ without a direct edge in the true causal graph, when conditions from remark 4 are fulfilled, and we acquired interventional data from either $X_i$ or $X_j$, and sufficient time for the optimization process is given, then the orientation parameters will converge to extreme values $\sigma(\theta_{ij}), \sigma(\theta_{ji}) \in \{0,1\}$ or the existence parameters will converge to $\sigma(\gamma_{ij}) = \sigma(\gamma_{ji}) = 0$.

Proof.

The assumption and proposition 11 imply that after collecting interventional data from a node, edges that are connected to this node will not contribute to the score anymore. Thus we will not intervene on the same node twice. On the other hand, proposition 11 guarantees the score of edges that appear in the true graph will be positive until they are directed.

Proposition 13.

Consider a pair of variables $X_i$, $X_j$ for which $X_i$ is an ancestor of $X_j$ without a direct edge in the true causal graph. Assume all edges that appear in the true graph have converged according to theorem 6. When conditions from remark 4 are fulfilled, and if $\lVert \nabla_{\theta_{ij}} L_{\text{graph}}(X_{I_i}) \rVert = \lVert \nabla_{\theta_{ji}} L_{\text{graph}}(X_{I_i}) \rVert = \lVert \nabla_{\theta_{ij}} L_{\text{graph}}(X_{I_j}) \rVert = \lVert \nabla_{\theta_{ji}} L_{\text{graph}}(X_{I_j}) \rVert = 0$, then the edge converged as described in theorem 7 or its existence parameters converged to 0.

Proof.

To zero out the gradient, either the orientation or the existence parameters had to converge. If the orientation parameters converged, we had to collect interventional data from a neighboring node (because otherwise, ENCO does not update parameters). Thus, by theorem 7, we know that the edge converged to its true value. ∎

Proposition 14.

Given sufficient acquisition rounds and time for optimization, ENCO with the GIT-privileged intervention acquisition method will recover the true graph.

Proof.

From propositions 12 and 13 we have that GIT-privileged will collect interventional data from new nodes until edges that appear in the true graph are correctly directed and edges between pairs of variables $X_i$, $X_j$ without a direct edge in the true causal graph either disappear from the model or are directed according to theorem 7. By theorems 8 and 9 we conclude that we indeed acquired enough interventional data to converge to the correct graph.

B.3 GIT convergence

Proposition 15.

Given sufficient acquisition rounds and time for optimization, ENCO with the GIT intervention acquisition method will recover the true graph if the following is true:

  • For any graph $G$ sampled from the structural belief $\mathbb{P}_{\rho}(\cdot)$ (recall equation 5) during GIT score estimation, the theorems and propositions from sections B.1 and B.2 hold when we use $G$ instead of the true graph and compute gradient using data from the sampled model $\mathbb{P}_{G,\phi,i}(X)$ (recall equation 6).

Proof.

First, note that thanks to proposition 10, if there is an edge $X_i \rightarrow X_j$ in the true graph there exists a model, that can be sampled from the belief with non-zero probability, in which this edge appears.

For each undirected edge $X_i \rightarrow X_j$ in the structural belief for which the existence parameter converged to $\sigma(\gamma_{ij}) = 1$, there exists a model (with a positive probability of being sampled) that will yield a gradient of positive magnitude if and only if no interventional data has been acquired from a node connected to it. This model contains the edge $X_i \rightarrow X_j$; thus, by the assumption made above and proposition 11, it will yield a positive gradient. In consequence, the expectation over all possible models, when the edge is not yet directed, will yield a positive score.

Note that, when interventional data from $X_i$ or $X_j$ is acquired, the edge $X_i \rightarrow X_j$ is directed and it does not yield a gradient of positive magnitude under data sampled from $\mathbb{P}_{G,\phi,i}(X)$ for any $G \sim \mathbb{P}_{\rho}(\cdot)$. This stems from the fact that the gradient term zeroes out when parameter $\sigma(\theta_{ij})$ takes extreme values (0 or 1).

Hence, the GIT score allows us to sequentially "eliminate" undirected edges. Since we update our structural belief $\mathbb{P}_{\rho}(\cdot)$ using interventional data sampled from the true graph, once all edges are directed we are guaranteed (by Section B.2) that they are directed according to Theorems 6 and 7. Then the same argument as for GIT-privileged can be applied to show that we converge to the true graph. ∎

Appendix C Details about Employed Causal Discovery Frameworks

C.1 ENCO

We extend the description of the ENCO framework [Lippe et al., 2022] from Section 4.1.

Structural Parameters.

ENCO learns a distribution over graph structures by associating with each edge $(i,j)$, for which $i \neq j$, a probability $p_{i,j} = \sigma(\gamma_{i,j})\sigma(\theta_{i,j})$. Intuitively, the $\gamma_{i,j}$ parameter represents the existence of the edge, while $\theta_{i,j} = -\theta_{j,i}$ is associated with the direction of the edge. The parameters $\gamma_{i,j}$ and $\theta_{i,j}$ are updated in the graph fitting stage.
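As a minimal sketch of this parameterization (not the ENCO implementation), the edge probabilities can be computed from the two parameter matrices as follows. The function names and the numerical values are illustrative assumptions.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def edge_probabilities(gamma, theta):
    """Return p[i][j] = sigmoid(gamma[i][j]) * sigmoid(theta[i][j]), with p[i][i] = 0."""
    n = len(gamma)
    return [
        [0.0 if i == j else sigmoid(gamma[i][j]) * sigmoid(theta[i][j]) for j in range(n)]
        for i in range(n)
    ]


# Illustrative values (assumed, not from the paper): the edge between nodes 0 and 1
# is likely to exist, and the orientation parameter favours 0 -> 1.
# Note the antisymmetry theta[1][0] == -theta[0][1].
gamma = [[0.0, 2.0], [2.0, 0.0]]
theta = [[0.0, 1.5], [-1.5, 0.0]]
p = edge_probabilities(gamma, theta)
```

Because $\theta_{i,j} = -\theta_{j,i}$ implies $\sigma(\theta_{i,j}) + \sigma(\theta_{j,i}) = 1$, in this example (where $\gamma_{0,1} = \gamma_{1,0}$) the probabilities of the two orientations sum to $\sigma(\gamma_{0,1})$.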

Distribution Fitting Stage.

The goal of the distribution fitting stage is to learn the conditional probabilities $P(X_i \mid PA_{(i,C)})$ for each variable $X_i$, given a graph represented by an adjacency matrix $C$ sampled as $C_{i,j} \sim \mathrm{Bernoulli}(p_{i,j})$. Note that self-loops are not allowed, and thus $p_{i,i} = 0$. The conditionals are modeled by neural networks $f_{\phi_i}$ with an input dropout mask defined by the adjacency matrix. Consequently, the negative log-probability of a variable can be expressed as $L_C(X_i) = -\log f_{\phi_i}(PA_{(i,C)})(X_i)$, where $PA_{(i,C)}$ is obtained by computing $C_{\cdot,i} \odot X$, with $\odot$ denoting element-wise multiplication. The optimization objective for this stage is to minimize the negative log-likelihood (NLL) of the observational data over the masks $C_{\cdot,i}$. Under the assumption that the distributions satisfy the Markov factorization property defined in Equation 1, the NLL can be expressed as:

$$L_D = \mathbb{E}_X \mathbb{E}_C \left[ \sum_{i=1}^{n} L_C(X_i) \right]. \tag{11}$$
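The objective in Equation 11 can be illustrated with a small Monte Carlo sketch: sample a mask $C$, zero out the non-parents of each variable, and accumulate the per-variable negative log-probabilities. The function `toy_conditional` is a hypothetical stand-in for the networks $f_{\phi_i}$, and masking raw category indices (rather than embedded inputs) is a simplifying assumption.

```python
import math
import random


def sample_adjacency(p, rng):
    """Draw C[i][j] ~ Bernoulli(p[i][j]) independently for every ordered pair."""
    n = len(p)
    return [[1 if rng.random() < p[i][j] else 0 for j in range(n)] for i in range(n)]


def masked_parents(C, x, i):
    """Compute C[:, i] * x element-wise: keep sampled parents of X_i, zero the rest."""
    return [C[k][i] * x[k] for k in range(len(x))]


def toy_conditional(masked_input, num_categories=2):
    """Hypothetical stand-in for f_phi_i: a fixed softmax over the categories."""
    s = sum(masked_input)
    logits = [0.5 * s * c for c in range(num_categories)]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]


def estimate_nll(p, data, num_graph_samples, rng):
    """Monte Carlo estimate of L_D: average over X and C of sum_i L_C(X_i)."""
    total = 0.0
    for x in data:
        for _ in range(num_graph_samples):
            C = sample_adjacency(p, rng)
            total += sum(
                -math.log(toy_conditional(masked_parents(C, x, i))[x[i]])
                for i in range(len(x))
            )
    return total / (len(data) * num_graph_samples)


rng = random.Random(0)
data = [[0, 1, 1], [1, 0, 1]]            # two samples of three binary variables
p_empty = [[0.0] * 3 for _ in range(3)]  # no edge is ever sampled
nll_empty = estimate_nll(p_empty, data, num_graph_samples=4, rng=rng)
```

With all edge probabilities set to zero, every mask is empty and the toy conditional is uniform, so the estimate reduces to $3 \log 2$ for three binary variables.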
Graph Fitting Stage and Implementation of Interventions.

The graph fitting stage updates the structural parameters $\theta$ and $\gamma$ defining the graph distribution. After selecting an intervention target $I$, ENCO samples data from the post-interventional distribution $\widetilde{P}_I$. In the experiments in this paper, where the variables are assumed to be categorical, the intervention is implemented by changing the target node's conditional to a uniform distribution over the node's categories. As the loss, ENCO uses the graph structure loss $L_{graph}$ defined in Equation 7 in the main text plus a regularization term $\lambda L_{\gamma,\theta}^{sparse}$ that influences the sparsity of the generated adjacency matrices, where $\lambda$ is the regularization strength.

Gradient Estimators.

To update the structural parameters $\gamma$ and $\theta$, ENCO uses REINFORCE-inspired gradient estimators. For each parameter $\gamma_{i,j}$, the gradient is defined as:

$$\frac{\partial L_G}{\partial \gamma_{i,j}} = \sigma'(\gamma_{i,j})\,\sigma(\theta_{i,j}) \cdot \mathbb{E}_{\mathbf{X}, C_{-ij}}\left[ L_{X_i \to X_j}(X_j) - L_{X_i \not\to X_j}(X_j) + \lambda \right], \tag{12}$$

where $\mathbb{E}_{\mathbf{X}, C_{-ij}}$ denotes all three expectations in Equation 7 (in the main text), but excluding the edge $(i,j)$ from $C$. The term $L_{X_i \not\to X_j}(X_j)$ is the negative log-likelihood of the variable $X_j$ under the adjacency matrix $C_{-ij}$, while $L_{X_i \to X_j}(X_j)$ is the negative log-likelihood computed by including the edge $(i,j)$ in $C_{-ij}$. For the parameters $\theta_{i,j}$, the gradient is defined as:

$$\begin{aligned}
\frac{\partial L_G}{\partial \theta_{i,j}} = \sigma'(\theta_{i,j}) \cdot \Big( & p(I_i)\,\sigma(\gamma_{i,j})\,\mathbb{E}_{I_i, \mathbf{X}, C_{-ij}}\left[ L_{X_i \to X_j}(X_j) - L_{X_i \not\to X_j}(X_j) \right] \\
{} - {} & p(I_j)\,\sigma(\gamma_{j,i})\,\mathbb{E}_{I_j, \mathbf{X}, C_{-ij}}\left[ L_{X_j \to X_i}(X_i) - L_{X_j \not\to X_i}(X_i) \right] \Big),
\end{aligned} \tag{13}$$

where $p(I_i)$ is the probability of intervening on node $i$ (usually uniform) and $\mathbb{E}_{I_i, \mathbf{X}, C_{-ij}}$ is the same expectation as $\mathbb{E}_{\mathbf{X}, C_{-ij}}$ but under the intervention on node $i$.
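A minimal sketch of how Equations 12 and 13 could be evaluated from sampled loss values is given below. The expectations are replaced by sample means over precomputed negative log-likelihoods; the function names, argument layout, and numerical values are assumptions made for illustration and do not reproduce the ENCO code.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x: float) -> float:
    """Derivative of the logistic function."""
    s = sigmoid(x)
    return s * (1.0 - s)


def mean(values):
    return sum(values) / len(values)


def gamma_gradient(gamma_ij, theta_ij, nll_with_edge, nll_without_edge, lam):
    """Equation 12 with the expectation replaced by a sample mean."""
    diffs = [a - b + lam for a, b in zip(nll_with_edge, nll_without_edge)]
    return sigmoid_grad(gamma_ij) * sigmoid(theta_ij) * mean(diffs)


def theta_gradient(theta_ij, gamma_ij, gamma_ji, p_int_i, p_int_j,
                   diffs_under_i, diffs_under_j):
    """Equation 13; diffs_under_i holds L_{Xi->Xj}(Xj) - L_{Xi-/->Xj}(Xj) under I_i,
    and diffs_under_j holds L_{Xj->Xi}(Xi) - L_{Xj-/->Xi}(Xi) under I_j."""
    term_i = p_int_i * sigmoid(gamma_ij) * mean(diffs_under_i)
    term_j = p_int_j * sigmoid(gamma_ji) * mean(diffs_under_j)
    return sigmoid_grad(theta_ij) * (term_i - term_j)


# Including the edge lowers the NLL by 1 on average, so the gradient is negative
# and a gradient-descent step increases gamma (illustrative numbers).
g = gamma_gradient(0.0, 0.0, [1.0, 1.0], [2.0, 2.0], lam=0.0)

# When sigmoid(theta) saturates, sigma'(theta) vanishes and so does the gradient.
g_saturated = theta_gradient(40.0, 0.0, 0.0, 0.5, 0.5, [-1.0], [1.0])
```

The second call illustrates the saturation effect used in the convergence argument: once $\sigma(\theta_{i,j})$ reaches an extreme value, the factor $\sigma'(\theta_{i,j})$ removes the gradient signal for that edge.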

C.2 DiBS

DiBS [Lorch et al., 2021] is a Bayesian structure learning framework that performs posterior inference over graphs with gradient-based variational inference. This is achieved by parameterising the belief about the presence of an edge between any two nodes with corresponding learnable node embeddings. This turns the problem of discrete inference over graph structures into inference over continuous node embeddings, thereby opening up the possibility of using gradient-based inference techniques. To restrict the space of distributions to DAGs, the NOTEARS constraint [Zheng et al., 2018], which enforces acyclicity, is introduced as a prior through a Gibbs distribution.

Formally, for any two nodes $(i,j)$, the belief about the presence of the edge from $i$ to $j$ is parameterised as:

$$p(g_{ij} \mid u_i, v_j) = \frac{1}{1 + \exp(-\alpha\,(u_i^T v_j))} \tag{14}$$

Here, $g_{ij}$ is the random variable corresponding to the presence of an edge from $i$ to $j$, $\alpha$ is a tunable hyperparameter, and $u_i, v_j \in \mathbb{R}^k$ are embeddings corresponding to nodes $i$ and $j$. The entire set of learnable embeddings, i.e. $\mathbf{U} = \{u_i\}_{i=1}^{d}$, $\mathbf{V} = \{v_i\}_{i=1}^{d}$ and $\mathbf{Z} = [\mathbf{U}, \mathbf{V}] \in \mathbb{R}^{2 \times d \times k}$, forms the latent variables for which posterior inference needs to be performed. Such a posterior can then be used to perform Bayesian model averaging over the corresponding posterior over graph structures that it induces.
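Equation 14 can be sketched in a few lines. The embeddings and the value of $\alpha$ below are illustrative assumptions, and the helper names are not taken from the DiBS code base.

```python
import math


def edge_belief(u_i, v_j, alpha):
    """p(g_ij | u_i, v_j) = sigmoid(alpha * <u_i, v_j>), as in Equation 14."""
    dot = sum(a * b for a, b in zip(u_i, v_j))
    return 1.0 / (1.0 + math.exp(-alpha * dot))


def edge_belief_matrix(U, V, alpha):
    """Edge beliefs for all ordered pairs (i, j); the diagonal is not masked here."""
    d = len(U)
    return [[edge_belief(U[i], V[j], alpha) for j in range(d)] for i in range(d)]


# Two nodes with k = 2 dimensional embeddings (assumed values).
U = [[1.0, 0.0], [0.0, 1.0]]
V = [[0.0, 1.0], [1.0, 0.0]]
probs = edge_belief_matrix(U, V, alpha=2.0)
```

Orthogonal embeddings give a belief of exactly 0.5, and larger values of $\alpha$ push the beliefs for non-orthogonal pairs closer to 0 or 1.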

DiBS uses a variational inference framework and learns the posterior over the latent variables $\mathbf{Z}$ using SVGD [Liu and Wang, 2016]. SVGD uses a set of particles for each embedding $u_i$ and $v_j$, which form an empirical approximation of the posterior. These particles are then updated based on the gradient of the Evidence Lower Bound (ELBO) of the corresponding variational inference problem and a term that enforces diversity of the particles using kernels. The prior over the latent variable $\mathbf{Z}$ is given by a Gibbs distribution with temperature $\beta$, which enforces a soft acyclicity constraint:

$$p(\mathbf{Z}) \propto \exp\left(-\beta\, \mathbb{E}_{p(\mathbf{G} \mid \mathbf{Z})}\left[ h(\mathbf{G}) \right]\right) \prod_{ij} \mathcal{N}(z_{ij}; 0, \sigma_z^2) \tag{15}$$

Here, $h$ is the DAG constraint function given by NOTEARS [Zheng et al., 2018].
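For intuition, the sketch below evaluates an acyclicity function of the form $h(G) = \mathrm{tr}(e^{G \circ G}) - d$, which is the trace-of-matrix-exponential form introduced by Zheng et al. [2018]. Whether DiBS uses exactly this variant is an assumption here, and the matrix exponential is approximated by a truncated power series rather than a library routine.

```python
def matmul(A, B):
    """Plain dense matrix product for small square matrices."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def acyclicity(G, terms=30):
    """Approximate h(G) = tr(exp(G * G)) - d with a truncated power series.

    The k = 0 term of the series contributes tr(I) = d, which cancels the "- d",
    so only the terms k >= 1 are accumulated.
    """
    d = len(G)
    A = [[G[i][j] * G[i][j] for j in range(d)] for i in range(d)]  # Hadamard square
    power = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
    factorial = 1.0
    total = 0.0
    for k in range(1, terms + 1):
        power = matmul(power, A)
        factorial *= k
        total += sum(power[i][i] for i in range(d)) / factorial
    return total


chain = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]  # 0 -> 1 -> 2, acyclic
two_cycle = [[0, 1], [1, 0]]               # 0 <-> 1, cyclic
h_chain = acyclicity(chain)
h_cycle = acyclicity(two_cycle)
```

The function is zero for the acyclic chain and strictly positive for the two-cycle, which is why a Gibbs prior of the form in Equation 15 penalizes cyclic graphs.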

Appendix D Details about Intervention Targeting Methods

In this section, we briefly introduce the other intervention acquisition methods used for comparison in this work.

Active Intervention Targeting (AIT)

Assume that the structural graph distribution maintained by the causal discovery algorithm can be described by some parameters $\rho$. Consider a set of graphs $\mathcal{G} = \{\mathcal{G}_j\}$ sampled from this distribution. AIT assigns to each possible intervention target $i \in V$ a discrepancy score computed from the variance between the graphs ($VBG$) and the variance within the graphs ($VWG$). The $VBG_i$ for intervention $i$ is defined as:

$$VBG_i = \sum_j \langle \mu_{j,i} - \bar{\mu_i},\ \mu_{j,i} - \bar{\mu_i} \rangle, \tag{16}$$

where $\mu_{j,i}$ is the mean of all samples drawn from graph $\mathcal{G}_j$ under the intervention on target $i$, and $\bar{\mu_i}$ is the mean of all samples drawn from all graphs under the intervention on target $i$. The variance within graphs is described by:

$$VWG_i = \sum_j \sum_k \langle [S_{j,i}]_k - \mu_{j,i},\ [S_{j,i}]_k - \mu_{j,i} \rangle, \tag{17}$$

where $[S_{j,i}]_k$ is the $k$-th sample from graph $\mathcal{G}_j$ under the intervention on target $i$. The AIT score is then defined as the ratio $D_i = \frac{VBG_i}{VWG_i}$. The method then selects the intervention attaining the highest score $D_i$.
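The AIT score for a single intervention target can be sketched directly from Equations 16 and 17. The input layout (one list of sample vectors per sampled graph) and the toy numbers are assumptions for illustration; the sketch does not guard against a zero within-graph variance.

```python
def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[c] for v in vectors) / n for c in range(len(vectors[0]))]


def squared_distance(a, b):
    """Inner product <a - b, a - b>."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def ait_score(samples_per_graph):
    """Return D_i = VBG_i / VWG_i for one target i.

    samples_per_graph[j] holds the samples S_{j,i} drawn from graph G_j
    under the intervention on target i.
    """
    graph_means = [mean_vector(s) for s in samples_per_graph]
    pooled = [x for s in samples_per_graph for x in s]
    overall_mean = mean_vector(pooled)
    vbg = sum(squared_distance(m, overall_mean) for m in graph_means)
    vwg = sum(
        squared_distance(x, m)
        for s, m in zip(samples_per_graph, graph_means)
        for x in s
    )
    return vbg / vwg


# Two sampled graphs, two one-dimensional samples each (illustrative numbers).
samples = [[[0.0], [2.0]], [[4.0], [6.0]]]
score = ait_score(samples)
```

In this toy case the graph means are 1 and 5 around a pooled mean of 3, giving $VBG = 8$, $VWG = 4$, and a score of 2. A high score indicates that the sampled graphs disagree strongly about the outcome of the intervention.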

CBED Targeting

Bayesian Optimal Experimental Design for Causal Discovery (BOECD) selects the intervention with the highest information gain about the graph belief after observing the interventional data. Let the tuple $(j, v)$ define the intervention, where $j \in V$ describes the intervention target and $v$ represents the change in the conditional distribution of variable $X_j$. Specifically, the new conditional distribution of $X_j$ is a point mass concentrated on $v$. Moreover, let $Y_{(j,v)}$ denote the interventional distribution under the intervention $(j, v)$, and let $\psi$ denote the current belief about the graph structure (i.e. the random variable corresponding to the structural and distributional parameters $\psi = (\rho, \phi)$). BOECD selects the intervention that maximizes [Tigas et al., 2022]:

$$(j^*, v^*) = \operatorname*{arg\,max}_{(j,v)} I(Y_{(j,v)}; \psi \mid \mathcal{D}), \tag{18}$$

where $\mathcal{D}$ are the observational data. The above formulation necessitates the use of an MI estimator. One possible choice is a BALD-inspired estimator [Tigas et al., 2022, Houlsby et al., 2011]:

$$I(Y_{(j,v)}; \psi \mid \mathcal{D}) = H(Y_{(j,v)} \mid \mathcal{D}) - H(Y_{(j,v)}; \phi \mid \mathcal{D}), \tag{19}$$

with $H(\cdot\,;\cdot)$ denoting the cross-entropy. Note that this approach allows selecting not only the most informative target but also the value of the intervention.
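To give a rough idea of what a BALD-style estimate computes, the sketch below uses the standard decomposition for a categorical outcome: the entropy of the averaged prediction minus the average entropy of the individual predictions, where each prediction comes from one sample of the belief. This is an assumed simplification; it is not claimed to match the exact estimator behind Equation 19.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(q * math.log(q) for q in p if q > 0.0)


def bald_information_gain(predictive_dists):
    """Entropy of the mean prediction minus the mean entropy of the predictions.

    predictive_dists[s] is the categorical distribution over the outcome predicted
    by the s-th sample from the current belief (hypothetical input layout).
    """
    n = len(predictive_dists)
    k = len(predictive_dists[0])
    marginal = [sum(p[c] for p in predictive_dists) / n for c in range(k)]
    expected_conditional = sum(entropy(p) for p in predictive_dists) / n
    return entropy(marginal) - expected_conditional


# Belief samples that agree carry no information gain; samples that make
# confident but contradictory predictions carry the maximal gain log(2).
agree = bald_information_gain([[0.9, 0.1], [0.9, 0.1]])
disagree = bald_information_gain([[1.0, 0.0], [0.0, 1.0]])
```

An intervention is informative under this criterion when the sampled models are individually confident yet disagree with each other about the interventional outcome.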

Appendix E Additional Experimental Details

E.1 Synthetic Graphs Details

The synthetic graph structure is deterministic and is specified by the name of the graph (chain, collider, jungle, fulldag), except for random, where the structure is sampled. Following ENCO [Lippe et al., 2022], we set the only parameter of the sampling procedure, edge_prob, to $0.3$.

The ground truth conditional distributions of the causal graphs are modeled by randomly initialized MLPs. Additionally, a randomly initialized embedding layer that converts categorical values to real vectors is applied at the input of each MLP. We used the code provided by Lippe et al. [2022]. For a more detailed explanation, refer to Lippe et al. [2022, Appendix C.1.1].

E.2 ENCO Hyperparameters

For experiments with the ENCO framework, we used exactly the same parameters as reported by Lippe et al. [2022, Appendix C.1.1]. We provide them in Table 3 for completeness.

Table 3: Hyperparameters used for the ENCO framework.

| Parameter | Value |
|---|---|
| Sparsity regularizer $\lambda_{sparse}$ | $4 \times 10^{-3}$ |
| Distribution model | 2 layers, hidden size 64, LeakyReLU($\alpha = 0.1$) |
| Batch size | 128 |
| Learning rate - model | $5 \times 10^{-3}$ |
| Weight decay - model | $1 \times 10^{-4}$ |
| Distribution fitting iterations F | 1000 |
| Graph fitting iterations G | 100 |
| Graph samples K | 100 |
| Epochs | 30 |
| Learning rate - $\gamma$ | $2 \times 10^{-2}$ |
| Learning rate - $\theta$ | $1 \times 10^{-1}$ |

E.3 DiBS Hyperparameters

In Table 4, we present hyperparameters used for the DiBS framework.

Table 4: Hyperparameters used for the DiBS framework.

| Parameter | Value |
|---|---|
| Number of particles | 20 |
| Number of particle updates | 20 000 |
| Choice of kernel | $k([\mathbf{Z}, \Theta], [\mathbf{Z}', \Theta']) = \sigma_{\mathbf{Z}} \exp\left(-\frac{1}{h_{\mathbf{Z}}} \lVert \mathbf{Z} - \mathbf{Z}' \rVert_F^2\right) + \sigma_{\Theta} \exp\left(-\frac{1}{h_{\Theta}} \lVert \Theta - \Theta' \rVert_F^2\right)$ |
| $h_{\mathbf{Z}}$ | 5 |
| $h_{\Theta}$ | 500 |
| $\sigma_{\mathbf{Z}}$ | 1 |
| $\sigma_{\Theta}$ | 1 |
| Optimizer | RMSProp |
| Optimizer learning rate | 0.005 |

E.4 Computational Cost

We used two hardware settings, one with GPU: a single Nvidia A100, and another one with CPUs: 12 cores of Intel Xeon E5-2697 processor. In our synthetic graph experiments with ENCO on GPU, a single experiment takes on average 4 h to run, with 57 min being used by GIT to make its decisions; the rest is devoted to the underlying causal discovery algorithm (in this case, ENCO). In the case of the CPU setup, an experiment takes on average 126 h, with the GIT part taking up only 6 h. We estimate the project’s overall cost to be around 50K GPUh and 2M CPUh.

Appendix F Additional Experimental Results

F.1 Experiments in DiBS Framework

Experimental setup

The experimental setup closely follows the one from Tigas et al. [2022]. In the experiments, 10 batches of 10 data-points each are acquired. Each batch can contain various intervention targets. The acquisition method chooses intervention targets and values. For some of the methods, the GP-UCB strategy is used to select a value for a given intervention; see Tigas et al. [2022] for details. For every method, we run 40 random seeds. We compare the following methods:

  • Soft GIT (ours): gradient magnitudes corresponding to different interventions are normalized by the maximum one, then passed to the softmax function (with temperature $1$). The obtained scores are used as probabilities to sample a given intervention in the current batch. GP-UCB is used for value selection.

  • Random (fixed values): Intervention targets are chosen uniformly at random. The intervention value is fixed at $0$.

  • Random (uniform values): Intervention targets are chosen uniformly at random. The intervention value is chosen uniformly at random from the variable support.

  • Soft AIT: Intervention targets are chosen from the softmax probabilities of AIT scores [Scherrer et al., 2021], with the temperature $2$. GP-UCB is used for value selection.

  • Soft CBED: Intervention targets are chosen from the softmax probabilities of CBED scores [Tigas et al., 2022], with the temperature $0.2$. GP-UCB is used for value selection.
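The Soft GIT sampling rule from the list above can be sketched as follows. The gradient magnitudes are illustrative, the uniform fallback for all-zero magnitudes is an assumption, and value selection with GP-UCB is omitted.

```python
import math
import random


def soft_git_probabilities(grad_magnitudes, temperature=1.0):
    """Normalize magnitudes by the largest one, then apply a softmax."""
    largest = max(grad_magnitudes)
    if largest == 0.0:
        # Assumed fallback: no gradient signal, so sample targets uniformly.
        return [1.0 / len(grad_magnitudes)] * len(grad_magnitudes)
    scaled = [g / largest / temperature for g in grad_magnitudes]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]  # subtract max for stability
    z = sum(exps)
    return [e / z for e in exps]


def sample_batch_targets(probs, batch_size, rng):
    """Draw one intervention target per data point in the batch."""
    return rng.choices(range(len(probs)), weights=probs, k=batch_size)


rng = random.Random(0)
probs = soft_git_probabilities([0.2, 1.0, 0.5])  # one magnitude per candidate target
targets = sample_batch_targets(probs, batch_size=10, rng=rng)
```

Dividing by the maximum keeps the softmax inputs in $[0, 1]$, so the resulting distribution stays fairly flat and every target retains a non-negligible chance of being selected within a batch.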

The results are presented in Figure 6. We can see the performance of Soft GIT is comparable to that of Random (uniform values) in both considered graph classes. Soft AIT and Soft CBED behave similarly for Erdos-Renyi graphs, while for Scale-Free they seem to bring a small improvement.

Figure 6: Expected SHD metric for different acquisition methods on top of the DiBS framework, for graphs with 50 nodes and two different graph classes: Erdos-Renyi and Scale-Free. 95% bootstrap confidence intervals are shown.

F.2 Performance in ENCO Framework - All Results

F.2.1 Ranking Statistics

We present ranking statistics in Tables 5, 6, and 7.

Table 5: We count the number of training setups (24), where a given method was best or at least comparable to other methods (AIT, CBED, and Random; GIT-privileged was not compared against), basing on 90% confidence intervals for AUSHD. Each entry shows the total count, broken down into two data regimes, N=1056𝑁1056N=1056italic_N = 1056 and N=3200𝑁3200N=3200italic_N = 3200 resp., presented in the parenthesis.
AIT CBED Random GIT (ours) GIT-privileged
Best 0 (0 + 0) 0 (0 + 0) 2 (0 + 2) 8 (4 + 4) 5 (1 + 4)
Best or comparable 6 (2 + 4) 6 (4 + 2) 12 (5 + 7) 18 (11 + 7) 24 (12 + 12)
Table 6: We count the number of training setups (24), where a given method was best or at least comparable to other methods (AIT, CBED, and Random; GIT-privileged was not compared against), basing on 90% confidence intervals for SHD. Each entry shows the total count, broken down into two data regimes, N=1056𝑁1056N=1056italic_N = 1056 and N=3200𝑁3200N=3200italic_N = 3200 resp., presented in the parenthesis.
AIT CBED Random GIT (ours) priv. GIT
Best 1 (0 + 1) 1 (0 + 1) 2 (1 + 1) 1 (1 + 0) 3 (1 + 2)
Best or comparable 10 (4 + 6) 7 (4 + 3) 22 (12 + 10) 17 (10 + 7) 24 (12 + 12)
Table 7: For each method, we show its pairwise performance against the other methods (whether it is better, comparable, or worse) based on 90% confidence intervals for AUSHD, across the two data regimes ($N=1056$ and $N=3200$) and all twelve graphs (hence, for each method there are $2 \times 12 \times 4 = 96$ pairs to consider). Each entry shows the total count, with the breakdown into the two data regimes, $N=1056$ and $N=3200$ respectively, given in parentheses.
Method Better Comparable Worse
AIT 9 (3+6) 27 (11+16) 60 (34+26)
CBED 9 (7+2) 35 (20+15) 52 (21+31)
Random 34 (13+21) 36 (21+15) 26 (14+12)
GIT (ours) 45 (24+21) 35 (21+14) 16 (3+13)
GIT-privileged 57 (25+32) 39 (23+16) 0 (0+0)
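The captions above do not spell out the exact comparison rule, so the following is only a minimal sketch under an assumption: a method counts as better than another when its 90% confidence interval lies entirely below the other's (lower AUSHD is better), worse in the mirrored case, and comparable when the two intervals overlap. The helper names `compare`, `is_best`, and `is_best_or_comparable` are hypothetical; the intervals are copied from the asia, $N=1056$ row of Table 9.

```python
def compare(ci_a, ci_b):
    """Compare two (low, high) intervals of a lower-is-better metric.

    Assumed rule: disjoint intervals give a strict verdict, overlapping ones do not.
    """
    if ci_a[1] < ci_b[0]:
        return "better"
    if ci_a[0] > ci_b[1]:
        return "worse"
    return "comparable"


def is_best(method, intervals):
    """True if `method` is strictly better than every other method."""
    return all(
        compare(intervals[method], ci) == "better"
        for other, ci in intervals.items()
        if other != method
    )


def is_best_or_comparable(method, intervals):
    """True if no other method is strictly better than `method`."""
    return all(
        compare(intervals[method], ci) != "worse"
        for other, ci in intervals.items()
        if other != method
    )


# 90% confidence intervals for AUSHD (asia, N = 1056, Table 9).
asia_1056 = {
    "AIT": (2.9, 4.5),
    "CBED": (2.8, 4.3),
    "Random": (1.8, 2.1),
    "GIT": (2.0, 2.5),
}

for name in asia_1056:
    print(name, is_best(name, asia_1056), is_best_or_comparable(name, asia_1056))
```

Under this assumed rule, GIT and Random overlap on this row, so neither is strictly best, while both would count as best or comparable.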
Table 8: AUSHD with 90% confidence intervals (in parentheses), for synthetic data and for the low and regular data regimes ($N=1056$ and $N=3200$ respectively).
Graph N AIT BALD Random GIT (ours) GIT-privileged
bidiag 1056 24.7 (24.1, 25.5) 21.9 (21.1, 22.8) 22.0 (21.5, 22.7) 20.0 (19.5, 20.6) 19.9 (18.6, 20.9)
3200 14.0 (13.0, 15.4) 13.2 (12.5, 14.0) 11.1 (10.5, 12.1) 9.4 (9.0, 9.9) 9.3 (8.0, 10.3)
chain 1056 14.9 (14.4, 15.4) 12.2 (11.8, 12.7) 13.5 (13.1, 13.9) 11.7 (11.3, 12.1) 12.2 (11.4, 13.3)
3200 7.7 (7.3, 8.1) 7.2 (6.8, 7.7) 6.3 (6.0, 6.6) 5.6 (5.2, 6.0) 6.3 (5.2, 8.5)
collider 1056 16.0 (15.2, 16.7) 16.1 (15.5, 16.7) 14.6 (14.1, 15.1) 14.4 (13.4, 15.2) 11.8 (10.9, 13.0)
3200 10.9 (10.2, 11.7) 12.2 (11.6, 12.7) 9.7 (9.2, 10.3) 12.1 (10.9, 13.1) 7.8 (6.9, 8.8)
fulldag 1056 133.0 (131.2, 134.7) 141.6 (139.1, 144.2) 121.7 (120.4, 122.9) 119.8 (118.7, 120.8) 120.7 (119.1, 122.1)
3200 72.8 (71.0, 74.5) 100.6 (97.8, 103.8) 63.4 (62.0, 64.7) 67.9 (66.0, 70.3) 63.4 (61.2, 64.9)
jungle 1056 23.2 (21.9, 24.6) 20.6 (19.6, 21.7) 20.9 (20.1, 21.7) 14.7 (14.1, 15.4) 13.9 (12.4, 15.5)
3200 11.2 (10.7, 11.9) 13.3 (12.3, 14.3) 9.1 (8.8, 9.5) 6.9 (6.5, 7.2) 6.9 (5.5, 8.3)
random 1056 42.1 (40.5, 43.6) 43.1 (41.5, 44.9) 35.6 (34.6, 36.7) 34.6 (33.7, 35.7) 31.9 (30.4, 34.6)
3200 21.3 (20.4, 22.3) 30.7 (29.0, 32.5) 16.5 (15.8, 17.3) 17.0 (16.3, 17.7) 14.5 (13.6, 15.6)

F.2.2 AUSHD Tables

We present all AUSHD results with confidence intervals in Tables 8 and 9.

Table 9: AUSHD with 90% confidence intervals (in parentheses), for real-world data and for the low and regular data regimes ($N=1056$ and $N=3200$ respectively).
Graph N AIT CBED Random GIT (ours) GIT-privileged
alarm 1056 42.8 (41.8, 43.8) 36.8 (35.8, 37.8) 39.7 (38.6, 40.8) 28.8 (28.3, 29.3) 28.5 (27.0, 29.6)
3200 35.0 (33.6, 36.4) 31.6 (30.3, 33.1) 28.8 (27.6, 30.8) 24.0 (23.4, 24.9) 21.5 (20.7, 23.1)
asia 1056 3.6 (2.9, 4.5) 3.5 (2.8, 4.3) 2.0 (1.8, 2.1) 2.2 (2.0, 2.5) 1.8 (1.7, 1.9)
3200 2.4 (1.9, 3.3) 2.1 (1.9, 2.5) 1.3 (1.2, 1.4) 1.5 (1.4, 1.6) 1.1 (1.0, 1.2)
cancer 1056 2.0 (1.9, 2.1) 2.1 (2.0, 2.3) 2.4 (2.2, 2.6) 2.4 (2.2, 2.5) 2.1 (1.6, 2.6)
3200 1.8 (1.6, 2.0) 2.1 (1.9, 2.2) 2.2 (2.0, 2.3) 2.2 (2.0, 2.4) 2.2 (1.7, 2.6)
child 1056 14.4 (13.7, 15.2) 10.4 (9.6, 11.2) 11.1 (10.7, 11.6) 8.3 (8.0, 8.7) 7.9 (7.0, 9.0)
3200 7.8 (7.1, 8.6) 7.1 (6.5, 8.0) 5.0 (4.7, 5.5) 4.5 (4.2, 4.8) 3.9 (3.2, 4.7)
earthquake 1056 0.5 (0.4, 0.6) 0.5 (0.4, 0.6) 0.4 (0.3, 0.5) 0.6 (0.5, 0.7) 0.4 (0.2, 0.6)
3200 0.2 (0.1, 0.3) 0.2 (0.1, 0.2) 0.1 (0.1, 0.2) 0.3 (0.2, 0.5) 0.1 (0.1, 0.2)
sachs 1056 3.1 (2.9, 3.3) 2.9 (2.6, 3.1) 2.9 (2.7, 3.1) 2.5 (2.4, 2.7) 2.5 (2.2, 2.8)
3200 1.4 (1.3, 1.6) 1.9 (1.7, 2.2) 1.2 (1.1, 1.3) 1.1 (1.0, 1.3) 0.9 (0.8, 1.0)

F.2.3 SHD Tables

We present SHD results for the small and large data regimes, with confidence intervals, in Tables 10 and 11.

Table 10: SHD with 90% confidence intervals (in parentheses), for synthetic data and for the low and regular data regimes ($N=1056$ and $N=3200$ respectively).
Graph N AIT CBED Random GIT (ours) GIT-privileged
bidiag 1056 11.4 (10.3, 12.4) 10.1 (9.2, 11.0) 7.8 (7.0, 8.5) 6.3 (5.7, 7.0) 7.4 (6.2, 8.6)
3200 5.2 (4.2, 6.3) 7.8 (6.9, 8.7) 2.8 (2.3, 3.4) 2.4 (1.8, 2.9) 2.2 (0.8, 3.6)
chain 1056 5.6 (4.8, 6.4) 5.4 (4.6, 6.1) 4.3 (3.8, 4.9) 3.6 (3.0, 4.2) 3.6 (2.0, 4.8)
3200 3.2 (2.6, 3.7) 3.9 (3.4, 4.3) 2.2 (1.7, 2.6) 1.8 (1.3, 2.3) 1.8 (0.2, 2.6)
collider 1056 11.0 (10.1, 11.9) 11.8 (11.0, 12.7) 9.8 (9.1, 10.6) 13.3 (12.2, 14.4) 9.8 (7.6, 12.0)
3200 4.8 (3.8, 5.9) 7.9 (6.8, 8.9) 3.7 (2.8, 4.6) 9.7 (7.7, 11.6) 3.4 (1.4, 5.0)
fulldag 1056 64.4 (61.8, 67.0) 91.4 (86.8, 96.0) 52.1 (50.0, 54.3) 55.8 (53.4, 58.0) 53.4 (49.8, 57.0)
3200 32.0 (30.0, 33.8) 75.4 (71.8, 79.0) 25.1 (22.8, 27.2) 27.3 (25.1, 29.8) 20.8 (19.6, 21.8)
jungle 1056 10.4 (9.2, 11.6) 11.6 (10.1, 13.2) 5.7 (5.0, 6.5) 5.1 (4.4, 5.8) 5.2 (3.0, 7.4)
3200 3.5 (3.1, 3.9) 8.3 (7.2, 9.4) 1.9 (1.5, 2.3) 2.2 (1.8, 2.7) 3.0 (2.0, 4.0)
random 1056 18.8 (17.3, 20.3) 27.5 (25.6, 29.5) 11.3 (10.0, 12.5) 12.5 (11.3, 13.5) 11.0 (9.2, 13.0)
3200 8.3 (7.0, 9.4) 22.1 (19.6, 24.4) 5.0 (4.3, 5.8) 5.3 (4.4, 6.1) 3.8 (2.2, 5.4)
Table 11: SHD with 90% confidence intervals (in parentheses), for real-world data and for the low and regular data regimes ($N=1056$ and $N=3200$ respectively).
Graph N AIT CBED Random GIT (ours) GIT-privileged
alarm 1056 35.76 (34.04, 37.52) 28.44 (26.68, 30.16) 26.0 (24.71, 27.29) 19.84 (19.0, 20.68) 25.0 (23.2, 27.0)
3200 26.15 (24.15, 28.23) 24.33 (21.67, 27.0) 16.0 (14.57, 17.14) 20.0 (18.67, 21.33) 15.2 (14.6, 15.8)
asia 1056 2.0 (1.2, 2.68) 1.96 (1.44, 2.4) 0.96 (0.8, 1.12) 1.2 (1.0, 1.36) 1.2 (0.8, 1.4)
3200 1.56 (1.12, 1.92) 1.28 (1.0, 1.48) 0.88 (0.79, 1.0) 1.12 (0.96, 1.24) 0.8 (0.6, 1.2)
cancer 1056 1.72 (1.48, 2.0) 2.2 (2.0, 2.4) 2.28 (2.04, 2.48) 2.12 (1.84, 2.4) 2.2 (1.8, 2.4)
3200 1.8 (1.6, 2.0) 1.96 (1.72, 2.2) 1.84 (1.6, 2.12) 2.0 (1.76, 2.24) 2.4 (2.0, 2.8)
child 1056 7.32 (5.92, 8.68) 6.36 (5.52, 7.16) 3.52 (2.84, 4.2) 3.72 (3.2, 4.24) 2.8 (1.4, 4.0)
3200 3.2 (2.56, 3.8) 4.68 (3.8, 5.48) 1.04 (0.7, 1.35) 2.16 (1.8, 2.52) 1.8 (0.4, 3.0)
earthquake 1056 0.12 (0.0, 0.2) 0.12 (0.0, 0.2) 0.0 (0.0, 0.0) 0.24 (0.08, 0.36) 0.0 (0.0, 0.0)
3200 0.04 (-0.04, 0.08) 0.0 (0.0, 0.0) 0.0 (0.0, 0.0) 0.2 (0.08, 0.32) 0.0 (0.0, 0.0)
sachs 1056 0.84 (0.68, 1.0) 1.28 (0.96, 1.6) 0.6 (0.4, 0.8) 0.52 (0.32, 0.72) 0.4 (0.0, 0.8)
3200 0.48 (0.32, 0.64) 1.48 (1.16, 1.76) 0.24 (0.08, 0.36) 0.48 (0.28, 0.68) 0.0 (0.0, 0.0)

F.2.4 ENCO - Training Curves

We provide SHD training curves for main experiments in Figures 7 and 8.

Figure 7: Expected SHD metric for different acquisition methods on top of the ENCO framework, for synthetic graphs with 25 nodes. 95% bootstrap confidence intervals are shown.
Figure 8: Expected SHD metric for different acquisition methods on top of the ENCO framework, for graphs from BnLearn dataset. 95% bootstrap confidence intervals are shown.

F.3 ENCO - Large Interventional Batch Experiment

We provide SHD training curves for experiments with the large interventional batch in Figure 9.

Figure 9: Expected SHD metric for GIT with an interventional batch size of 1024 samples. 95% bootstrap confidence intervals are shown, and results were computed using 3 random seeds.

F.4 ENCO - Monte Carlo Sampling Evaluation

In Figure 10, we evaluate the performance of GIT for different numbers of graphs sampled from the model.

Figure 10: Expected SHD metric for GIT with different numbers of graph samples used to estimate the scores for interventions (see line 4 in Algorithm 2). 95% bootstrap confidence intervals are shown, and results were computed using 10 random seeds.

F.5 ENCO - Large Synthetic Graphs

In order to study the scalability of our method, we perform an additional evaluation on selected synthetic graphs in which we increase the number of nodes to 100. We compare the performance of different acquisition methods used with ENCO in Figure 11. We observe that GIT converges to lower SHD values in significantly fewer acquisition steps than all the other methods. This indicates that the advantage of GIT persists in the larger-graph regime.

Figure 11: Expected SHD metric for different acquisition methods on top of the ENCO framework, for synthetic graphs with 100 nodes. 95% bootstrap confidence intervals are shown.

F.6 ENCO - Correlation Scores

In Figure 12, we present the correlation of scores of the tested targeting methods. Importantly, the high correlation of GIT and GIT-privileged supports the hypothesis that imaginary gradients are a credible proxy of the true gradients and thus validates GIT. Otherwise, correlations are relatively small, suggesting that the studied methods use different decision mechanisms. Understanding this phenomenon is an interesting future research direction.

Figure 12: Spearman’s rank correlation of the scores produced by different acquisition methods, averaged over nodes.
Figure 13: Pearson correlation of the scores produced by different acquisition methods, averaged over nodes. We can see similar trends as in the case of Spearman’s rank correlation, in particular, a high correlation of GIT and GIT-privileged.
Figure 14: The histograms of chosen interventional targets in all data acquisition steps for different strategies computed on the real-world data.

Below, we provide more details about computing the correlations. Let us denote by $s_{b,i}^{m}$ the score produced by method $m$ for the batch $b$ and the node $i$. In order to eliminate effects such as changing score scales during the discovery process, we normalize the scores as $\bar{s}_{b,i}^{m} := \frac{s_{b,i}^{m}}{\sum_{j=1}^{N} s_{b,j}^{m}}$, where $N$ here denotes the number of nodes. For every pair of methods $m$, $m'$ and node $i$, we compute Spearman's rank correlation score $r_s(\bar{s}_{\cdot,i}^{m}, \bar{s}_{\cdot,i}^{m'})$.
We average over the nodes to get the scalar correlation value $\text{corr}(m,m') := \frac{\sum_{i=1}^{N} r_s(\bar{s}_{\cdot,i}^{m}, \bar{s}_{\cdot,i}^{m'})}{N}$.
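The computation described above can be sketched in a few lines of plain Python. This is an illustrative sketch rather than the code used for the figures: the layout `scores[b][i]` (batch index first, node index second) and all function names are assumptions, ties receive average ranks, and the correlation is undefined when a node's normalized score is constant across batches.

```python
import math


def normalize(scores):
    """Divide every score by the sum of scores over nodes in the same batch."""
    return [[s / sum(row) for s in row] for row in scores]


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in order[pos:end + 1]:
            result[k] = mean_rank
        pos = end + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def corr(scores_m, scores_m2):
    """Per-node Spearman correlation across batches, averaged over nodes."""
    a, b = normalize(scores_m), normalize(scores_m2)
    n_nodes = len(a[0])
    per_node = [
        spearman([row[i] for row in a], [row[i] for row in b])
        for i in range(n_nodes)
    ]
    return sum(per_node) / n_nodes


# Toy scores for 3 batches and 3 nodes (made-up numbers, for illustration only).
toy = [[1.0, 2.0, 3.0], [2.0, 2.0, 4.0], [3.0, 1.0, 1.0]]
print(corr(toy, toy))
```

Replacing `spearman` with `pearson` inside `corr` would give the Pearson variant of the same averaged score.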

In addition, we present Pearson's correlations in Figure 13. The conclusions drawn from the analysis of Spearman's rank correlation still hold; in particular, the correlation between GIT and GIT-privileged is high.

F.7 ENCO - Intervention Targets Distribution

Figure 15: The interventional target distribution for the alarm graph. The green color represents the edges for which there exists a graph in the Markov Equivalence Class that has the corresponding connection reversed. Black color is used to indicate nodes for which no data is collected. We may observe that each method intervenes on at least one node incident to the critical edges. However, both AIT and CBED do not converge for this dataset and struggle to achieve good results.

In this section, we provide additional histograms and plots with regard to the interventional target distributions obtained by different intervention methods as discussed in Section 5.4 in the main text.

In Figure 14, we present the histograms of the target distributions for the real graphs for each of the intervention acquisition methods. Note that those histograms represent the same information as the node coloring in Figure 4. It may be observed that the distributions obtained by GIT concentrate on fewer nodes than those obtained by the AIT and CBED approaches. The only exceptions are the sachs and child datasets, for which the entropy of the CBED approach is smaller (recall Figure 4). Note, however, that CBED underperforms on those graphs (recall Figure 3 in the main text or see Figure 8). This is in contrast to GIT, which maintains good performance.
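To make the notion of concentration concrete, the entropy of an empirical target distribution can be computed as sketched below. This is a minimal illustration, assuming the targets are logged as a list of node indices and using the natural logarithm; the function name `target_entropy` and the toy target lists are hypothetical and do not come from the experiments.

```python
import math
from collections import Counter


def target_entropy(targets, n_nodes):
    """Shannon entropy (in nats) of the empirical distribution of chosen targets."""
    counts = Counter(targets)
    total = len(targets)
    probs = [counts[i] / total for i in range(n_nodes)]
    # Nodes that were never selected contribute nothing to the entropy.
    return -sum(p * math.log(p) for p in probs if p > 0)


# Toy acquisition logs on a 4-node graph (made-up, for illustration only).
concentrated = [3, 3, 3, 3, 3, 3, 1, 1]
spread = [0, 1, 2, 3, 0, 1, 2, 3]

print(target_entropy(concentrated, 4))
print(target_entropy(spread, 4))
```

A strategy that always picks the same node has zero entropy, while a uniform choice over all nodes attains the maximum, the logarithm of the number of nodes.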

Finally, in Figure 15, we present the interventional target distribution on the alarm graph. We observe that each method intervenes on at least one node incident to the critical edges in the Markov Equivalence Class (as indicated by the green color in the plot). However, both AIT and CBED struggle to converge and perform poorly, as can be observed in Figure 8.

F.7.1 ENCO - Obtained Synthetic Graphs

Figure 16: The interventional target distributions obtained by different strategies on synthetic data. The probability is represented by the intensity of the node’s color. For clarity of the presentation, we choose not to color the critical edges in the corresponding Markov Equivalence Classes. This is because all edges of all the presented graphs would need to be colored. The only exception is the collider graph, which is alone in its Markov Equivalence Class.

In addition, we present the results obtained for the synthetic graphs in Figure 16 and the corresponding histograms in Figure 17. Note that in this case the results are also averaged over different ground-truth distributions, which means that any regularities in selecting the nodes stem from the graph structure rather than from the data distribution.

Interestingly, we may observe that for the jungle and chain graphs GIT often intervenes on the nodes which are the first ones in the topological order (as indicated by low node numbers in the plots). This is again intuitive, as intervening on those nodes can impact more variables lower in the hierarchy. In addition, note that for the chain graph, knowing its Markov Equivalence Class and setting the directionality of an edge automatically makes it possible to determine the directionality of edges for all subsequent nodes in the topological order. Hence intervening on the nodes which are the first ones in the ordering may convey more information and is desired.
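The chain argument can be illustrated with a small sketch. Because a chain has no colliders, its Markov Equivalence Class forbids creating one, so a known orientation propagates forward through the standard Meek rule R1 (if a -> b, b - c, and a, c are non-adjacent, then b -> c). The code below is an illustration of that rule only, not part of GIT or ENCO, and the function name `orient_with_meek_r1` is hypothetical.

```python
def orient_with_meek_r1(undirected, directed):
    """Repeatedly apply Meek's rule R1 until no more edges can be oriented.

    undirected: iterable of (u, v) pairs whose direction is unknown
    directed:   iterable of (u, v) pairs meaning u -> v
    Returns the final (directed, undirected) edge sets.
    """
    undirected = {frozenset(e) for e in undirected}
    directed = set(directed)

    def adjacent(u, v):
        return (
            frozenset((u, v)) in undirected
            or (u, v) in directed
            or (v, u) in directed
        )

    changed = True
    while changed:
        changed = False
        for a, b in list(directed):
            for edge in list(undirected):
                if b not in edge:
                    continue
                (c,) = edge - {b}
                # Orienting c -> b would create the collider a -> b <- c.
                if c != a and not adjacent(a, c):
                    undirected.remove(edge)
                    directed.add((b, c))
                    changed = True
    return directed, undirected


# Chain 0 - 1 - 2 - 3 - 4. Learning the first edge orients the whole chain.
print(orient_with_meek_r1([(1, 2), (2, 3), (3, 4)], [(0, 1)]))
# Learning a middle edge leaves the edges before it undetermined.
print(orient_with_meek_r1([(0, 1), (1, 2), (3, 4)], [(2, 3)]))
```

In the second call, only the edge between nodes 3 and 4 gets oriented, which matches the observation that the earliest nodes in the ordering are the most informative ones on a chain.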

We may also observe that CBED seems to focus only on the first nodes in the topological order, regardless of the data distribution, which in some graphs (such as the chain graph) may be desired, but in others seems to be an oversimplified solution. Note that CBED often struggles to converge, as may be observed in Figure 7.

Figure 17: The histograms of chosen interventional targets in all data acquisition steps for different strategies computed on the synthetic data.

F.7.2 Discussion on Small Real-World Graphs

We provide a more detailed discussion of the differences between the earthquake and cancer graph distributions and of how they affect the GIT method.

Consider Figure 4 in the main text. Note that the middle node in the earthquake graph corresponds to setting off a burglary alarm, an event that is very unlikely to happen in observational data but which, when it occurs, triggers a change in the distributions of the nodes lower in the hierarchy (see the conditional distributions in Table 12). The chance of setting off the alarm is also very high when a burglary has happened (the left-most node in the graph). Hence, GIT concentrates on those two nodes, as they have the largest impact on the entailed distribution.

A similar situation can be observed for the cancer graph, where the middle node corresponds to a binary variable indicating whether the illness develops. Even though the two parents of the cancer variable (pollution and smoke, represented by nodes 0 and 1, respectively) share a causal relationship with cancer, their impact on the cancer variable is limited. In other words, the chances of developing cancer, no matter whether one is subject to high or low pollution or is a smoker or not, remain rather small (see the conditional distributions for the cancer variable in Table 13). Hence, the only way in which one can gather more information about the impact of having cancer on the distributions of its child variables (nodes 3 and 4) is by performing an intervention. In consequence, it may be observed that GIT prefers to select nodes that allow gathering data that would otherwise be hard to acquire in the purely observational setting.
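How rare these events are in observational data can be checked by marginalizing over the parents. The sketch below is a worked illustration under stated assumptions: it uses only the probabilities of the True (or High) outcomes listed in Tables 12 and 13, takes each complementary probability as one minus that value, and relies on the two parents being independent root nodes. The function name `marginal_true` and the boolean encoding of the parent values are hypothetical.

```python
from itertools import product


def marginal_true(parent_priors, cpt):
    """P(child = True) for a child whose binary parents are independent roots.

    parent_priors: dict mapping parent name -> P(parent = True)
    cpt:           dict mapping a tuple of parent values -> P(child = True | parents)
    """
    names = list(parent_priors)
    total = 0.0
    for config in product([True, False], repeat=len(names)):
        weight = 1.0
        for name, value in zip(names, config):
            p_true = parent_priors[name]
            weight *= p_true if value else 1.0 - p_true
        total += weight * cpt[config]
    return total


# Earthquake graph (Table 12); tuple order is (Burglary, Earthquake).
p_alarm = marginal_true(
    {"Burglary": 0.01, "Earthquake": 0.02},
    {(True, True): 0.95, (True, False): 0.94,
     (False, True): 0.29, (False, False): 0.001},
)

# Cancer graph (Table 13); tuple order is (Pollution is High, Smoker).
p_cancer = marginal_true(
    {"PollutionHigh": 0.1, "Smoker": 0.3},
    {(True, True): 0.05, (True, False): 0.02,
     (False, True): 0.03, (False, False): 0.001},
)

print(round(p_alarm, 4), round(p_cancer, 4))
```

Under these assumptions, the alarm is on in roughly 1.6% of observational samples and cancer is present in roughly 1.2%, so the children of these nodes are almost always observed under a single parent value.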

Table 12: The conditional distribution in the earthquake graph.
Variable Parents Values Distribution
Burglary [True, False] [0.01, 0.99]
Earthquake [True, False] [0.02, 0.99]
Alarm Burglary=True, Earthquake=True [True, False] [0.95, 0.05]
Alarm Burglary=False, Earthquake=True [True, False] [0.29, 0.71]
Alarm Burglary=True, Earthquake=False [True, False] [0.94, 0.06]
Alarm Burglary=False, Earthquake=False [True, False] [0.001, 0.999]
John Calls Alarm=True [True, False] [0.9, 0.1]
John Calls Alarm=False [True, False] [0.05, 0.95]
Mary Calls Alarm=True [True, False] [0.7, 0.3]
Mary Calls Alarm=False [True, False] [0.01, 0.99]
Table 13: The conditional distribution in the cancer graph.
Variable Parents Values Distribution
Pollution [Low, High] [0.9, 0.1]
Smoker [True, False] [0.3, 0.7]
Cancer Pollution=Low, Smoker=True [True, False] [0.03, 0.97]
Cancer Pollution=High, Smoker=True [True, False] [0.05, 0.95]
Cancer Pollution=Low, Smoker=False [True, False] [0.001, 0.999]
Cancer Pollution=High, Smoker=False [True, False] [0.02, 0.98]
Xray Cancer=True [True, False] [0.9, 0.1]
Xray Cancer=False [True, False] [0.2, 0.8]
Dyspnoea Cancer=True [True, False] [0.65, 0.35]
Dyspnoea Cancer=False [True, False] [0.3, 0.7]

F.8 ENCO - Experiments with Pre-Initialization

Figure 18: Histograms of intervention targets chosen by GIT. The red color corresponds to the selected node, while the green color indicates the node’s parents. The edges on which standard initialization was used are indicated by gray dashed lines. The rest of the solution is given in the initialization.

In addition to the discussion on the target distributions in the case of pre-initializing parts of the graph with the ground truth solution (presented in the main text for synthetic graphs in Section 5.4), we present results of the same experiment computed on the real-world graphs. The results are presented in Figure 18.

As for the synthetic graphs, here we also observe that GIT concentrates either on the selected node $v$ or on its parents (denoted respectively by red and green colors in the plots).