On Counterfactual Interventions in Vector Autoregressive Models

This work was supported by the National Science Foundation under Awards 2021002 and 2212506.

Kurt Butler, Marija Iloska, and Petar M. Djurić Department of Electrical and Computer Engineering, Stony Brook University, Stony Brook, NY 11794, USA
[email protected]; [email protected]; [email protected]
Abstract

Counterfactual reasoning allows us to explore hypothetical scenarios in order to explain the impacts of our decisions. However, addressing such inquiries is impossible without establishing the appropriate mathematical framework. In this work, we introduce the problem of counterfactual reasoning in the context of vector autoregressive (VAR) processes. We also formulate the inference of a causal model as a joint regression task in which we use data both with and without interventions. After learning the model, we exploit linearity of the VAR model to make exact predictions about the effects of counterfactual interventions. Furthermore, we quantify the total causal effects of past counterfactual interventions. The source code for this project is freely available at https://github.com/KurtButler/counterfactual_interventions.

Index Terms:
causal models, counterfactuals, interventions, time series, vector autoregressive models

I Introduction

In many situations, we may be interested in “what-if” questions. As decisions in real life are rarely made with perfect knowledge of their effects, it is only natural that later one may question the optimality of their choices and ask how things would be if a different decision had been made. This process of reasoning about decisions and actions that contradict what actually occurred pertains to problems of counterfactual reasoning.

A counterfactual event is an event which does not agree with a particular outcome that was observed in a given experiment [1, p. 29]. For example, a student applying to a college might be rejected due to a low score on an entrance exam. A counterfactual question could then be “would this student have been accepted if their score were higher?” In asking this question, we are imagining a hypothetical scenario in which we modify the score of this student, and keep all other variables (i.e., the scores of all the other applicants) constant. In general, counterfactual questions may help us understand whether our actions are effective.

For the problem of detecting the presence of causal relationships and learning the graphical structure of a causal system, a large variety of approaches have been proposed, often arising from different fundamental principles [2, 3, 4, 5]. Many of these methods attempt to rely entirely on observational data, where the experimenter cannot intervene upon the system under study to probe its cause-and-effect relationships. While it is sometimes possible to detect causalities without intervention, observational approaches often require strong assumptions that are difficult to verify empirically or are otherwise restrictive. For instance, identification of a linear causal model from only observations is ill-posed if the noises are Gaussian [6, pp. 50-51]. Methods based on predictability, such as Granger causality [7], can deliver false positives when not all relevant variables are included in the model. Other methods may require strong assumptions; for example, convergent cross mapping [8] requires the existence of a dynamical attractor, which is difficult to test on small data sets [9]. In comparison, when it is possible to perform interventions, causal inference becomes considerably simpler, with several standard algorithms [3], [10]. Under ideal conditions, when experiments can be carried out perfectly and there are no hidden processes, all causal relationships in a system of $N$ variables can be detected with $N$ single-variable interventions [11].

To answer counterfactual questions, more information than just the direction of causalities is required: one must also have a functional model of how the effect is determined by all of its causes [1]. For the college student example, we would need to know the entrance exam scores of the other applicants to determine the functional model for acceptance and subsequently if the given student would be accepted, as their acceptance would depend on all of the scores.

In this work, we consider the counterfactual prediction problem: Given a particular realization of a causal system, i.e., a multivariate time series and a causal model that explains it, can we predict what we would have observed if a previous, hypothetical intervention occurred?

We present several contributions relevant to studying counterfactual events with time series. We introduce an approach to learning the causal structure of a system by leveraging multiple data sets with different interventions. We apply this approach to the setting of vector autoregressive (VAR) models. After learning the causal model, we address the problem of making counterfactual predictions. Due to linearity of the VAR model, we are able to make exact predictions without needing to estimate unobserved variables. Finally, we validate our approach using several examples.

II Problem Formulation

We consider the problem of modeling causal relationships from a multivariate time series $\mathbf{x}_t = [x_{1,t}, \ldots, x_{D,t}]^\top$. For each node $i$ and time instant $t$, we assume that the value of $x_{i,t}$ is determined by a function of the values of the system at previous time steps, $\mathbf{x}_{t-1}, \ldots, \mathbf{x}_{t-Q}$, up to some order $Q$, as well as a separate independent noise term $w_{i,t}$. As a simple example, we examine the vector autoregressive (VAR) model:

$$\mathbf{x}_t := \sum_{q=1}^{Q} \mathbf{B}_q \mathbf{x}_{t-q} + \mathbf{w}_t, \tag{1}$$

where the $\mathbf{B}_q$ are matrices of linear coefficients, and $\mathbf{w}_t \sim \mathcal{N}(\mathbf{0}, \mathbf{C})$, for some given noise covariance $\mathbf{C}$. The notation $:=$ expresses the direction of causality, where we consider (1) as a causal generative process.
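As a rough illustration of (1) as a generative process, the following sketch simulates a small VAR model using only the Python standard library. The function name `simulate_var`, the coefficient values, the zero initial conditions, and the choice $\mathbf{C} = \sigma^2 \mathbf{I}$ are assumptions made for this example and are not taken from the paper.

```python
import random

def simulate_var(B, T, noise_std=1.0, seed=0):
    """Simulate x_t := sum_q B[q] x_{t-q} + w_t with i.i.d. Gaussian noise.

    B is a list of Q coefficient matrices (D x D nested lists); B[0] is the
    lag-1 matrix. The noise covariance is taken to be noise_std**2 * I,
    which is an assumption of this sketch.
    """
    rng = random.Random(seed)
    Q, D = len(B), len(B[0])
    x = [[0.0] * D for _ in range(Q)]          # zero initial conditions
    for t in range(Q, Q + T):
        x_t = []
        for i in range(D):
            # Deterministic part: contributions of the Q previous states.
            mean = sum(B[q][i][j] * x[t - 1 - q][j]
                       for q in range(Q) for j in range(D))
            x_t.append(mean + rng.gauss(0.0, noise_std))
        x.append(x_t)
    return x[Q:]                               # drop the initial conditions

B1 = [[0.5, 0.0], [0.4, 0.3]]    # hypothetical lag-1 coefficients
B2 = [[-0.2, 0.0], [0.0, 0.1]]   # hypothetical lag-2 coefficients
series = simulate_var([B1, B2], T=200)
```

Fixing the seed fixes the noise realization, which is convenient later when comparing a factual trajectory with a counterfactual one.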

According to (1), the components of $\mathbf{x}_{t-1}, \mathbf{x}_{t-2}, \ldots, \mathbf{x}_{t-Q}$ exert causal influences on $\mathbf{x}_t$. It is traditional in causality to express these relationships in terms of a graph, although there are several options as to how this can be done. In Figure 1, we present one approach that treats each time instance as a separate vertex, with the purpose of understanding the temporal dependencies of the VAR model, which will later simplify the calculation and prediction of counterfactuals.

[Figure 1 diagram: a chain of vertices $\cdots$, $\mathbf{x}_{t-2}$, $\mathbf{x}_{t-1}$, $\mathbf{x}_t$, $\cdots$, with directed lag-one edges labeled $\mathbf{B}_1$ and lag-two edges labeled $\mathbf{B}_2$.]
Figure 1: A causal graph obtained from unfolding the causal relationships in (1) to show their dependencies in time, in the $Q = 2$ case. Each vertex represents a random vector, and each edge represents the presence of a cause-effect relationship.

To answer counterfactual queries about a system, we must first have a structural causal model (SCM) of the system [1]. An SCM is a hypothesis about the generative process that created a data set. The SCM represents each variable in the system as a function of its parents, a subset of variables directly influencing a given variable, along with an exogenous noise variable. In this way, the VAR model in (1) can be interpreted as an SCM after we declare that the equation explicitly describes how the value of $\mathbf{x}_t$ is determined.

Asserting that a VAR model is a causal model is a stronger statement than asserting it is a statistical model. A statistical model needs only to generate the correct joint probability distribution over the family of variables, but a causal model is a hypothesis about the generative process as it exists in reality. In the same way that multiple generative processes can represent the same probability distribution, there might be multiple SCMs that are compatible with a given data set. Additionally, different causal models will produce different answers to counterfactual questions, so care should be taken to learn the most accurate causal model before discussing counterfactuals. In the next section, we introduce interventions and how SCMs respond to them.

II-A Interventions

A major feature of causal models is that they are modularized, meaning that we may think of them as a set of individual components which can be changed or modified. In our example, each variable $x_{i,t}$ receives a value according to the formula

$$x_{i,t} := \sum_{q=1}^{Q} \mathbf{b}_{q,i}^\top \mathbf{x}_{t-q} + w_{i,t}, \tag{2}$$

where $\mathbf{b}_{q,i}^\top$ is the $i$-th row of $\mathbf{B}_q$, and $w_{i,t}$ is the $i$-th element of $\mathbf{w}_t$ in (1). An intervention at node $i$ and time $t$ is a modification to the causal model which replaces the function that assigns a value to $x_{i,t}$ with a new one. For example, we may intervene by replacing the function in (2) with a new one:

$$x_{i,t} := \sum_{q=1}^{Q} \widetilde{\mathbf{b}}_{q,i}^\top \mathbf{x}_{t-q} + \widetilde{\sigma} w_{i,t} + \widetilde{u}_{i,t}, \tag{3}$$

where $\widetilde{\mathbf{b}}_{q,i}$ are modified linear coefficients, $\widetilde{u}_{i,t}$ is a signal injected by the experimenter, and $\widetilde{\sigma} \in [0, 1]$ is a parameter that may be used to control the level of noise. Under ideal conditions, the experimenter's intervention perfectly controls the value of $x_{i,t}$ by tuning the parameters in (3) to achieve a desired effect.

In this paper, we assume that all interventions are able to perfectly control the value of the node intervened upon, so that $x_{i,t}$ satisfies (3) whenever an intervention on $x_{i,t}$ is being performed. While intervening on a node $i$, it is assumed that the functions governing other nodes in the system remain unchanged, which is a form of modularity of the system. In real life, there might be some uncertainty in the values of $\widetilde{\mathbf{b}}_{q,i}$ and $\widetilde{\sigma}$ when the intervention is performed, due to imperfect experimental conditions. An intervention can be applied for a period of time, during which we can actively observe the behavior of the rest of the system, i.e., the unperturbed nodes. We illustrate this process visually in Figure 2.

Data that we record while performing an intervention are called interventional, contrasting with observational data that we record while not altering the system under study. While performing an experiment with multiple interventions, we allocate recorded data into separate data sets according to the type of intervention being applied (cf. Figure 2). In Section III-A, we introduce a method to leverage multiple data sets to improve estimation of $\mathbf{B}_q$ over purely observational data. The caveat is that this approach is only possible when obtaining interventional data is feasible.
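The following sketch illustrates one way a perfect intervention and the resulting allocation of time indices could be simulated. It assumes a VAR model with a single lag and three nodes (smaller than the system in Figure 2), and it realizes the intervention as the special case of (3) with zero coefficients, $\widetilde{\sigma} = 0$, and $\widetilde{u}_{i,t}$ equal to the forced value. The function name, the coefficient matrix, and the dictionary-based schedule are hypothetical choices for illustration.

```python
import math
import random

def simulate_with_interventions(B, T, schedule, noise_std=1.0, seed=0):
    """Simulate a VAR(1) model x_t := B x_{t-1} + w_t for t = 1..T.

    `schedule` maps a time t to (node, value): at that time the structural
    equation of the node is replaced by x_{node,t} := value. Returns the
    series (x[0] is the initial condition) and the sets of time indices.
    """
    rng = random.Random(seed)
    D = len(B)
    x = [[0.0] * D]
    index_sets = {i: [] for i in range(D + 1)}   # key 0: no intervention
    for t in range(1, T + 1):
        x_t = [sum(B[i][j] * x[t - 1][j] for j in range(D))
               + rng.gauss(0.0, noise_std) for i in range(D)]
        if t in schedule:
            node, value = schedule[t]            # node is 1-based, as in the text
            x_t[node - 1] = value                # only this module is replaced
            index_sets[node].append(t)
        else:
            index_sets[0].append(t)
        x.append(x_t)
    return x, index_sets

# Hypothetical coefficients; the schedule mirrors the stimulus described in Figure 2.
B = [[0.5, 0.0, 0.0], [0.4, 0.3, 0.0], [0.0, 0.5, 0.2]]
schedule = {t: (1, 4 * math.sin(t / 2)) for t in range(101, 201)}
schedule.update({t: (2, 4 * math.sin(t / 2)) for t in range(301, 401)})
x, T_sets = simulate_with_interventions(B, 500, schedule)
```

Here `T_sets[0]` collects the observational time points, while `T_sets[1]` and `T_sets[2]` collect the time points at which the first and second nodes were driven.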

Figure 2: Interventions applied to a multivariate time series. The underlying causal model is a VAR model, (1), with $Q = 2$ and $D = 5$. Two separate interventions were performed during the experiments in which we drove a node in the system with a sinusoidal stimulus. For $t = 101, \ldots, 200$, $x_{1,t}$ (black curve) was forced to take on the value $4\sin(t/2)$, and for $t = 301, \ldots, 400$, $x_{2,t}$ (red curve) was set to $4\sin(t/2)$. After recording, we partitioned the trial into three data sets, $\mathcal{T}_0, \mathcal{T}_1, \mathcal{T}_2$, according to the intervention periods. Explicitly, $\mathcal{T}_1 = \{101, \ldots, 200\}$, $\mathcal{T}_2 = \{301, \ldots, 400\}$, and $\mathcal{T}_0 = \{1, \ldots, 500\} \setminus (\mathcal{T}_1 \cup \mathcal{T}_2)$.

II-B The problem of predicting counterfactuals

Having introduced SCMs and interventions, we may now formulate counterfactuals in our framework. In short, a counterfactual is a hypothetical intervention. It is not an intervention that occurred in reality, but rather one whose behavior we can predict, having already observed a realization of the system.

Now that the necessary causal background has been introduced, we can state the primary problem of this work. Consider an experiment in which we observe a multivariate signal $\mathbf{x}_t$ until a certain time $T$, and suppose that we have assumed a particular SCM for $\mathbf{x}_t$. Now assume we are interested in a hypothetical intervention that would have occurred in the past, and want to predict what we would have observed in this hypothetical scenario. That is,

$$\textbf{Given: } \text{A particular realization } \{\mathbf{x}_t \mid t = 1, \ldots, T\} \text{ and an SCM for } \mathbf{x}_t, \tag{4}$$
$$\textbf{Predict: } \widetilde{\mathbf{x}}_t, \tag{5}$$

where $\widetilde{\mathbf{x}}_t$ are the signals in the case that we performed a hypothetical intervention upon $x_{i,t}$ for some $t$ in the range $1 < T_1 \leq t \leq T_2 < T$, and for a fixed $i$. Here we make an important assumption that all conditions of the experiment, such as the particular realization of the noise process $\mathbf{w}_t$, remain the same in both cases. This is necessary because, in considering the counterfactual problem, asserting that all other conditions were the same implies that the noise, too, should have the same realization.

In general models, the values of $\mathbf{w}_t$ need to be estimated before counterfactual analysis can be performed. However, this estimation is straightforward in any additive noise model since $\mathbf{w}_t$ enters the formulas additively. In our approach, we exploit linearity of the system to circumvent the need to estimate $\mathbf{w}_t$ at all because the influence of the noise will cancel out.
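To make the role of the noise concrete, the sketch below follows the generic route for additive-noise models: recover the noise realization from the observed data, then replay the model with the same noise and a single forced value. It is not the shortcut proposed in this paper, which avoids estimating the noise. The sketch assumes a VAR model with one lag and known coefficients; the function name, the 0-based node index, and the toy numbers are illustrative assumptions.

```python
def counterfactual_by_abduction(B, x, node, t_int, value):
    """Replay a VAR(1) realization with x_{node, t_int} forced to `value`.

    x[0] is the initial condition and x[1..T] is the observed realization,
    assumed to follow x_t := B x_{t-1} + w_t with B known. `node` is 0-based.
    """
    D = len(B)

    def step(prev):
        return [sum(B[i][j] * prev[j] for j in range(D)) for i in range(D)]

    # Recover the noise realization, which is possible because it is additive.
    w = [None] + [[x[t][i] - m for i, m in enumerate(step(x[t - 1]))]
                  for t in range(1, len(x))]
    # Rerun the model with the same noise and the hypothetical intervention.
    x_cf = [list(x[0])]
    for t in range(1, len(x)):
        x_t = [m + w[t][i] for i, m in enumerate(step(x_cf[t - 1]))]
        if t == t_int:
            x_t[node] = value
        x_cf.append(x_t)
    return x_cf

B = [[0.5, 0.0], [0.4, 0.3]]                       # hypothetical coefficients
x = [[0.0, 0.0], [1.0, 2.0], [0.5, -1.0], [2.0, 0.3], [1.0, 1.0]]
x_cf = counterfactual_by_abduction(B, x, node=0, t_int=2, value=3.0)
# One step after the intervention, the change in the trajectory is B times
# the change applied at t_int; the recovered noise does not appear in it.
diff = [a - b for a, b in zip(x_cf[3], x[3])]
```

In this toy case the first node is changed by 2.5 at the intervention time, and `diff` matches the first column of `B` scaled by 2.5, which is the kind of cancellation that linearity provides.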

III Proposed Solution

Before making counterfactual predictions, it is essential to first learn the causal structure of the system. To learn the parameters of the causal model in a way that leverages both observational and interventional data, we propose a joint regression task that allows us to infer the parameters of each module shared across data sets. We make the tacit assumption that these interventions are ideal, and that the parameters of the model are unchanged when they are not related to the intervention. We now introduce some notation. If at time $t$ we perform an intervention to control the value of $x_{i,t}$, then the time point $t$ should be allocated to a set of time indices $\mathcal{T}_i$. If no intervention was performed, then the index $t$ of $\mathbf{x}_t$ is allocated to the index set $\mathcal{T}_0$. We assume that only one node can be intervened upon per time step $t$, so $\mathcal{T}_i \cap \mathcal{T}_j = \emptyset$ whenever $i \neq j$. We define

$$\mathcal{T}_{-i} = \left( \bigcup_{j=0}^{D} \mathcal{T}_j \right) \setminus \mathcal{T}_i \tag{6}$$

to be the set of time indices at which $x_i$ was not subject to intervention. It is permissible that some index sets may be empty, $\mathcal{T}_j = \emptyset$. In principle, these sets could be partitions of a single recording trial, or obtained from multiple recording trials, with no major modification to our approach.
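The index-set bookkeeping in (6) can be sketched in a few lines. The representation below, a dictionary mapping each time index to the label of the intervened node (with 0 meaning no intervention), is an assumption made for this example.

```python
def index_sets(labels, D):
    """Build T_j for j = 0..D and T_{-i} for i = 1..D.

    labels[t] is 0 if no intervention happened at time t, and otherwise
    the (1-based) node that was intervened upon at time t.
    """
    T = {j: {t for t, lab in labels.items() if lab == j} for j in range(D + 1)}
    all_times = set(labels)                       # union of all T_j
    T_minus = {i: all_times - T[i] for i in range(1, D + 1)}
    return T, T_minus

labels = {t: 0 for t in range(1, 11)}             # ten observational time points
labels.update({4: 1, 5: 1, 8: 2})                 # node 1 at t = 4, 5; node 2 at t = 8
T, T_minus = index_sets(labels, D=3)
```

Because each time index carries a single label, the sets are disjoint by construction, and a node that was never intervened upon (node 3 here) has an empty $\mathcal{T}_3$ and a $\mathcal{T}_{-3}$ containing every time index.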

III-A Causal structure learning

For clarity in presentation, we will first describe the case of one lag ($Q = 1$), and from there we extend to the multiple lag case.

SINGLE LAG

The goal is to learn the linear coefficient matrix $\mathbf{B}$ of the true causal model jointly across data sets by minimizing the following objective,

$$\min_{\mathbf{B}} \sum_{t \in \mathcal{T}_0} \|\mathbf{x}_t - \mathbf{B}\mathbf{x}_{t-1}\|_2^2 + \sum_{i=1}^{D} \sum_{t \in \mathcal{T}_i} \|\mathbf{x}_t^{[-i]} - \mathbf{B}^{[-i]}\mathbf{x}_{t-1}\|_2^2, \tag{7}$$

where the operation $\mathbf{A}^{[-i]}$ removes row $i$ from the matrix $\mathbf{A}$.

Using the identity $\|\mathbf{x}\|_2^2 = x_1^2 + \cdots + x_D^2$, we can simplify this optimization task considerably. Recall that whenever we intervene on a particular variable, all other linear coefficients remain unchanged. Thus, the function that assigns a value to $x_{i,t}$ is given by

$$x_{i,t} := \mathbf{b}_i^\top \mathbf{x}_{t-1} + w_{i,t}$$

whenever $t \notin \mathcal{T}_i$, or equivalently, when $t \in \mathcal{T}_{-i}$. These considerations lead us to propose the following objective, which is equivalent to (7):

$$\min_{\mathbf{B}} \sum_{i=1}^{D} \sum_{t \in \mathcal{T}_{-i}} \|x_{i,t} - \mathbf{b}_i^\top \mathbf{x}_{t-1}\|_2^2, \tag{8}$$

where, again, $\mathbf{b}_i^\top$ is the $i$-th row of $\mathbf{B}$.
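A minimal sketch of how the row-wise objective in (8) could be solved is given below: each row of $\mathbf{B}$ is fit by ordinary least squares using only the time points at which that node was not intervened upon. The normal-equation solver, the function names, the 0-based node indices, and the toy data are assumptions of this example; a numerical library would normally replace the hand-written solver.

```python
import math
import random

def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[k]] for k, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    z = [0.0] * n
    for r in reversed(range(n)):
        z[r] = (M[r][n] - sum(M[r][k] * z[k] for k in range(r + 1, n))) / M[r][r]
    return z

def fit_rows(x, intervened):
    """Estimate B row by row, in the spirit of (8).

    x[t] is the state at time t (x[0] is the initial condition), and
    intervened[t] is the 0-based node forced at time t, or None.
    Row i uses only the times at which node i was not intervened upon.
    """
    D = len(x[0])
    B_hat = []
    for i in range(D):
        times = [t for t in range(1, len(x)) if intervened[t] != i]
        # Normal equations: (sum x_{t-1} x_{t-1}^T) b_i = sum x_{t-1} x_{i,t}
        G = [[sum(x[t - 1][a] * x[t - 1][b] for t in times) for b in range(D)]
             for a in range(D)]
        g = [sum(x[t - 1][a] * x[t][i] for t in times) for a in range(D)]
        B_hat.append(solve(G, g))
    return B_hat

# Toy data: a hypothetical two-node VAR(1) with node 0 driven for 500 steps.
rng = random.Random(1)
B = [[0.5, 0.0], [0.4, 0.3]]
x, intervened = [[0.0, 0.0]], [None]
for t in range(1, 2001):
    x_t = [sum(B[i][j] * x[-1][j] for j in range(2)) + rng.gauss(0.0, 1.0)
           for i in range(2)]
    forced = 0 if 500 < t <= 1000 else None
    if forced is not None:
        x_t[forced] = 4 * math.sin(t / 2)
    x.append(x_t)
    intervened.append(forced)
B_hat = fit_rows(x, intervened)
```

Note that the intervened samples are excluded only as regression targets for the driven node; they still serve as regressors for every row, which is where the interventional data contribute extra excitation.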

The revised objective in (8) has multiple advantages. Firstly, it can be solved for each variable $i$ separately as a distinct subproblem. Secondly, it makes it clear that the minimum mean squared error estimate of $\mathbf{B}$ can be obtained using classical methods. If we assume some level of sparsity in $\mathbf{B}$, then we can use LASSO [12], or $\ell_1$ regularization, to get a sparse estimate:

$$\min_{\mathbf{B}} \sum_{i=1}^{D} \sum_{t \in \mathcal{T}_{-i}} \|x_{i,t} - \mathbf{b}_i^\top \mathbf{x}_{t-1}\|_2^2 + \lambda \|\mathbf{B}\|_1, \tag{9}$$

where $\lambda$ is a regularization parameter.
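The regularized objective also appears to separate across rows. Assuming that $\|\mathbf{B}\|_1$ denotes the entrywise $\ell_1$ norm (the sum of absolute values of all entries, which is the usual LASSO penalty but is an assumption here), the penalty splits over the rows of $\mathbf{B}$, so (9) decomposes into $D$ independent LASSO problems:

```latex
\|\mathbf{B}\|_1 = \sum_{i=1}^{D} \|\mathbf{b}_i\|_1
\quad\Longrightarrow\quad
\min_{\mathbf{b}_i} \sum_{t \in \mathcal{T}_{-i}}
  \left( x_{i,t} - \mathbf{b}_i^\top \mathbf{x}_{t-1} \right)^2
  + \lambda \|\mathbf{b}_i\|_1,
\qquad i = 1, \ldots, D.
```

Under this reading, each row can be handed to any standard LASSO solver, with the rows of the design matrix restricted to $t \in \mathcal{T}_{-i}$.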

MULTIPLE LAGS

To generalize to the case in which $Q > 1$, we only need to modify the approach slightly. Consider the model in (1). If we define

$$\boldsymbol{\Theta} = \begin{bmatrix} \mathbf{B}_1 & \cdots & \mathbf{B}_Q \end{bmatrix}, \tag{10}$$

and

$$\bar{\mathbf{x}}_t = \begin{bmatrix} \mathbf{x}_{t-1}^\top & \cdots & \mathbf{x}_{t-Q}^\top \end{bmatrix}^\top, \tag{11}$$

then we can express (1) as

$$\mathbf{x}_t := \boldsymbol{\Theta} \bar{\mathbf{x}}_t + \mathbf{w}_t. \tag{12}$$
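The stacking in (10) through (12) can be checked numerically. The sketch below builds $\boldsymbol{\Theta}$ by concatenating the lag matrices column-wise and $\bar{\mathbf{x}}_t$ by stacking the lagged vectors into one long vector, then compares $\boldsymbol{\Theta}\bar{\mathbf{x}}_t$ with the sum over lags in (1). The helper names and the numerical values are illustrative assumptions.

```python
def stack_coefficients(B_list):
    """Theta = [B_1 ... B_Q]: concatenate the lag matrices column-wise (D x QD)."""
    D = len(B_list[0])
    return [[b for Bq in B_list for b in Bq[i]] for i in range(D)]

def stack_lags(x, t, Q):
    """x_bar_t: the vectors x_{t-1}, ..., x_{t-Q} stacked into one list of length QD."""
    return [v for q in range(1, Q + 1) for v in x[t - q]]

def matvec(A, v):
    """Matrix-vector product for nested lists."""
    return [sum(a * b for a, b in zip(row, v)) for row in A]

B1 = [[0.5, 0.0], [0.4, 0.3]]     # hypothetical lag-1 coefficients
B2 = [[-0.2, 0.0], [0.0, 0.1]]    # hypothetical lag-2 coefficients
x = [[1.0, 2.0], [0.5, -1.0], [2.0, 0.3]]

Theta = stack_coefficients([B1, B2])
x_bar = stack_lags(x, t=2, Q=2)
stacked = matvec(Theta, x_bar)                       # Theta @ x_bar_t
direct = [a + b for a, b in zip(matvec(B1, x[1]),    # B_1 x_{t-1} + B_2 x_{t-2}
                                matvec(B2, x[0]))]
```

The two results agree up to floating-point rounding, so the single-lag machinery can be reused with $\mathbf{x}_{t-1}$ replaced by the stacked regressor.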

Now we can formulate a similar optimization task as before, giving us the multiple lag version of (7):

min𝚯t𝒯0𝐱t𝚯𝐱¯t22+i=1Dt𝒯i𝐱t[i]𝚯[i]𝐱¯t22,subscript𝚯subscript𝑡subscript𝒯0superscriptsubscriptnormsubscript𝐱𝑡𝚯subscript¯𝐱𝑡22superscriptsubscript𝑖1𝐷subscript𝑡subscript𝒯𝑖superscriptsubscriptnormsuperscriptsubscript𝐱𝑡delimited-[]𝑖superscript𝚯delimited-[]𝑖subscript¯𝐱𝑡22\min_{\boldsymbol{\Theta}}\sum_{t\in\mathcal{T}_{0}}||\mathbf{x}_{t}-% \boldsymbol{\Theta}\bar{\mathbf{x}}_{t}||_{2}^{2}+\sum_{i=1}^{D}\sum_{t\in% \mathcal{T}_{i}}||\mathbf{x}_{t}^{[-i]}-\boldsymbol{\Theta}^{[-i]}\bar{\mathbf% {x}}_{t}||_{2}^{2},roman_min start_POSTSUBSCRIPT bold_Θ end_POSTSUBSCRIPT ∑ start_POSTSUBSCRIPT italic_t ∈ caligraphic_T start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT | | bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT - bold_Θ over¯ start_ARG bold_x end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | | start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_D end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_t ∈ caligraphic_T start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT | | bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT [ - italic_i ] end_POSTSUPERSCRIPT - bold_Θ start_POSTSUPERSCRIPT [ - italic_i ] end_POSTSUPERSCRIPT over¯ start_ARG bold_x end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | | start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ,

which can again be split into tractable sub-problems. Accounting for sparsity, the multiple lag version of (9) is

$$\min_{\boldsymbol{\Theta}}\sum_{i=1}^{D}\sum_{t\in\mathcal{T}_{-i}}\|x_{i,t}-\boldsymbol{\theta}_{i}^{\top}\bar{\mathbf{x}}_{t}\|_{2}^{2}+\lambda\|\boldsymbol{\Theta}\|_{1}, \tag{13}$$

where $\boldsymbol{\theta}_{i}^{\top}$ is the $i$-th row of $\boldsymbol{\Theta}$. The use of sparsity constraints becomes increasingly relevant when considering multiple lags, since strongly autocorrelated signals can have multiple distinct models with similar predictive power [13].
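As a rough illustration of how (13) splits into per-row problems, the following Python sketch fits each row $\boldsymbol{\theta}_{i}^{\top}$ by ordinary least squares, dropping the time steps at which $x_{i,t}$ was intervened upon. It assumes that $\bar{\mathbf{x}}_{t}$ is the stacked vector of the $Q$ most recent samples and that $\mathcal{T}_{-i}$ collects the times at which variable $i$ was not intervened upon. It also omits the $\ell_1$ penalty (that is, $\lambda=0$) to stay short. The names `lagged_regressors` and `fit_rows` and the boolean `intervened` mask are hypothetical and are not taken from the released code.

```python
import numpy as np

def lagged_regressors(X, Q):
    """Stack x_{t-1}, ..., x_{t-Q} into one regressor vector per time step.

    X has shape (T, D). The result has shape (T - Q, Q * D), and its row r
    corresponds to time t = r + Q (0-based).
    """
    T = X.shape[0]
    return np.hstack([X[Q - q:T - q] for q in range(1, Q + 1)])

def fit_rows(X, intervened, Q):
    """Least-squares estimate of Theta = [B_1 ... B_Q], one row at a time.

    intervened is a boolean (T, D) array; intervened[t, i] is True when
    x_{i,t} was set by the experimenter. Those samples are dropped from the
    regression for row i only (assumed reading of the index set T_{-i}).
    No L1 penalty is applied in this sketch.
    """
    Xbar = lagged_regressors(X, Q)
    targets = X[Q:]
    keep = ~intervened[Q:]
    D = X.shape[1]
    Theta = np.zeros((D, Q * D))
    for i in range(D):
        rows = keep[:, i]  # time steps usable for target variable i
        Theta[i], *_ = np.linalg.lstsq(Xbar[rows], targets[rows, i], rcond=None)
    return Theta
```

Note that intervened samples are only excluded as targets for their own row; they still appear as regressors for the other rows, which is where the interventional data can help.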

III-B Estimation of total causal effects

We now consider the problem of estimating the effects of a past counterfactual intervention. Suppose that we have observed $\mathbf{x}_{t}$ up until some time $T$. For some $t<T$, consider an intervention that modifies $x_{i,t}$ to take on a new value $x_{i,t}^{*}$. If we define $\Delta x_{i,t}=x_{i,t}^{*}-x_{i,t}$, then we can reason about the hypothetical intervention as an additive change in $x_{i,t}$:

$$x_{i,t}\longrightarrow x_{i,t}+\Delta x_{i,t}.$$

Since the VAR model is linear, the additive change in $x_{i,t}$ induces an additive change in all variables “downstream” from it in the graph. For example, if we move one step into the future, then $\mathbf{x}_{t+1}$ changes according to

$$\mathbf{x}_{t+1}\longrightarrow\mathbf{x}_{t+1}+\mathbf{B}_{1}\mathbf{e}_{i}\Delta x_{i,t},$$

where $\mathbf{e}_{i}$ is the $i$-th unit vector in $\mathbb{R}^{D}$. We observe that the linear model coefficients in $\mathbf{B}_{1}$ measure the sensitivity of $\mathbf{x}_{t+1}$ to changes in $\mathbf{x}_{t}$, and can be interpreted as a measure of the strength of the causal relationship [14].

If we perform an intervention at time $t$ and attempt to predict its effect at time $t+k$, we need to propagate the changes through the graph in Figure 1, that is, through each intermediate time step. When $Q>1$, the path in the graph from $\mathbf{x}_{t}$ to $\mathbf{x}_{t+k}$ is not unique, and the total sensitivity of $\mathbf{x}_{t+k}$ to $\mathbf{x}_{t}$ is given by a sum over all paths in the graph [15, 16]. Since the VAR model is time-invariant, the total sensitivity is a matrix $\mathbf{T}_{k}$ which only depends upon $k$. We refer to this matrix as a total causal effect matrix.

To compute the total causal effect matrix for the VAR model, we can exploit recursion. Suppose that $k>Q$. Any path from $\mathbf{x}_{t}$ to $\mathbf{x}_{t+k}$ must go through a node between $\mathbf{x}_{t+k-Q}$ and $\mathbf{x}_{t+k-1}$. As a result, the total causal effect decomposes according to the same rule:

$$\mathbf{T}_{k}=\mathbf{B}_{1}\mathbf{T}_{k-1}+\mathbf{B}_{2}\mathbf{T}_{k-2}+\cdots+\mathbf{B}_{Q}\mathbf{T}_{k-Q}. \tag{14}$$

If $k\leq Q$, then a similar rule holds, but we define $\mathbf{T}_{0}=\mathbf{I}$ to be the identity matrix, and $\mathbf{T}_{k}=0$ for any $k<0$, since causation can only move forward in time.
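The recursion (14), together with the boundary conditions $\mathbf{T}_{0}=\mathbf{I}$ and $\mathbf{T}_{k}=0$ for $k<0$, can be sketched in a few lines of Python. The function name `total_effects` and the list-of-matrices input are illustrative assumptions and are not taken from the released code.

```python
import numpy as np

def total_effects(B, K):
    """Return the list [T_0, T_1, ..., T_K] of total causal effect matrices.

    B is a sequence [B_1, ..., B_Q] of (D, D) lag matrices.
    """
    D = B[0].shape[0]
    T = [np.eye(D)]                      # T_0 = I
    for k in range(1, K + 1):
        Tk = np.zeros((D, D))
        for q, Bq in enumerate(B, start=1):
            if k - q >= 0:               # terms with T_j, j < 0, vanish
                Tk += Bq @ T[k - q]
        T.append(Tk)
    return T
```

For a single additive change $\Delta x_{i,t}$, the predicted change in $\mathbf{x}_{t+k}$ would then be `T[k][:, i] * dx`, which corresponds to $\mathbf{T}_{k}\mathbf{e}_{i}\Delta x_{i,t}$.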

IV Experiments

In this section, we demonstrate our approach to modeling counterfactuals with two examples. First, we demonstrate how joint regression can be leveraged to more accurately learn the causal model. Second, we explore an example in which we study the effects of a past intervention on future, forecasted events. The results presented are from a single run of each experiment.

Figure 3: Learning the causal model of a small system. The performance is quantified by the MSE of the estimated entries of $\mathbf{B}_{q}$, defined by $\text{MSE}=\frac{1}{D^{2}}\sum_{i,j}(b_{q,i,j}-\hat{b}_{q,i,j})^{2}$.

IV-A Learning causal models with joint regression

We simulated a time series of length $T=390$ and system dimension $D=5$ with lag $Q=2$. Each element $b_{q,i,j}$ of the coefficient matrices $\mathbf{B}_{q}$ was generated as $b_{q,i,j}\sim\mathcal{U}(-0.5,0.5)$, for $q=1,2$. We set about $30\%$ sparsity for each matrix. The dynamics were perturbed with Gaussian noise $\mathbf{w}_{t}\sim\mathcal{N}(\boldsymbol{0},\mathbf{C})$, where the covariance $\mathbf{C}$ is a Toeplitz matrix with $C_{ii}=1$, $C_{ij}=0.5$ if $|i-j|=1$, and $0$ otherwise. We intervened on the node $x_{1,\tau}$ for $\tau\in\{101,\ldots,170\}$ and on the node $x_{2,\tau}$ for $\tau\in\{201,\ldots,270\}$.
For this example, we let the intervened nodes be driven by a sine wave, specifically $x_{i,\tau}:=\widetilde{u}_{i,\tau}$, where $\widetilde{u}_{i,\tau}=10\sin(\frac{\tau}{2})$, for $i=1,2$ in their respective time intervals. The rest of the nodes ($i=3,4,5$) were left unperturbed. We learned the $\mathbf{B}$ coefficients in two ways: i) by using only the observational (unintervened) data $\mathcal{T}_{0}$, and ii) by using the intervened data as well, which is the proposed method. To ensure a fair comparison, we used the same number of data points, $T_{0}=250$, for learning in both i) and ii), where in the case of ii), $140$ of those points come from the interventional dataset. With the current setup, the proposed method is at a slight disadvantage with regard to the amount of data used, since we lose $140$ points for learning the coefficients $\mathbf{b}_{q,i}$ for $i=1,2$, $q=1,2$. However, those same $140$ points of intervention allow us to better estimate the coefficients $\mathbf{b}_{q,i}$ for $i=3,4,5$.
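The simulation setup described above could be reproduced roughly as follows. This is a minimal sketch under stated assumptions: the helper names `draw_coefficients` and `simulate` are hypothetical, the sparsity is imposed by zeroing each entry independently with probability $0.3$, and the redraw-until-stable loop is our assumption, since the text does not say how stability of the VAR process was ensured.

```python
import numpy as np

def draw_coefficients(rng, D=5, Q=2, sparsity=0.3):
    """Draw B_1..B_Q with U(-0.5, 0.5) entries and about 30% zeros.

    Redrawing until the companion matrix has spectral radius < 1 is an
    assumption of this sketch, not a step described in the text.
    """
    while True:
        B = rng.uniform(-0.5, 0.5, size=(Q, D, D))
        B[rng.random((Q, D, D)) < sparsity] = 0.0
        companion = np.zeros((Q * D, Q * D))
        companion[:D] = np.hstack(list(B))
        companion[D:, :-D] = np.eye((Q - 1) * D)
        if np.max(np.abs(np.linalg.eigvals(companion))) < 1.0:
            return B

def simulate(rng, B, T=390):
    """Simulate the VAR process with sine-wave interventions on x_1 and x_2."""
    Q, D, _ = B.shape
    # Toeplitz covariance: 1 on the diagonal, 0.5 on the first off-diagonals.
    C = np.eye(D) + 0.5 * (np.eye(D, k=1) + np.eye(D, k=-1))
    W = rng.multivariate_normal(np.zeros(D), C, size=T)
    X = np.zeros((T, D))
    intervened = np.zeros((T, D), dtype=bool)
    intervened[100:170, 0] = True   # x_1 for tau = 101..170 (1-based)
    intervened[200:270, 1] = True   # x_2 for tau = 201..270 (1-based)
    for t in range(T):
        for q in range(1, Q + 1):
            if t - q >= 0:
                X[t] += B[q - 1] @ X[t - q]
        X[t] += W[t]
        tau = t + 1                 # 1-based time index used in the text
        X[t, intervened[t]] = 10.0 * np.sin(tau / 2.0)
    return X, intervened
```

With these settings, 140 of the 390 time steps carry an intervention and 250 are purely observational, matching the counts quoted in the text.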

In Figure 3, we show the true matrices $\mathbf{B}_{q}$, for $q=1,2$, their estimate using $\mathcal{T}_{0}$ only, and their estimate with the proposed method, i.e., using interventional data as well. We can see there is a drop in MSE when using the proposed method, which is also reflected visually in the heat maps. The effect becomes even more prominent when learning larger systems.

IV-B Predicting the effects of a counterfactual intervention

In this experiment, we consider the problem of predicting the effects of a counterfactual intervention in the past. Consider a multivariate time series of length $T=100$ and dimension $D=2$. We assume that the causal model is given by a VAR model, (1), with $Q=2$, whose parameters are known from prior investigation. Let us consider a counterfactual situation in which, from time $t=40$ until the end of the experiment, the experimenter intervenes upon $x_{1,t}$ by driving the signal with a new signal, $u(t)\sim\mathcal{N}(1,1)$.

In Figure 4, we predict the effects of the counterfactual intervention on $x_{1,t}$, as perceived by another variable $x_{2,t}$. We plot the signal $x_{2,t}$ that we observed originally alongside its variant in an “alternate universe,” in which the counterfactual intervention is performed. For times $t<40$, the observed time series and the counterfactual one coincide, as no intervention has occurred yet. For $t\geq 40$, the two time series diverge due to the effects of the intervention. Again, the model anticipates the counterfactual universe well.
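One way to compute such a prediction when the VAR parameters are known is to propagate only the deviation between the counterfactual and the observed series, since both share the same noise realization and the model is linear. The Python sketch below illustrates this idea; the function name `counterfactual`, its arguments, and the 0-based time index are hypothetical, and the sketch assumes that the observed series itself contains no interventions.

```python
import numpy as np

def counterfactual(X, B, i, start, u):
    """Predict the series that would result from setting x_{i,t} = u[t - start]
    for all t >= start (0-based), given the observed series X of shape (T, D)
    and known lag matrices B = [B_1, ..., B_Q].

    Only the deviation delta_t = x*_t - x_t is propagated: the shared noise
    cancels, so delta_t = sum_q B_q delta_{t-q} for non-intervened entries.
    """
    T, D = X.shape
    delta = np.zeros((T, D))
    for t in range(start, T):
        for q, Bq in enumerate(B, start=1):
            if t - q >= 0:
                delta[t] += Bq @ delta[t - q]
        # The hard intervention overrides whatever x_i would have been.
        delta[t, i] = u[t - start] - X[t, i]
    return X + delta
```

Under these assumptions, the prediction coincides with a re-simulation of the system that reuses the same noise and forces $x_{i,t}$ to the new signal, without the noise ever having to be estimated explicitly.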

Figure 4: Predicting the effects of a past counterfactual intervention. We consider a counterfactual (hypothetical) intervention upon $x_{1,t}$ starting at time 40. We visualize three curves: the original observed signal $x_{2,t}$ (black), the signal which would have been observed if the hypothetical intervention upon $x_{1,t}$ had been performed (blue), and our prediction of what would have been observed in the counterfactual situation (red). The black and blue curves were generated by simulating the system, and the red curve was produced by making predictions from the observed realization.

V Conclusion

In this work, we studied the problems of counterfactual reasoning and causal inference in the context of vector autoregressive models. By considering the system as modular and assigning separate functions to each variable, we can approach regression and model learning one variable at a time. Learning the causal model is a prerequisite for reasoning about counterfactual interventions and other what-if scenarios. The analysis of counterfactuals for nonlinear settings is analogous to what we have presented here, although additional care must be taken to estimate unobserved variables like the noises. Future work involves understanding the interventions for improved learning of the studied system and combining counterfactual analysis with interventions to achieve desired outcomes. Other directions include extending the counterfactual analysis to nonlinear models and models where the functional relationships are unknown and need to be estimated.

References

  • [1] J. Pearl, Causality. Cambridge University Press, 2009.
  • [2] D. B. Rubin, “Estimating causal effects of treatments in randomized and nonrandomized studies,” Journal of Educational Psychology, vol. 66, no. 5, p. 688, 1974.
  • [3] P. Spirtes, C. N. Glymour, R. Scheines, and D. Heckerman, Causation, Prediction, and Search. MIT Press, 2000.
  • [4] X. Zheng, B. Aragam, P. K. Ravikumar, and E. P. Xing, “DAGs with NO TEARS: Continuous optimization for structure learning,” Advances in Neural Information Processing Systems, vol. 31, 2018.
  • [5] T. Schreiber, “Measuring information transfer,” Physical Review Letters, vol. 85, no. 2, p. 461, 2000.
  • [6] J. Peters, D. Janzing, and B. Schölkopf, Elements of causal inference: foundations and learning algorithms. The MIT Press, 2017.
  • [7] C. W. Granger, “Investigating causal relations by econometric models and cross-spectral methods,” Econometrica: Journal of the Econometric Society, pp. 424–438, 1969.
  • [8] G. Sugihara, R. May, H. Ye, C.-h. Hsieh, E. Deyle, M. Fogarty, and S. Munch, “Detecting causality in complex ecosystems,” Science, vol. 338, no. 6106, pp. 496–500, 2012.
  • [9] K. Butler, G. Feng, and P. M. Djurić, “On causal discovery with convergent cross mapping,” IEEE Transactions on Signal Processing, vol. 71, pp. 2595–2607, 2023.
  • [10] J. M. Ogarrio, P. Spirtes, and J. Ramsey, “A hybrid causal search algorithm for latent variable models,” in Conference on Probabilistic Graphical Models, pp. 368–379, PMLR, 2016.
  • [11] F. Eberhardt, C. Glymour, and R. Scheines, “N-1 experiments suffice to determine the causal relations among N variables,” Innovations in Machine Learning: Theory and Applications, pp. 97–112, 2006.
  • [12] R. Tibshirani, “Regression shrinkage and selection via the Lasso,” Journal of the Royal Statistical Society Series B: Statistical Methodology, vol. 58, no. 1, pp. 267–288, 1996.
  • [13] C. Cui, P. Banelli, and P. M. Djurić, “Topology inference of directed graphs by Gaussian processes with sparsity constraints,” IEEE Transactions on Signal Processing, 2024.
  • [14] K. Butler, G. Feng, and P. M. Djurić, “A differential measure of the strength of causation,” IEEE Signal Processing Letters, vol. 29, pp. 2208–2212, 2022.
  • [15] Y. Liu, C. Cui, D. Waxman, K. Butler, and P. M. Djurić, “Detecting confounders in multivariate time series using strength of causation,” in 2023 31st European Signal Processing Conference (EUSIPCO), pp. 1400–1404, IEEE, 2023.
  • [16] S. Salehkaleybar, A. Ghassami, N. Kiyavash, and K. Zhang, “Learning linear non-Gaussian causal models in the presence of latent variables,” Journal of Machine Learning Research, vol. 21, no. 39, pp. 1–24, 2020.