License: arXiv.org perpetual non-exclusive license
arXiv:2404.04399v1 [stat.ML] 05 Apr 2024

Longitudinal Targeted Minimum Loss-based Estimation with Temporal-Difference Heterogeneous Transformer

Toru Shirakawa    Yi Li    Yulun Wu    Sky Qiu    Yuxuan Li    Mingduo Zhao    Hiroyasu Iso    Mark van der Laan
Abstract

We propose Deep Longitudinal Targeted Minimum Loss-based Estimation (Deep LTMLE), a novel approach to estimate the counterfactual mean of an outcome under dynamic treatment policies in longitudinal problem settings. Our approach utilizes a transformer architecture with heterogeneous type embedding trained using temporal-difference learning. After obtaining an initial estimate using the transformer, we follow the targeted minimum loss-based estimation (TMLE) framework to statistically correct for the bias commonly associated with machine learning algorithms. Furthermore, our method supports statistical inference by providing 95% confidence intervals grounded in asymptotic statistical theory. Simulation results demonstrate our method’s superior performance over existing approaches, particularly in complex, long time-horizon scenarios. It remains effective in small-sample, short-duration contexts, matching the performance of asymptotically efficient estimators. To demonstrate our method in practice, we apply it to estimate counterfactual mean outcomes for standard versus intensive blood pressure management strategies in a real-world cardiovascular epidemiology cohort study.

Machine Learning, ICML


1 Introduction

In the fields of medicine and public health, researchers frequently encounter data that are both high-dimensional and longitudinal. The outcomes of interest in these settings often involve the time to some failure event, such as total mortality (van der Laan & Robins, 2003; Salerno & Li, 2023). Estimating the counterfactual probability of the event is challenging in high-dimensional longitudinal settings. Existing methods scale poorly computationally and lose statistical performance because of the curse of dimensionality (Wyss et al., 2022). In response, we propose an estimator that is computationally scalable and simultaneously allows for robust statistical inference. Our estimator incorporates a transformer architecture for estimating the target estimand, defined as the cumulative incidence probability under dynamic interventions, where the treatment sequence depends on patients’ evolving histories. The target estimand can be identified through the g-formula under suitable assumptions (Robins, 1986). However, the target functional involves integration over potentially high-dimensional time-dependent covariates across the time horizon, which poses computational challenges. Our method advances the longitudinal targeted minimum loss-based estimation (LTMLE) framework (van der Laan & Gruber, 2012; Lendle et al., 2017) by leveraging the computational capabilities of the transformer, facilitating the estimation of the target estimand and relevant nuisance parameters.

A number of estimators for the target estimand have been proposed since the pioneering work of Robins (1986). These estimators first express the target parameter as a functional of nuisance parameters, given a structural assumption on the underlying variables. A common strategy for constructing an estimator is then the plug-in approach, in which one estimates the nuisance components with some models and plugs them into the target functional. However, because naively plugging in the estimated nuisance components causes bias, several methods have been proposed to remove this bias using the first variation of the target functional, called the influence function. Examples of such de-biasing techniques include one-step estimators (Klaassen, 1987; Bickel et al., 1993), estimating equations (Robins et al., 1994; Chernozhukov et al., 2022), and targeted minimum loss-based estimation (TMLE) (van der Laan & Rose, 2011). Notably, because TMLE is a plug-in estimator, it respects any conditional bounds on the outcome and global bounds on the statistical model, resulting in improved finite-sample performance (Gruber & van der Laan, 2012).

The first-order bias of the plug-in estimator is represented as a population mean of the influence function evaluated at the estimated nuisance distribution. Bias correction is performed by setting the empirical analogue of this term to zero. TMLE achieves this by optimizing a loss function along a submodel that starts from the initial nuisance estimate (Bang & Robins, 2005; van der Laan & Rubin, 2006; van der Laan & Rose, 2011). The loss function and the submodel are chosen so that the linear span of the derivative of the loss function along the submodel contains the efficient influence function, that is, the influence function with minimal variance. Targeting is the term for this correction, in which the initial estimate is fluctuated along the submodel.

The current LTMLE, a TMLE developed in the context of longitudinal data, relies on a sequential regression representation of the target estimand (Bang & Robins, 2005). An ensemble machine learning technique called super learner is then used to estimate the nuisance components of the data-generating distribution (van der Laan et al., 2007). In real-world complex longitudinal data, these nuisance components, such as the survival probability at a given time, may depend on the entire past history. Therefore, the Markov property, which states that future variable values depend only on the present variables and are independent of the past, is not guaranteed to hold. In other words, every observed variable could depend on all variables that precede it in the time ordering. Hence, we want our model for the nuisance components to accept histories of variable length as input. Under the targeted learning framework, we introduce a transformer architecture tailored to our longitudinal setting and propose a novel method for bias correction that uses a single fluctuation parameter across all time points.

Our contributions are as follows: 1) we develop a general method that uses a transformer architecture to facilitate valid statistical inference in longitudinal settings concerning survival outcomes under dynamic interventions; 2) we propose a method for bias correction that uses a one-dimensional fluctuation for any length of time horizon; 3) we demonstrate statistical performance competitive with asymptotically efficient estimators in simple, low-dimensional settings, and superior statistical and computational performance in more complex settings; and 4) we apply our method to real-world medical data, with results presented in a format that aligns with clinical research guidelines.

2 Related Work

In the data science literature, several methods have been proposed that predict counterfactual outcomes from patient history. These methods include G-Net (Li et al., 2021), the counterfactual recurrent network (CRN) (Bica et al., 2020), and the causal transformer (CT) (Melnychuk et al., 2022). However, their target parameters do not involve survival outcomes, and the methods are optimized for the mean squared error (MSE) of individual predictions rather than for statistical inference. DeepACE (Frauen et al., 2023), which uses deep neural networks to estimate all propensity scores and outcome regressions simultaneously, is closely related to the present study. It also has an additional targeting layer that implements the one-dimensional submodel proposed by van der Laan (van der Laan & Rose, 2018). Our method differs from DeepACE in the following respects. First, DeepACE incorporates the targeting step within its loss function, which requires an additional hyperparameter. However, the chosen value of this hyperparameter is not justified, and no guidance is given on tuning it in practical applications. Our approach, in contrast, separates the targeting step, aligning more closely with the TMLE literature. Second, DeepACE does not address survival outcomes; specifically, it does not account for the degeneracy of the process after a patient’s event has occurred. Third, while DeepACE utilizes the long short-term memory (LSTM) architecture, our method employs transformers, which are better at capturing long-term dependencies and are more computationally efficient to train than LSTMs. Moreover, DeepACE does not provide uncertainty measures, such as confidence intervals, which limits its utility for statistical inference.

Our problem of estimating the mean of counterfactual outcomes from longitudinal observational data under dynamic interventions has been extensively investigated as an off-policy evaluation problem in the bandit and reinforcement learning literature (Levine et al., 2020). Methods that use the influence function to correct bias after plugging in the initial estimate were also introduced in this context (Jiang & Li, 2016; Farajtabar et al., 2018; Narita et al., 2021). However, they did not provide tools for inference. Double reinforcement learning (Kallus & Uehara, 2020) utilized efficient influence functions in the spirit of double machine learning (Chernozhukov et al., 2018), which is a closed form of a more general debiased estimating equation framework (Chernozhukov et al., 2022), to correct plug-in bias, and proved efficiency. TMLE, in contrast, deforms the distribution itself to correct bias before it is plugged into the target functional, so that the plugged-in values remain within the domain of the functional.

3 Problem Formulation

In this section, following the roadmap of causal inference (Petersen & van der Laan, 2014; van der Laan & Rose, 2018; Dang et al., 2023), we first describe the experiment that generated the observed data and the statistical model that contains the data-generating distribution. Next, we define our causal target parameter. Then, we discuss the assumptions needed to identify our target parameter from the observed data. Finally, we describe the statistical approach for constructing the estimator and correcting its bias.

3.1 Data

We consider the general longitudinal setting involving repeated measurements of a set of variables for a group of $n$ patients over a period of time. In particular, our observed data contain $n$ independent and identically distributed copies of the random vector

$$O = (W_{0} = W, L_{1}, A_{1}, Y_{1}, \ldots, L_{T}, A_{T}, Y_{T} = Y) \tag{1}$$

with baseline covariates $W$, time-dependent covariates $L_{t}$, treatments $A_{t}$, and outcomes $Y_{t}$. We use $P_{0}$ to denote the true probability distribution of $O$ that generated the data, and $P_{0}$ is in some statistical model $\mathcal{M}$. The stopping time $T$ is a random variable (e.g., time of death in the case of survival analysis), and we use $\tau$ to denote the maximum time. We remark that in real-world data, patients are often subject to censoring. For a formulation of the data structure involving censoring nodes, see Appendix H.

3.2 Target Parameter

To define the target parameter, we introduce a structural causal model (SCM). In brief, an SCM assumes that each observed random variable $X$ is generated from its parent nodes $pa(X)$ and external noise $U_{X}$ by a production function $f_{X}$ as $X = f_{X}(pa(X), U_{X})$. By abuse of notation, we also denote the induced probability measure of $X$ by the same symbol $f_{X}$. See Appendix C.1 for details.

Our target parameter is the counterfactual mean of the final outcome $Y$ under a user-specified dynamic treatment policy $g = [g_{t}]_{t=1}^{\tau}$, where $g_{t}$ is a probability measure on the treatment space conditioned on the whole history $pa(A_{t}) = (L_{1:t}, A_{1:t-1}, Y_{1:t-1})$ up to, but not including, $A_{t}$. Specifically, our target parameter is given by

$$\psi(P) = \mathbb{E} Y^{g}, \tag{2}$$

which is the mean of the counterfactual outcome produced by replacing $\pi$, defined as the observed treatment policy from the data, with $g$ in the structural causal model.
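As a concrete illustration of what a dynamic policy $g_{t}$ looks like as an object, the following minimal sketch represents it as a function that maps the history $pa(A_{t})$ to the probability of treating at time $t$. Everything here is hypothetical and not taken from the paper: the name `threshold_policy`, the use of a single scalar covariate (for example, a blood pressure reading) as $L_{t}$, the "stay on treatment once started" rule, and the threshold values 140 and 120.

```python
from typing import Callable, Sequence

# A policy g_t maps the history pa(A_t) = (L_{1:t}, A_{1:t-1}, Y_{1:t-1})
# to P(A_t = 1 | pa(A_t)). All names and thresholds below are hypothetical.
Policy = Callable[[Sequence[float], Sequence[int], Sequence[int]], float]


def threshold_policy(threshold: float) -> Policy:
    """Build a deterministic dynamic policy driven by the latest covariate."""

    def g_t(l_hist: Sequence[float], a_hist: Sequence[int], y_hist: Sequence[int]) -> float:
        # Depends on past treatment: once treatment has started, continue it.
        if a_hist and a_hist[-1] == 1:
            return 1.0
        # Otherwise treat exactly when the most recent covariate is elevated.
        return 1.0 if l_hist[-1] >= threshold else 0.0

    return g_t


standard = threshold_policy(140.0)   # illustrative "standard" rule
intensive = threshold_policy(120.0)  # illustrative "intensive" rule

# At t = 1 there is no treatment or outcome history yet.
p_standard = standard([130.0], [], [])
p_intensive = intensive([130.0], [], [])
```

A stochastic policy fits the same interface by returning a probability strictly between 0 and 1.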

Identification

Under the positivity assumption:

$$g \ll \pi, \tag{3}$$

and the sequential randomization assumption:

$$Y^{g} \perp A_{t} \mid pa(A_{t}) \text{ for } t = 1, \ldots, \tau, \tag{4}$$

we can identify our target causal parameter through the g-formula as the mean of $Y$ under the counterfactual distribution obtained by replacing the distribution $\pi$ with $g$ (Robins, 1986):

$$\mathbb{E} Y^{g} = \mathbb{E}_{g} Y. \tag{5}$$

Note that the consistency assumption $Y = Y^{\pi}$, usually stated in the causal inference literature, is a consequence of the definition of the counterfactual outcome in our SCM. Now the problem is reduced to the estimation of the statistical parameter:

$$\psi(P) = \mathbb{E}_{g} Y. \tag{6}$$
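To make the identification result concrete, the sketch below checks it numerically on a toy SCM. It is an illustration under stated assumptions, not the paper's setup: the horizon is fixed at two time points, all variables are binary, there is no intermediate outcome, and the structural equations, the observed policy `pi`, and the target policy `g` are all invented for this example. The g-formula is evaluated nonparametrically from data generated under `pi` and compared with a Monte Carlo mean of $Y^{g}$ obtained by replacing `pi` with `g` in the SCM.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)


def pi(l):
    """Observed policy P(A_t = 1 | L_t); both arms have positive probability."""
    return 0.3 + 0.4 * np.asarray(l, dtype=float)


def g(l):
    """Target dynamic policy: treat at time t exactly when L_t = 1."""
    return np.asarray(l, dtype=float)


def simulate(n, policy):
    """Draw n i.i.d. copies of (L1, A1, L2, A2, Y) with treatments from `policy`."""
    l1 = rng.binomial(1, 0.5, n)
    a1 = rng.binomial(1, policy(l1))
    l2 = rng.binomial(1, 0.2 + 0.5 * l1 - 0.15 * a1)
    a2 = rng.binomial(1, policy(l2))
    y = rng.binomial(1, 0.3 + 0.2 * l1 - 0.1 * a1 + 0.3 * l2 - 0.15 * a2)
    return l1, a1, l2, a2, y


def g_formula(l1, a1, l2, a2, y, policy):
    """Nonparametric g-formula: sum over histories, with pi replaced by `policy`."""
    total = 0.0
    for v1, u1, v2, u2 in itertools.product([0, 1], repeat=4):
        w1 = policy(v1) if u1 == 1 else 1.0 - policy(v1)
        w2 = policy(v2) if u2 == 1 else 1.0 - policy(v2)
        if w1 * w2 == 0.0:
            continue  # histories that the target policy never produces
        s1 = (l1 == v1) & (a1 == u1)
        s2 = s1 & (l2 == v2) & (a2 == u2)
        p_l1 = np.mean(l1 == v1)       # P(L1 = v1)
        p_l2 = np.mean(l2[s1] == v2)   # P(L2 = v2 | L1, A1)
        q = np.mean(y[s2])             # E[Y | L1, A1, L2, A2]
        total += p_l1 * w1 * p_l2 * w2 * q
    return float(total)


observed = simulate(200_000, pi)            # data generated under pi
estimate = g_formula(*observed, policy=g)   # right-hand side of (5)
truth = simulate(200_000, g)[-1].mean()     # Monte Carlo mean of Y^g
naive = observed[-1].mean()                 # observed mean of Y under pi
```

Positivity holds here because `pi` assigns probability 0.3 or 0.7 to treatment for every history, so every history that `g` can produce is also observed under `pi`. Plain enumeration of histories is only feasible in such a small discrete example.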

3.3 Targeted Minimum Loss-based Estimation

Given an estimator $\hat{P}_{n}$ of the data-generating distribution $P_{0}$, a natural estimator of the target functional is the plug-in estimator $\hat{\psi}_{n} = \psi(\hat{P}_{n})$. Under a regularity condition, $\psi$ admits the following first-order expansion

$$\psi(\hat{P}_{n}) - \psi(P_{0}) = -\int_{\mathcal{O}} D^{\star}(\hat{P}_{n}) \, dP_{0} + R_{2}(\hat{P}_{n}, P_{0}), \tag{7}$$

where $D^{\star}$ is the efficient influence function of $\psi$, and $R_{2}(\hat{P}_{n}, P_{0})$ is the exact remainder. An influence function quantifies how much an estimator changes under small perturbations of the sample. The efficient influence function is the influence function with minimal variance. The idea of TMLE is to eliminate the empirical analogue of the first term on the right-hand side by fluctuating $\hat{P}_{n}$ to find a distribution $\hat{P}^{\star}_{n}$ with $P_{n} D^{\star}(\hat{P}^{\star}_{n}) = 0$, where $Pf = \int f \, dP$ is shorthand for the expectation of a measurable function $f$ with respect to a probability measure $P$.
Our problem is to obtain an initial estimate $\hat{P}_{n}$ from potentially large-scale, high-dimensional longitudinal data, and to correct the bias of the plug-in estimator $\psi(\hat{P}_{n})$ by fluctuating $\hat{P}_{n}$.
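The targeting idea can be sketched in the simplest possible case. The example below is a textbook-style point-treatment TMLE, not the longitudinal procedure of this paper: it assumes a single time point, a binary covariate `w`, a binary treatment `a`, a binary outcome `y`, and the static policy "always treat". The data-generating mechanism, the deliberately crude initial fit, and all variable names are assumptions made for illustration. A logistic fluctuation with one parameter `eps` is fitted by Newton's method, after which the empirical mean of the efficient influence function is zero.

```python
import numpy as np


def expit(x):
    return 1.0 / (1.0 + np.exp(-x))


def logit(p):
    return np.log(p / (1.0 - p))


rng = np.random.default_rng(1)
n = 5_000
w = rng.binomial(1, 0.5, n)                   # baseline covariate
a = rng.binomial(1, 0.2 + 0.6 * w)            # treatment, confounded by w
y = rng.binomial(1, 0.2 + 0.3 * w + 0.2 * a)  # outcome
truth = 0.5 * 0.4 + 0.5 * 0.7                 # E[Y^{a=1}] in this toy model

# Initial outcome regression, deliberately crude: it ignores w.
q1_init = np.full(n, y[a == 1].mean())                       # Q(1, W)
qa_init = np.where(a == 1, y[a == 1].mean(), y[a == 0].mean())  # Q(A, W)
plug_in = q1_init.mean()                                     # biased plug-in

# Propensity score estimated within strata of w, and the weight g / pi.
pi_hat = np.where(w == 1, a[w == 1].mean(), a[w == 0].mean())
h = a / pi_hat

# Fluctuate along logit Q_eps = logit Q + eps * h; solve the score equation.
eps = 0.0
for _ in range(100):
    q_eps = expit(logit(qa_init) + eps * h)
    score = np.sum(h * (y - q_eps))
    hessian = -np.sum(h**2 * q_eps * (1.0 - q_eps))
    step = score / hessian
    eps -= step
    if abs(step) < 1e-12:
        break

q1_star = expit(logit(q1_init) + eps / pi_hat)  # targeted Q(1, W)
qa_star = expit(logit(qa_init) + eps * h)       # targeted Q(A, W)
psi_star = q1_star.mean()                       # targeted plug-in estimate

# Efficient influence function at the targeted fit; its empirical mean is ~0.
eif = h * (y - qa_star) + q1_star - psi_star
se = eif.std(ddof=1) / np.sqrt(n)
ci = (psi_star - 1.96 * se, psi_star + 1.96 * se)  # Wald-type 95% interval
```

Because the score of the fluctuation model at the fitted `eps` equals the first component of the influence function, solving the score equation removes the first-order bias term while the estimate stays a plug-in.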

Algorithm 1 Temporal Difference Learning of Conditional Counterfactual Mean Outcomes
1:  for $b = 1$ to $B$ do
2:     $\hat{Q}^{g}_{T}(pa(Y_{T})) \leftarrow \hat{Q}^{g}_{T}(pa(Y_{T})) - \alpha \cdot \partial_{\hat{Q}^{g}_{T}(pa(Y_{T}))} \mathcal{L}(\hat{Q}^{g}_{T}(pa(Y_{T})),\; Y_{T})$
3:     $\hat{G}_{T} \leftarrow \hat{G}_{T} - \alpha \cdot \partial_{\hat{G}_{T}(pa(A_{T}))} \mathcal{L}(\hat{G}_{T}(pa(A_{T})),\; A_{T})$
4:     for $t = T-1$ to $1$ do
5:        $\hat{V}^{g}_{t+1}(pa(A_{t+1})) \leftarrow \mathbb{E}_{A_{t+1} \sim g_{t+1}(pa(A_{t+1}))} \big[ \hat{Q}^{g}_{t+1}(pa(A_{t+1}), A_{t+1}) \big]$
6:        $\hat{Q}^{g}_{t}(pa(Y_{t})) \leftarrow \hat{Q}^{g}_{t}(pa(Y_{t})) - \alpha \cdot \partial_{\hat{Q}^{g}_{t}(pa(Y_{t}))} \mathcal{L}(\hat{Q}^{g}_{t}(pa(Y_{t})),\; \hat{V}^{g}_{t+1}(pa(A_{t+1})))$
7:        $\hat{G}_{t} \leftarrow \hat{G}_{t} - \alpha \cdot \partial_{\hat{G}_{t}(pa(A_{t}))} \mathcal{L}(\hat{G}_{t}(pa(A_{t})),\; A_{t})$
8:     end for
9:  end for
10:  Output $(\hat{Q}^{g}_{t}, \hat{V}^{g}_{t}, \hat{G}_{t})_{t=1}^{T}$
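The updates in Algorithm 1 can be sketched with tabular models in place of a transformer. The example below is a simplified illustration under several assumptions that are not from the paper: a fixed horizon $T = 2$, binary variables, no intermediate outcome, full-batch gradient steps instead of SGD, cross-entropy as the loss $\mathcal{L}$, and one logit per history cell for each of $\hat{Q}_{t}$ and $\hat{G}_{t}$. The toy data-generating mechanism and the target policy `g` are also invented.

```python
import numpy as np

rng = np.random.default_rng(0)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


# Toy observational data with fixed horizon T = 2.
n = 20_000
l1 = rng.binomial(1, 0.5, n)
a1 = rng.binomial(1, 0.3 + 0.4 * l1)
l2 = rng.binomial(1, 0.2 + 0.5 * l1 - 0.15 * a1)
a2 = rng.binomial(1, 0.3 + 0.4 * l2)
y = rng.binomial(1, 0.3 + 0.2 * l1 - 0.1 * a1 + 0.3 * l2 - 0.15 * a2)


def g(l):
    """Target policy P(A_t = 1 | history): treat exactly when L_t = 1."""
    return l.astype(float)


# Integer cell index of each full history (no Markov assumption is used).
h_a1 = l1             # pa(A_1) = L_1
h_y1 = 2 * h_a1 + a1  # pa(Y_1) = (L_1, A_1)
h_a2 = 2 * h_y1 + l2  # pa(A_2) = (L_1, A_1, L_2)
h_y2 = 2 * h_a2 + a2  # pa(Y_2) = (L_1, A_1, L_2, A_2)

# One logit per history cell for the outcome (q) and propensity (g) models.
q1, q2 = np.zeros(4), np.zeros(16)
g1, g2 = np.zeros(2), np.zeros(8)


def step(logits, cells, target, alpha):
    """One gradient step on the mean cross-entropy of sigmoid(logits[cells])."""
    residual = sigmoid(logits[cells]) - target
    grad = np.bincount(cells, weights=residual, minlength=logits.size) / cells.size
    return logits - alpha * grad


alpha, B = 5.0, 1000
for b in range(B):
    q2 = step(q2, h_y2, y, alpha)    # line 2: regress Q_T on Y_T
    g2 = step(g2, h_a2, a2, alpha)   # line 3: fit propensity at T
    # line 5: integrate Q_2 over A_2 ~ g_2 to obtain V_2
    v2 = g(l2) * sigmoid(q2[2 * h_a2 + 1]) + (1 - g(l2)) * sigmoid(q2[2 * h_a2])
    q1 = step(q1, h_y1, v2, alpha)   # line 6: regress Q_1 on the target V_2
    g1 = step(g1, h_a1, a1, alpha)   # line 7: fit propensity at t = 1

# Initial (untargeted) plug-in estimate: integrate Q_1 over A_1 ~ g_1.
v1 = g(l1) * sigmoid(q1[2 * h_a1 + 1]) + (1 - g(l1)) * sigmoid(q1[2 * h_a1])
psi_plug_in = v1.mean()
```

In this toy model the counterfactual mean under `g` is 0.40625 by direct calculation, and `psi_plug_in` should land close to it. The target `v2` is treated as fixed within each step, which is the usual convention in temporal-difference learning.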

4 Proposed Method

In this section, we describe our proposed method, Deep Longitudinal Targeted Minimum Loss-based Estimation (Deep LTMLE). Let

$$Q^{g}_{t}(pa(Y_{t})) = \mathbb{E}_{g}[Y_{T} \mid L_{1:t}, A_{1:t}, Y_{1:t-1}] \tag{8}$$

be the mean outcome at stopping time $T$ given the history before node $Y_{t}$ for $t = 1, \ldots, \tau$, where future treatments $A_{t+1}, \ldots, A_{T}$ follow a counterfactual treatment assignment policy $g$. Similarly,

$$V^{g}_{t}(pa(A_{t})) = \mathbb{E}_{g}[Y_{T} \mid L_{1:t}, A_{1:t-1}, Y_{1:t-1}] \tag{9}$$

is the mean outcome at stopping time $T$ given the history before node $A_{t}$, for $t = 1, \ldots, \tau$. We abbreviate $Q^{g}_{t}(pa(Y_{t}))$ as $Q^{g}_{t}$ when it is clear from the context, and similarly for $V^{g}_{t}$. Our goal is to estimate $\psi(P)$ by

$$\begin{aligned}
\hat{\psi}^{\star}_{n} &= P_{n} \hat{V}^{g}_{1,\varepsilon^{\star}}(pa(A_{1})) \\
&= P_{n} \mathbb{E}_{A_{1} \sim g_{1}(pa(A_{1}))} \big[ \hat{Q}^{g}_{1,\varepsilon^{\star}}(pa(A_{1}), A_{1}) \big]
\end{aligned} \tag{10}$$

where $\hat{Q}^{g}_{1,\varepsilon^{\star}}$ is an estimate of $Q^{g}_{1}$ such that $\hat{\psi}^{\star}_{n}$ is asymptotically efficient. We achieve this by proposing a temporal-difference heterogeneous transformer that yields an initial estimate $\hat{Q}^{g}_{1}$, and then updating this estimate to obtain $\hat{Q}^{g}_{1,\varepsilon^{\star}}$ via targeted minimum loss-based estimation (TMLE).

Figure 1: Architecture of the temporal-difference heterogeneous-token transformer (TDHT). Observed variables are fed into the transformer after embedding layers that depend on the variable types. Embedding layers aggregate a linear transform with a learnable type encoding and a learnable positional encoding. Outputs of the transformer are $G$ after $L$ and $Q$ after $A$. Each output head consists of a linear layer and a final activation function that depends on the variable distribution (sigmoid for binary, softmax for categorical, and none for continuous). The outputs of the $G$ heads are used to learn propensity scores, and those of the $Q$ heads are used for temporal-difference learning after integration with respect to the counterfactual treatment policy.
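The embedding and output-head description in Figure 1 can be sketched in a few lines of NumPy. This is an illustration under assumptions, not the authors' implementation: "aggregate" is taken to mean a sum, the positional encoding is indexed by the time step $t$, the dimensions (`d_model = 8`, a 3-dimensional $L_{t}$) and all names are invented, the parameters are random rather than learned, and the transformer layers themselves are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, max_t = 8, 2
TYPE_ID = {"L": 0, "A": 1, "Y": 2}    # heterogeneous token types
INPUT_DIM = {"L": 3, "A": 1, "Y": 1}  # hypothetical raw dimensions

# Parameters that would be learned; random here for illustration only.
w_in = {k: rng.normal(scale=0.1, size=(INPUT_DIM[k], d_model)) for k in TYPE_ID}
type_enc = rng.normal(scale=0.1, size=(len(TYPE_ID), d_model))
pos_enc = rng.normal(scale=0.1, size=(max_t, d_model))


def embed(kind, t, x):
    """Type-specific linear map plus type encoding plus positional encoding."""
    x = np.asarray(x, dtype=float)
    return x @ w_in[kind] + type_enc[TYPE_ID[kind]] + pos_enc[t - 1]


def output_head(h, weight, bias, kind):
    """Linear layer followed by an activation chosen by the variable type."""
    z = h @ weight + bias
    if kind == "binary":
        return 1.0 / (1.0 + np.exp(-z))
    if kind == "categorical":
        e = np.exp(z - z.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)
    return z  # continuous: no activation


# One patient's sequence (L1, A1, Y1, L2, A2, Y2) as (type, time, value).
sequence = [
    ("L", 1, [0.2, -1.0, 0.5]), ("A", 1, [1.0]), ("Y", 1, [0.0]),
    ("L", 2, [0.1, -0.8, 0.7]), ("A", 2, [0.0]), ("Y", 2, [1.0]),
]
tokens = np.stack([embed(k, t, x) for k, t, x in sequence])  # (6, d_model)

# The transformer is omitted; `tokens` stands in for its hidden states.
g_out = output_head(tokens[0], rng.normal(size=(d_model, 1)), np.zeros(1), "binary")
cat_out = output_head(tokens[0], rng.normal(size=(d_model, 3)), np.zeros(3), "categorical")
```

The `g_out` head is read at an $L$ token and plays the role of a propensity output for a binary treatment, matching the caption's "$G$ after $L$"; a $Q$ head would be read at an $A$ token in the same way.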

4.1 Temporal-Difference Heterogeneous Transformer

To learn the initial model $\hat{V}^{g}_{1}$, we use the temporal-difference loss as the objective to learn the underlying models $\hat{Q}^{g}_{t}$ for $t=1,\ldots,\tau$ via stochastic gradient descent (SGD). The principle of temporal-difference learning (Sutton, 1988; Mnih et al., 2013) is to supervise $\hat{Q}^{g}_{t}$ to obey the temporal equality of $Q^{g}_{t}$:

$$
\begin{aligned}
Q^{g}_{t} &= \mathbb{E}_{p(Y_{t},L_{t+1}\mid pa(Y_{t}))}V^{g}_{t+1} \\
&= \mathbb{E}_{p(Y_{t},L_{t+1}\mid pa(Y_{t})),\,g_{t+1}(pa(A_{t+1}))}Q^{g}_{t+1}
\end{aligned}
\tag{11}
$$

for $t=1,\ldots,\tau-1$ and $Q^{g}_{\tau}=\mathbb{E}_{p(Y_{\tau}\mid pa(Y_{\tau}))}Y_{\tau}$. The temporal-difference loss on a sample trajectory is thus given by $\mathcal{L}^{Q}_{t}=\mathcal{L}(\hat{Q}^{g}_{t},\hat{V}^{g}_{t+1})$ for $t=1,\ldots,T-1$, where $\hat{V}^{g}_{t+1}(pa(A_{t+1}))=\mathbb{E}_{A_{t+1}\sim g_{t+1}(pa(A_{t+1}))}\big[\hat{Q}^{g}_{t+1}(pa(A_{t+1}),A_{t+1})\big]$ can be computed by Monte-Carlo estimation if $A$ is continuous, and $\mathcal{L}^{Q}_{T}=\mathcal{L}(\hat{Q}^{g}_{T},Y_{T})$. In the case of survival analysis, the components for $t=1,\ldots,\tau-1$ are defined as $\mathcal{L}^{Q}_{t}=\mathcal{L}_{\mathrm{bce}}(\hat{Q}^{g}_{t},\hat{V}^{g}_{t+1})$, where $\mathcal{L}_{\mathrm{bce}}(\hat{y},y)=-\big[y\log\hat{y}+(1-y)\log(1-\hat{y})\big]$ is the binary cross entropy loss. To yield the updated model $\hat{Q}^{g}_{t,\varepsilon}$, we need to adjust $\hat{Q}^{g}_{t}$ after model training, factoring in the estimating model $\hat{G}_{t}$ for the propensity score

$$
G_{t}(pa(A_{t}))=\mathbb{P}[A_{t}=1\mid L_{1:t},A_{1:t-1},Y_{1:t-1}], \tag{12}
$$

which we will describe in detail in the next section. Hence, the loss function also needs to include $\mathcal{L}^{G}_{t}=\mathcal{L}(\hat{G}_{t},A_{t})$ and is thus given by

$$
\mathcal{L}=\sum_{t=1}^{\tau}\mathbb{1}\{t\leq T\}\mathcal{L}_{t}=\sum_{t=1}^{T}\mathcal{L}_{t}=\sum_{t=1}^{T}\big(\mathcal{L}^{Q}_{t}+\alpha\mathcal{L}^{G}_{t}\big) \tag{13}
$$

where $\alpha$ is a hyperparameter that controls the relative weights of the two losses. See Algorithm 1 for the optimization workflow. Convergence of the algorithm is discussed in Appendix D.
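The loss in Eq. (13) can be illustrated for a single trajectory with a minimal sketch. It assumes a binary treatment, an outcome in $[0,1]$, and plain Python in place of an autodiff framework; the function and argument names (`td_loss`, `q_next_by_action`, and so on) are hypothetical and are not taken from the authors' implementation.

```python
import math

def bce(y_hat, y, eps=1e-12):
    """Binary cross entropy with a (possibly soft) target y in [0, 1]."""
    y_hat = min(max(y_hat, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(y * math.log(y_hat) + (1.0 - y) * math.log(1.0 - y_hat))

def td_loss(q_obs, q_next_by_action, g_next, g_hat, a_obs, y_last, alpha=1.0):
    """Loss of Eq. (13) for one trajectory observed up to T = len(q_obs).

    q_obs[t]            : Q-hat at step t for the observed history and action
    q_next_by_action[t] : (Q-hat_{t+1}(., a=0), Q-hat_{t+1}(., a=1)) for t < T-1
    g_next[t]           : target-policy probability of a=1 at step t+1
    g_hat[t]            : estimated propensity P(A_t = 1 | history)
    a_obs[t]            : observed binary treatment
    y_last              : observed outcome Y_T
    """
    T = len(q_obs)
    total = 0.0
    for t in range(T):
        if t < T - 1:
            q0, q1 = q_next_by_action[t]
            # V-hat_{t+1}: integrate Q-hat_{t+1} over the policy g_{t+1}.
            target = (1.0 - g_next[t]) * q0 + g_next[t] * q1
        else:
            target = y_last  # the terminal step is supervised by Y_T
        loss_q = bce(q_obs[t], target)
        loss_g = bce(g_hat[t], a_obs[t])
        total += loss_q + alpha * loss_g
    return total
```

For a binary treatment the integration over the policy is an exact two-term sum; for a continuous treatment it would be replaced by a Monte-Carlo average, as noted in the text.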

For the estimation of $Q^{g}_{t}$ and $G_{t}$, we propose a unified model architecture to simultaneously optimize deep neural networks $\hat{Q}^{g}_{t}$ and $\hat{G}_{t}$ in an efficient, non-sequential manner by adapting a decoder-only Transformer (Vaswani et al., 2017; Brown et al., 2020) to longitudinal data with heterogeneous tokens. An overview of the model architecture is given in Figure 1. For each sampled sequence in the training set, we feed each token in the sequence to a linear embedding layer according to its variable type. In the case of Figure 1, there are four different embedding layers $e_{W}$, $e_{L}$, $e_{A}$, and $e_{Y}$. Each embedding layer has the same number of output dimensions.
Then, each embedding is integrated with its positional encoding $E_{0},\ldots,E_{\tau}$ and type encoding $E_{W}$, $E_{L}$, $E_{A}$, and $E_{Y}$, which represent its timestamp and variable type information, through an aggregation function (e.g. sum, concat):

$$
E(\bullet_{t})=\texttt{aggr}\left\{e_{\bullet}(\bullet_{t}),\,E_{\bullet},\,E_{t}\right\} \tag{14}
$$

for $\bullet_{t}\in(W_{0},L_{1},A_{1},Y_{1},\ldots,L_{\tau},A_{\tau},Y_{\tau})$, where we used concat as aggr in the experiments in this work. Note that we include the type encoding $E_{\bullet}$ because $e_{\bullet}$ need not be type-specific linear layers. For a more efficient and parallelizable embedding operation, we can pad each variable to the same number of dimensions before feeding it into a shared embedding layer $e_{\bullet}=e$. Then, the embedded sequence is fed into the transformer, which produces $\hat{G}$ and $\hat{Q}^{g}$ through output heads $f_{G}$ and $f_{Q}$ at each position that corresponds to token type $L$ and $A$, respectively:

$$
\begin{aligned}
\hat{G}_{t} &= f_{G}(\mathrm{transformer}\left\{E(W_{0}),\,\ldots,\,E(L_{t})\right\}) && (15) \\
\hat{Q}^{g}_{t} &= f_{Q}(\mathrm{transformer}\left\{E(W_{0}),\,\ldots,\,E(A_{t})\right\}). && (16)
\end{aligned}
$$

In practice, we can use a joint output layer $f$ for $f_{G}$ and $f_{Q}$ for more efficient and parallelizable output generation, where the number of output dimensions is the sum of the number of dimensions $\mathrm{dim}(A)$ for treatment $A$ and $\mathrm{dim}(Y)$ for outcome $Y$. Then, we compute softmax probabilities, masking out the last $\mathrm{dim}(Y)$ dimensions for $\hat{G}_{t}$ and the first $\mathrm{dim}(A)$ dimensions for $\hat{Q}^{g}_{t}$.
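The masking step for the joint output layer can be sketched as follows. This is only an illustration under the assumption that both treatment and outcome are categorical, so that a softmax applies to each; the helper names `masked_softmax` and `joint_head` are hypothetical.

```python
import math

def masked_softmax(logits, keep):
    """Softmax over the positions where keep[i] is True; others get probability 0."""
    kept = [z for z, k in zip(logits, keep) if k]
    m = max(kept)  # subtract the max for numerical stability
    exps = [math.exp(z - m) if k else 0.0 for z, k in zip(logits, keep)]
    s = sum(exps)
    return [e / s for e in exps]

def joint_head(logits, dim_a, dim_y, token_type):
    """Read G-hat (after an L token) or Q-hat (after an A token) from one joint output."""
    assert len(logits) == dim_a + dim_y
    if token_type == "L":  # G head: mask out the last dim(Y) entries
        keep = [True] * dim_a + [False] * dim_y
        return masked_softmax(logits, keep)[:dim_a]
    if token_type == "A":  # Q head: mask out the first dim(A) entries
        keep = [False] * dim_a + [True] * dim_y
        return masked_softmax(logits, keep)[dim_a:]
    raise ValueError("outputs are only read at L and A positions")
```

Because the same linear layer is applied at every position and only the mask differs, all positions can be processed in one batched operation.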

Unlike prior work (Melnychuk et al., 2022), our proposed architecture does not entail concatenation of variables at the same timestamp or sequential decoding of outputs following the transformer embedding block. This design 1) allows us to handle different types and different numbers of variables at different timestamps (e.g. starting from $W_{0}$, ending at $L_{8}$, while $A_{3}$ and $Y_{3}$ are missing), and 2) is fully parallelizable when we use padding instead of a learnable linear mapping for the embedding layer $e_{\bullet}$ and use the joint output layer $f$.
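The token embedding of Eq. (14), with padding, a shared embedding layer, and concat as the aggregation, can be sketched as below. Random weights stand in for learnable parameters, and all names (`embed_tokens`, `d_pad`, `d_model`) are hypothetical choices made for this illustration rather than the authors' code.

```python
import random

def make_linear(d_in, d_out, rng):
    """A random d_out x d_in weight matrix standing in for a learnable linear layer."""
    return [[rng.gauss(0.0, 0.1) for _ in range(d_in)] for _ in range(d_out)]

def linear(weight, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]

def embed_tokens(tokens, d_pad, d_model, max_time, seed=0):
    """tokens: list of (type, time, values), e.g. ("L", 1, [0.3, 1.2]).

    Returns one vector per token: concat{e(padded values), E_type, E_time}.
    Assumes len(values) <= d_pad for every token.
    """
    rng = random.Random(seed)
    e = make_linear(d_pad, d_model, rng)  # shared embedding layer e
    type_enc = {k: [rng.gauss(0.0, 0.1) for _ in range(d_model)] for k in "WLAY"}
    time_enc = [[rng.gauss(0.0, 0.1) for _ in range(d_model)]
                for _ in range(max_time + 1)]
    out = []
    for kind, t, values in tokens:
        padded = list(values) + [0.0] * (d_pad - len(values))  # zero-pad to d_pad
        # Python list "+" is concatenation, i.e. aggr = concat.
        out.append(linear(e, padded) + type_enc[kind] + time_enc[t])
    return out

# A sequence in which variables of different sizes appear, and none are merged.
sequence = [("W", 0, [1.0, 2.0, 3.0]), ("L", 1, [0.5]), ("A", 1, [1.0]), ("Y", 1, [0.0])]
embedded = embed_tokens(sequence, d_pad=3, d_model=4, max_time=1)
```

Since each variable is its own token, a missing variable is handled by leaving its token out of the sequence rather than by imputing a placeholder.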

4.2 Targeted Minimum Loss-based Estimation

Efficient Influence Function

Since our target parameter is the counterfactual mean outcome at the final time $\tau$, the relevant parts of $P_{0}$ are $Q^{g}_{t,0}$ for $t=1,2,\ldots,\tau$.

Theorem 4.1.

In our counterfactual mean case, the efficient influence function $D^{\star}(P)(O)=D^{\star}(\{Q^{g}_{t},G_{t}\}_{t=1}^{\tau})(O)$ is given by

$$
D^{\star}(P)(O)=(V^{g}_{1}-\psi_{0})+\sum_{t=1}^{T}I_{t}(V^{g}_{t+1}-Q^{g}_{t}) \tag{17}
$$

where $I_{t}=\prod_{s=1}^{t}dg_{s}/d\pi_{s}$ and $V^{g}_{T+1}=Y_{T}$.

This result is given in van der Laan & Gruber (2012).
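To make the indexing of Eq. (17) concrete, the following sketch evaluates $D^{\star}$ for one trajectory. It assumes the per-step quantities have already been computed; the function name `eif` and its arguments are hypothetical, and the arrays are 0-based so that `v[0]` holds $V^{g}_{1}$.

```python
def eif(v, q, ratios, y_last, psi):
    """Evaluate D* of Eq. (17) for one trajectory of observed length T = len(q).

    v[i]      : V^g_{i+1}, so v[0] is V^g_1
    q[i]      : Q^g_{i+1} at the observed history and action
    ratios[i] : dg_s / dpi_s at step s = i+1
    y_last    : Y_T, which plays the role of V^g_{T+1}
    psi       : value of the target parameter
    """
    T = len(q)
    d = v[0] - psi
    weight = 1.0
    for i in range(T):
        weight *= ratios[i]  # I_t = prod_{s<=t} dg_s / dpi_s
        v_next = v[i + 1] if i + 1 < T else y_last  # V_{T+1} = Y_T
        d += weight * (v_next - q[i])
    return d
```

Once a ratio is zero (the observed treatment has probability zero under the policy $g$), the cumulative weight stays zero and later steps contribute nothing.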

4.2.1 Temporal Difference Targeting

Submodel

We update the initial estimate $\hat{Q}^{g}_{t}$ for $Q^{g}_{t,0}$ to $\hat{Q}^{g\star}_{t}$ such that $P_{n}D^{\star}(\{\hat{Q}^{g\star}_{t},\hat{G}_{t}\}_{t=1}^{\tau})=0$.
We realize this by fluctuating $\hat{Q}^{g}_{t}$ along a one-dimensional submodel through the initial fit $\hat{Q}^{g}=[\hat{Q}^{g}_{t}]_{t=1}^{\tau}$ given by $\hat{Q}^{g}_{\varepsilon}=[\hat{Q}^{g}_{t,\varepsilon}]_{t=1}^{\tau}$, where

$$
\operatorname{logit}\hat{Q}^{g}_{t,\varepsilon}=\operatorname{logit}\hat{Q}^{g}_{t}+\varepsilon \tag{18}
$$

with a common fluctuation parameter $\varepsilon$ across $t$. If the outcome is survival, then we automatically have $Y_{t}\in[0,1]$. In a general longitudinal setting with bounded $Y_{t}$'s, we can re-scale both $Y_{t}$ and $\hat{Q}^{g}_{t}$ to $[0,1]$ and use the same one-dimensional submodel.
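The submodel in Eq. (18) and the re-scaling of bounded outcomes amount to a few lines. This is a minimal sketch with hypothetical helper names; it assumes every initial estimate lies strictly inside $(0,1)$ so that the logit is finite.

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def fluctuate(q_hat, eps):
    """Apply Eq. (18): shift every Q-hat_t by the same eps on the logit scale."""
    return [expit(logit(q) + eps) for q in q_hat]

def rescale(y, lower, upper):
    """Map a bounded outcome in [lower, upper] to [0, 1]."""
    return (y - lower) / (upper - lower)
```

Setting `eps = 0` returns the initial fit, and any finite `eps` keeps the fluctuated values inside $(0,1)$.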

Partial Loss function

We search for the optimal fluctuation $\varepsilon^{\star}$ with respect to the partial loss function

$$
\mathcal{L}^{\star}(\hat{Q}^{g}_{\varepsilon},\hat{V}^{g}_{\varepsilon^{\prime}};\hat{G})=\sum_{t=1}^{T}I_{t}(\hat{G})\,\mathcal{L}_{\mathrm{bce}}(\hat{Q}^{g}_{t,\varepsilon},\hat{V}^{g}_{t+1,\varepsilon^{\prime}}), \tag{19}
$$

where $\hat{V}^{g}_{T+1,\varepsilon}=Y_{T}$ and $\hat{V}^{g}_{t,\varepsilon}=\mathbb{E}_{A_{t}\sim g_{t}}\big[\hat{Q}^{g}_{t,\varepsilon}\big]$, such that $\mathcal{L}^{\star}(\hat{Q}^{g}_{\varepsilon},\hat{V}^{g}_{\varepsilon^{\prime}})$ satisfies the following theorem:

Theorem 4.2.

For any $\varepsilon^{\star}$, we have

$$
\partial_{\varepsilon}|_{\varepsilon^{\star}}\mathcal{L}^{\star}(\hat{Q}^{g}_{\varepsilon},\hat{V}^{g}_{\varepsilon^{\star}})=D^{\star}(\hat{Q}^{g}_{\varepsilon^{\star}},\hat{G}). \tag{20}
$$

See Section B for the proof.
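As a sketch of the key calculation only (not a substitute for the proof), consider a single term of the partial loss with its target $v\in[0,1]$ held fixed. Under the logistic submodel (18), $\partial_{\varepsilon}\hat{Q}^{g}_{t,\varepsilon}=\hat{Q}^{g}_{t,\varepsilon}(1-\hat{Q}^{g}_{t,\varepsilon})$, so

$$
\begin{aligned}
\partial_{\varepsilon}\mathcal{L}_{\mathrm{bce}}(\hat{Q}^{g}_{t,\varepsilon},v)
&= -\left[\frac{v}{\hat{Q}^{g}_{t,\varepsilon}}-\frac{1-v}{1-\hat{Q}^{g}_{t,\varepsilon}}\right]\hat{Q}^{g}_{t,\varepsilon}(1-\hat{Q}^{g}_{t,\varepsilon}) \\
&= -\left[v(1-\hat{Q}^{g}_{t,\varepsilon})-(1-v)\hat{Q}^{g}_{t,\varepsilon}\right] \\
&= \hat{Q}^{g}_{t,\varepsilon}-v .
\end{aligned}
$$

Weighting by $I_{t}(\hat{G})$, setting $v=\hat{V}^{g}_{t+1,\varepsilon^{\prime}}$, and summing over $t$ gives $\sum_{t=1}^{T}I_{t}(\hat{G})\,(\hat{Q}^{g}_{t,\varepsilon}-\hat{V}^{g}_{t+1,\varepsilon^{\prime}})$, which matches the temporal-difference terms of Eq. (17) up to the sign convention. The sign does not change where the empirical mean of the derivative vanishes.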

Corollary 4.3.

Suppose that we found an $\varepsilon^{\star}$ satisfying

$$
\partial_{\varepsilon}|_{\varepsilon^{\star}}P_{n}\mathcal{L}^{\star}(\hat{Q}^{g}_{\varepsilon},\hat{V}^{g}_{\varepsilon^{\star}})=0, \tag{21}
$$

then $\hat{Q}^{g}_{\varepsilon^{\star}}$ solves the efficient influence function equation, that is, $P_{n}D^{\star}(\hat{Q}^{g}_{\varepsilon^{\star}},\hat{G})=0$.

In practice, for finite-sample performance, we only need to solve this equation up to the order of the standard error scaled by a $1/\log n$ factor, i.e. up to $\sigma_{n}/\log n$ (van der Laan & Rose, 2011) (Algorithm 2).

Algorithm 2 Temporal-Difference Targeting
  $\varepsilon\leftarrow 0$
  repeat
     $\varepsilon\leftarrow\mathrm{argmin}_{\varepsilon^{\prime}}P_{n}\mathcal{L}^{\star}(\hat{Q}^{g}_{\varepsilon^{\prime}},\hat{V}^{g}_{\varepsilon})$
     $\hat{\sigma}_{n}\leftarrow\sqrt{n^{-1}P_{n}D^{\star 2}(\hat{Q}^{g}_{\varepsilon},\hat{G})}$
  until $|P_{n}D^{\star}(\hat{Q}^{g}_{\varepsilon},\hat{G})|<\hat{\sigma}_{n}/\log n$
  $\hat{\psi}^{\star}_{n}\leftarrow P_{n}\hat{V}^{g}_{1,\varepsilon}(pa(A_{1}))$
  95% CI as $\hat{\psi}^{\star}_{n}\pm 1.96\cdot\hat{\sigma}_{n}$
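The loop in Algorithm 2 can be sketched in plain Python as follows. This is an illustration under several assumptions that are not part of the source: the treatment is binary, the weights $I_{t}(\hat{G})$ are precomputed and non-negative with at least one positive, $n\geq 2$, and the inner argmin is found by bisection on the derivative of the weighted cross entropy (a swapped-in solver, valid because that derivative is monotone in $\varepsilon^{\prime}$). The data layout and all function names are hypothetical.

```python
import math

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def logit(p):
    return math.log(p / (1.0 - p))

def shifted(subject, eps):
    """Return (Q_{t,eps} at the observed action, V_{t,eps}) for each step t."""
    q_obs, v = [], []
    for (q0, q1), a, g in zip(subject["q"], subject["a"], subject["g"]):
        f0, f1 = expit(logit(q0) + eps), expit(logit(q1) + eps)
        q_obs.append(f1 if a == 1 else f0)
        v.append((1.0 - g) * f0 + g * f1)  # integrate over the target policy
    return q_obs, v

def td_terms(subject, eps_q, eps_v):
    """Sum_t I_t * (V_{t+1,eps_v} - Q_{t,eps_q}) with V_{T+1} = Y_T."""
    q_obs, _ = shifted(subject, eps_q)
    _, v = shifted(subject, eps_v)
    targets = v[1:] + [subject["y"]]
    return sum(w * (tg - q) for w, q, tg in zip(subject["w"], q_obs, targets))

def solve_eps(data, eps_v, lo=-20.0, hi=20.0, iters=100):
    """argmin over eps' of the weighted BCE, by bisection on its derivative."""
    def score(e):  # derivative of the loss in eps'; increasing in e
        return -sum(td_terms(s, e, eps_v) for s in data)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if score(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

def td_targeting(data, max_iter=50):
    """Iterate the targeting step until the EIF equation is solved to tolerance."""
    n = len(data)
    eps = 0.0
    for _ in range(max_iter):
        eps = solve_eps(data, eps)  # targets V are held at the current eps
        v1 = [shifted(s, eps)[1][0] for s in data]
        psi = sum(v1) / n
        d = [v - psi + td_terms(s, eps, eps) for v, s in zip(v1, data)]
        mean_d = sum(d) / n
        sigma = math.sqrt(sum(x * x for x in d) / n / n)
        if abs(mean_d) < sigma / math.log(n):
            break
    ci = (psi - 1.96 * sigma, psi + 1.96 * sigma)
    return {"psi": psi, "sigma": sigma, "ci": ci, "eps": eps}
```

Each element of `data` is a dictionary holding per-step pairs `q` of initial estimates under both actions, observed actions `a`, policy probabilities `g`, cumulative weights `w`, and the final outcome `y`. The `max_iter` cap is a safeguard added for this sketch and does not appear in Algorithm 2.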
Convergence of Algorithm 2

Inspecting $\mathcal{L}_{\mathrm{bce}}(\hat{Q}^{g}_{t,\varepsilon}, \hat{V}^{g}_{t,\varepsilon})(O)$ as a function of $\varepsilon$ for different $t$ and different observations $O$ suggests that these curves are bell-shaped, concentrate at different values of $\varepsilon$, and differ in spread. Consequently, their sum across $t$ and across observations, $P_n \mathcal{L}^{\star}(\hat{Q}^{g}_{\varepsilon}, \hat{V}^{g}_{\varepsilon})$, fluctuates considerably as a function of $\varepsilon$, and we expect it to have local minima and local maxima in a neighborhood of $\varepsilon = 0$. Convergence of the algorithm is therefore highly probable, and we did not observe any convergence issues in our simulations.

Comparison to LTMLE

In LTMLE, we only need a good estimate of $Q^{g}_{\tau}$ and then perform backward sequential regression and targeting, as described in (van der Laan & Gruber, 2012). The problem is that the error in the estimate of $Q^{g}_{\tau}$ can propagate as we move backward to obtain $Q^{g,*}_{\tau-1}, \ldots, Q^{g,*}_{1}$. In contrast, after our initial transformer step we have good initial estimates of all of $Q^{g}_{1}, \ldots, Q^{g}_{\tau}$. Instead of relying only on a good estimate of $Q^{g}_{\tau}$, our algorithm makes use of all of them and performs targeting across $t$ with an $o(\varepsilon)$ fluctuation at each $t$. We are thus able to pool information across time in the targeting step.

4.2.2 Sequential Targeting

Alternatively, one could apply a sequential targeting procedure that is very similar to LTMLE but starts from the initial estimates produced by the transformer step.

Submodel

We fluctuate each component of the initial fit $\hat{Q}^{g} = [\hat{Q}^{g}_{t}]_{t=1}^{\tau}$ along the submodel

$$\operatorname{logit}\hat{Q}^{g}_{t,\varepsilon_t} = \operatorname{logit}\hat{Q}^{g}_{t} + \varepsilon_t. \tag{22}$$
Loss function

Starting from $t = \tau$, given that we have found $\varepsilon_{t+1}^{\star}$, among individuals with $T > t - 1$ we search for the empirical loss minimizer $\varepsilon_{t}^{\star}$ with respect to the loss function $\mathcal{L}^{\star}_{t}$, defined as

$$\mathcal{L}^{\star}_{t}(\hat{Q}^{g}_{t,\varepsilon_t}, \hat{V}^{g}_{t+1,\varepsilon_{t+1}^{\star}}) = I_t(\hat{G})\, \mathcal{L}_{\mathrm{bce}}(\hat{Q}^{g}_{t,\varepsilon_t}, \hat{V}^{g}_{t+1,\varepsilon_{t+1}^{\star}}), \tag{23}$$

where $\hat{V}^{g}_{t+1,\varepsilon_{t+1}^{\star}} = \mathbb{E}_{A_{t+1} \sim g_{t+1}}\big[\hat{Q}^{g}_{t+1,\varepsilon_{t+1}^{\star}}\big]$ when $T > t$ and $\hat{V}^{g}_{t+1,\varepsilon_{t+1}^{\star}} = Y_T$ when $T = t$. To initialize, we set $\varepsilon_{\tau+1}^{\star} = 0$.
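
A single targeting step of this kind can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: $\mathcal{L}_{\mathrm{bce}}$ is taken to be the usual binary cross-entropy with a target in $[0, 1]$, the factor $I_t(\hat{G})$ is represented by a list of nonnegative per-individual weights, and the one-dimensional minimization uses a golden-section search that we chose for simplicity (the loss is convex in $\varepsilon_t$). All function names and numbers are hypothetical.

```python
import math

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def logit(p):
    return math.log(p / (1.0 - p))

def fluctuate(q, eps):
    """Submodel (22): shift the initial fit q by eps on the logit scale."""
    return expit(logit(q) + eps)

def bce(q, v):
    """Binary cross-entropy of prediction q against a target v in [0, 1] (assumed form)."""
    return -(v * math.log(q) + (1.0 - v) * math.log(1.0 - q))

def empirical_loss(eps, q_init, v_next, weights):
    """Weighted empirical loss at time t over the individuals still at risk."""
    total = sum(w * bce(fluctuate(q, eps), v)
                for q, v, w in zip(q_init, v_next, weights))
    return total / len(q_init)

def fit_epsilon(q_init, v_next, weights, lo=-5.0, hi=5.0, iters=100):
    """Golden-section search for the minimizer of the convex loss in eps."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    for _ in range(iters):
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if empirical_loss(c, q_init, v_next, weights) < empirical_loss(d, q_init, v_next, weights):
            b = d
        else:
            a = c
    return 0.5 * (a + b)

# Hypothetical inputs for four at-risk individuals at one time point.
q_init = [0.2, 0.4, 0.6, 0.3]    # initial fits at time t
v_next = [0.3, 0.5, 0.9, 0.2]    # targets: next-step values, or Y_T when T = t
weights = [1.0, 2.0, 0.0, 1.5]   # stand-in for I_t(G_hat)

eps_star = fit_epsilon(q_init, v_next, weights)
```

Individuals with zero weight do not influence the fit, so only those whose observed treatment history is compatible with the policy contribute to $\varepsilon_t^{\star}$.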

Lemma 4.4.

Suppose that we found $\varepsilon^{\star}_{\tau}, \ldots, \varepsilon^{\star}_{1}$ sequentially as mentioned above, then $\{\hat{Q}^{g}_{\varepsilon_{t}^{\star}}\}_{t=1}^{\tau}$ solves the efficient influence function.
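
A brief sketch of why this is plausible, under two assumptions that are not restated in this section: $\mathcal{L}_{\mathrm{bce}}(Q, V) = -V \log Q - (1 - V) \log (1 - Q)$ is the usual binary cross-entropy, and $I_t(\hat{G})$ does not depend on $\varepsilon_t$. Under the submodel (22), $\partial \hat{Q}^{g}_{t,\varepsilon_t} / \partial \varepsilon_t = \hat{Q}^{g}_{t,\varepsilon_t}(1 - \hat{Q}^{g}_{t,\varepsilon_t})$, so the derivative of the loss (23) is a weighted residual:

$$\frac{\partial}{\partial \varepsilon_t} \mathcal{L}^{\star}_{t}(\hat{Q}^{g}_{t,\varepsilon_t}, \hat{V}^{g}_{t+1,\varepsilon_{t+1}^{\star}}) = I_t(\hat{G}) \big( \hat{Q}^{g}_{t,\varepsilon_t} - \hat{V}^{g}_{t+1,\varepsilon_{t+1}^{\star}} \big).$$

At the empirical minimizer $\varepsilon_t^{\star}$ this derivative has empirical mean zero over the individuals at risk:

$$P_n \mathbbm{1}(T \geq t)\, I_t(\hat{G}) \big( \hat{V}^{g}_{t+1,\varepsilon_{t+1}^{\star}} - \hat{Q}^{g}_{t,\varepsilon_t^{\star}} \big) = 0, \qquad t = 1, \ldots, \tau.$$

If, as is standard for sequential-regression estimands, the efficient influence function is the sum over $t$ of these weighted residuals plus a final term $\hat{V}^{g}_{1} - \psi$, then every term has empirical mean zero once $\psi$ is estimated by $P_n \hat{V}^{g}_{1,\varepsilon_1^{\star}}$.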

Algorithm 3 Sequential Targeting
1:  $\varepsilon_{\tau+1}^{\star} \leftarrow 0$
2:  for $t = \tau$ to $1$ do
3:     $\varepsilon_{t}^{\star} \leftarrow \operatorname{argmin}_{\varepsilon_t} P_n \mathbbm{1}(T \geq t)\, \mathcal{L}^{\star}_{t}(\hat{Q}^{g}_{t,\varepsilon_t}, \hat{V}^{g}_{t+1,\varepsilon_{t+1}^{\star}})$
4:  end for
5:  $\hat{\psi}^{\dagger}_n \leftarrow P_n \hat{V}^{g}_{1,\varepsilon_1^{\star}}(pa(A_1))$
6:  $\hat{\sigma}_n \leftarrow \sqrt{n^{-1} P_n D^{\star 2}(\{\hat{Q}^{g}_{\varepsilon_t^{\star}}\}_{t=1}^{\tau}, \hat{G})}$
7:  95% CI as $\hat{\psi}^{\dagger}_n \pm 1.96 \cdot \hat{\sigma}_n$
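
The backward loop of Algorithm 3 (steps 1 to 5) can be sketched on toy data as follows. This is an illustration under stated assumptions rather than the authors' code: the policy is taken to be deterministic, so the value at time $t$ is the fluctuated fit evaluated at the policy's action; follow-up times satisfy $T \leq \tau$; the weights stand in for $I_t(\hat{G})$; and each $\varepsilon_t^{\star}$ is found by bisection on the weighted residual score, a solver we picked for robustness. All names and numbers are hypothetical, and the standard error in step 6 is omitted because it requires the efficient influence function defined elsewhere in the paper.

```python
import math

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def fluctuate(q, eps):
    """Submodel (22): intercept shift of q on the logit scale."""
    return expit(math.log(q / (1.0 - q)) + eps)

def weighted_score(eps, q, v, w):
    """Sum of w * (v - q_eps); zero at the minimizer of the weighted cross-entropy."""
    return sum(wi * (vi - fluctuate(qi, eps)) for qi, vi, wi in zip(q, v, w))

def solve_epsilon(q, v, w, lo=-10.0, hi=10.0, iters=200):
    """Bisection on the score, which is decreasing in eps."""
    if sum(w) == 0.0:
        return 0.0  # no informative observations: leave the fit unchanged
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if weighted_score(mid, q, v, w) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def sequential_targeting(tau, T, Y, q_obs, q_pol, w):
    """Backward pass over t = tau, ..., 1; returns (eps, psi_hat, residual scores)."""
    n = len(T)
    eps = {tau + 1: 0.0}
    scores = {}
    for t in range(tau, 0, -1):
        at_risk = [i for i in range(n) if T[i] >= t]
        # Target: the outcome if follow-up ends at t, else the targeted value at t + 1.
        v = [Y[i] if T[i] == t else fluctuate(q_pol[t + 1][i], eps[t + 1])
             for i in at_risk]
        q = [q_obs[t][i] for i in at_risk]
        wt = [w[t][i] for i in at_risk]
        eps[t] = solve_epsilon(q, v, wt)
        scores[t] = weighted_score(eps[t], q, v, wt)
    psi = sum(fluctuate(q_pol[1][i], eps[1]) for i in range(n)) / n
    return eps, psi, scores

# Hypothetical toy data: 4 individuals, tau = 2. Entries for individuals who are
# no longer at risk at a given time are placeholders and are never read.
tau = 2
T = [1, 2, 2, 2]
Y = [1, 1, 1, 0]
q_obs = {1: [0.3, 0.2, 0.4, 0.25], 2: [0.5, 0.3, 0.5, 0.2]}   # fit at observed action
q_pol = {1: [0.35, 0.2, 0.3, 0.25], 2: [0.5, 0.3, 0.45, 0.2]}  # fit at policy action
w = {1: [1.2, 1.1, 0.0, 1.3], 2: [0.0, 1.5, 0.0, 1.8]}

eps, psi, scores = sequential_targeting(tau, T, Y, q_obs, q_pol, w)
```

Note that the initial fits `q_obs` and `q_pol` are never refitted inside the loop; only the scalar shifts change, which mirrors the point that the initial estimates do not depend on later targeting steps.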
Comparison to LTMLE

While the error can still propagate as we move backward in time, it propagates only through the targeting steps, whereas in LTMLE it can also propagate through the regressions. At each time step $t$, LTMLE first regresses $\hat{V}^{g}_{t+1,\varepsilon^{\star}_{t+1}}$ on $pa(Y_t)$ to obtain an estimate $\hat{Q}^{g}_{t}$ and then performs targeting through the submodel in (22). In contrast, we use only the initial estimate $\hat{Q}^{g}_{t}$ from our transformer fit, which does not depend on $\varepsilon^{\star}_{t+1}, \ldots, \varepsilon^{\star}_{\tau}$.

Why not target through an additional loss function

As in DeepACE, targeting can be performed by introducing an additional loss component to further train the transformer built in the first step. This additional loss function has its derivative equal to the efficient influence function (EIF). However, we find that the penalty factor on this loss is hard to tune; in nearly all cases it is hard to guarantee that the EIF is solved, and most of the time the additional training hurts the initial fits, as shown in Section 5.1.

Figure 2: Results from simple synthetic data with continuous outcome. Left: Sampling distributions of estimates. Right: Sampling distributions of empirical means of estimated efficient influence functions.
Table 1: Results from complex synthetic data. LTMLE (GLM): LTMLE with GLM; LTMLE (SL): LTMLE with super learner of GLM, MARS, and XGBoost; Deep LTMLE: initial estimate with TDHT; Deep LTMLE $\dagger$: TDHT with sequential targeting; Deep LTMLE $\star$: TDHT with temporal-difference targeting.

| Model | Bias ($\tau=10$) | Bias ($\tau=20$) | Bias ($\tau=30$) | RMSE ($\tau=10$) | RMSE ($\tau=20$) | RMSE ($\tau=30$) | Coverage ($\tau=10$) | Coverage ($\tau=20$) | Coverage ($\tau=30$) | Mean $\hat{\sigma}_n$ ($\tau=10$) | Mean $\hat{\sigma}_n$ ($\tau=20$) | Mean $\hat{\sigma}_n$ ($\tau=30$) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LTMLE (GLM) | 0.0230 | 0.0766 | 0.1344 | 0.0265 | 0.0796 | 0.1381 | 1.00 | 1.00 | 1.00 | 0.43 | 0.69 | 0.76 |
| LTMLE (SL) | 0.0144 | 0.0297 | 0.0477 | 0.0185 | 0.0344 | 0.0545 | 1.00 | 1.00 | 1.00 | 0.31 | 0.40 | 0.45 |
| DeepACE | -0.0704 | -0.1491 | -0.2396 | 0.0948 | 0.1601 | 0.2453 | 1.00 | 1.00 | 1.00 | 0.74 | 0.69 | 0.57 |
| Deep LTMLE | 0.0182 | 0.0304 | 0.0499 | 0.0264 | 0.0342 | 0.0532 | 1.00 | 0.94 | 0.71 | 0.17 | 0.09 | 0.06 |
| Deep LTMLE $\dagger$ | 0.0158 | 0.0286 | 0.0548 | 0.0188 | 0.0314 | 0.0589 | 1.00 | 0.93 | 0.73 | 0.16 | 0.09 | 0.06 |
| Deep LTMLE $\star$ | 0.0143 | 0.0305 | 0.0471 | 0.0204 | 0.0333 | 0.0509 | 1.00 | 0.93 | 0.76 | 0.16 | 0.08 | 0.06 |

5 Experiments

We conducted two experiments. In the first experiment, we compared the bias, root-mean-squared error (RMSE), and coverage probability of our estimator with those of existing estimators, based on 100 repeated estimations, for both continuous and survival outcomes. The second experiment is an application of our proposed method to real-world data.
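
For concreteness, the three evaluation metrics can be computed from repeated estimates as in the following minimal sketch. It assumes that each replication yields a point estimate and an estimated standard error, and that coverage is assessed with Wald-type 95% intervals; the function name and the toy numbers are hypothetical and are not results from the paper.

```python
import math

def summarize(estimates, std_errors, truth, z=1.96):
    """Bias, RMSE, coverage of Wald intervals, and mean standard error over replications."""
    n_rep = len(estimates)
    bias = sum(e - truth for e in estimates) / n_rep
    rmse = math.sqrt(sum((e - truth) ** 2 for e in estimates) / n_rep)
    covered = sum(1 for e, s in zip(estimates, std_errors)
                  if e - z * s <= truth <= e + z * s)
    return {
        "bias": bias,
        "rmse": rmse,
        "coverage": covered / n_rep,
        "mean_se": sum(std_errors) / n_rep,
    }

# Hypothetical replications with a known true value of 0.5.
estimates = [0.52, 0.47, 0.55, 0.50]
std_errors = [0.02, 0.02, 0.02, 0.02]
summary = summarize(estimates, std_errors, truth=0.5)
```

In this toy example three of the four intervals contain the true value, so the empirical coverage is 0.75.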

5.1 Synthetic Data with Continuous Outcome

We start with a very simple data-generating process with a continuous outcome, $n = 500$, and $\tau = 10$. The data-generating process is described in Section F.1. After fitting DeepACE, we additionally applied our targeting procedures to its fit.

The results are shown in Figure 2. The initial fits of Deep LTMLE and DeepACE had comparable bias. Even with the targeting loss, DeepACE failed to solve the efficient influence function. In contrast, because the targeting step is separated from the initial fit in our method, we solved it completely and succeeded in correcting the bias.

5.2 Synthetic Data with Survival Outcome

Table 2: Results from semi-synthetic data with unmeasured confounding

| Model | Bias ($\tau=10$) | Bias ($\tau=20$) | Bias ($\tau=30$) | RMSE ($\tau=10$) | RMSE ($\tau=20$) | RMSE ($\tau=30$) | Coverage ($\tau=10$) | Coverage ($\tau=20$) | Coverage ($\tau=30$) | Mean $\hat{\sigma}_n$ ($\tau=10$) | Mean $\hat{\sigma}_n$ ($\tau=20$) | Mean $\hat{\sigma}_n$ ($\tau=30$) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LTMLE (SL) | 0.0075 | 0.0341 | 0.0574 | 0.0138 | 0.0491 | 0.0786 | 0.70 | 0.45 | 0.25 | 0.09 | 0.12 | 0.14 |
| DeepACE | -0.0174 | -0.0434 | -0.0770 | 0.0788 | 0.1154 | 0.1341 | 1.00 | 1.00 | 1.00 | 0.67 | 0.78 | 0.86 |
| Deep LTMLE | -0.0002 | 0.0108 | 0.0041 | 0.0162 | 0.0720 | 0.0772 | 1.00 | 0.95 | 1.00 | 0.18 | 0.27 | 0.32 |
| Deep LTMLE $\star$ | 0.0058 | 0.0429 | 0.0709 | 0.0205 | 0.0724 | 0.0968 | 0.95 | 0.90 | 0.95 | 0.18 | 0.26 | 0.31 |

Next, we evaluated Deep LTMLE under a highly complex data-generating process with survival outcomes, five-dimensional time-dependent covariates, non-Markovian dependencies, $n = 1000$, and $\tau = 10, 20, 30$, imitating the setups from previous studies (Bica et al., 2020; Frauen et al., 2023). See Section F.3 for details.

Results are presented in Table 1. Deep LTMLE on average achieves a lower RMSE than the other methods, particularly in scenarios with larger $\tau$, indicating its robustness in complex and realistic scenarios without Markovian dependencies. The benefits of our targeting procedures are evident for $\tau = 10, 20$. For $\tau = 30$, we still see reductions in bias and RMSE when the temporal-difference targeting is applied. While Deep LTMLE’s coverage probability diminished at $\tau = 30$, the confidence intervals generated by LTMLE and DeepACE were notably over-conservative, with large estimated standard errors.

The pronounced bias of DeepACE can likely be attributed to three factors. First, DeepACE’s use of the squared-error loss for the outcome is known to induce greater bias for sparse outcomes, a common scenario in survival analysis, than the logistic loss used in our approach (Gruber & van der Laan, 2010). Second, DeepACE failed to solve the efficient influence function. Third, DeepACE does not account for the degeneration of the survival outcome.

Simple Synthetic Data with Survival Outcome

We also conducted an experiment with very simple synthetic survival data with one-dimensional time-dependent covariates, $n = 1000$, and $\tau = 10, 20, 30$. Although LTMLE with GLM is expected to perform strongly in this experiment, Deep LTMLE remains highly competitive, matching LTMLE’s performance (Section G).

5.3 Semi-Synthetic Data

To evaluate the performance of the proposed methods, we generated realistic data from the Circulatory Risk in Communities Study (CIRCS) (Yamagishi et al., 2019), a long-term, ongoing cardiovascular epidemiological cohort study that has lasted over half a century. See Section G.1 for details.

Table 2 shows the results for semi-synthetic data with unmeasured confounding, which reflects a real-world setting. Deep LTMLE performed best in terms of bias for all time horizons. Furthermore, as the time horizon increases from 10 to 30, LTMLE’s coverage probability drops as low as 0.25. In contrast, Deep LTMLE retains nominal coverage even in the longest time-horizon setting.

5.4 Real World Data

We applied Deep LTMLE to real world data from CIRCS. We estimated the counterfactual mean outcomes after 30 years of sustained management under the standard blood pressure management strategy, which keeps systolic blood pressure (SBP) below 140 mmHg, and under the intensive blood pressure management strategy, which keeps SBP below 120 mmHg.

In real world applications, we often encounter the practical problem of censoring, that is, loss to follow-up. Our model can be easily generalized to cover this setting by adding censoring nodes. Details are described in Section H of the Appendix.

The results are shown in Figure 3. The average treatment effect (ATE) of the intensive management strategy relative to the standard management strategy first increased, peaking at 20 years after baseline, and then decreased with some fluctuation. The direction and trend of the ATE are consistent with the difference in empirical means of cumulative outcomes between the two groups that followed the two strategies.

Figure 3: Counterfactual mean outcomes and 95% simultaneous confidence intervals according to standard and intensive treatment policies among 13,485 participants of the CIRCS.

5.5 Computation Details

DeepACE and Deep LTMLE were run on a GPU (Tesla T4) with 16 GB memory, and LTMLE was run on a CPU (Intel Xeon Skylake 6230 @ 2.1 GHz) with 40 cores and 96 GB memory. We used the R package ltmle with GLM and a super learner (SL) library consisting of GLM, multivariate adaptive regression splines with the earth package, and xgboost for the simple synthetic data and the real world data (Lendle et al., 2017; Polley et al., 2021; Milborrow, 2023; Chen et al., 2022). Confidence intervals for LTMLE were constructed based on its estimate of the efficient influence function.

Table 3: Running time with complex synthetic data

| Model | Time, sec ($\tau=10$) | Time, sec ($\tau=20$) | Time, sec ($\tau=30$) |
|---|---|---|---|
| LTMLE (SL) | 271 | 958 | 2122 |
| DeepACE | 53 | 54 | 133 |
| Deep LTMLE $\star$ | 38 | 39 | 116 |

As shown in Table 3, Deep LTMLE leverages GPU acceleration to achieve significantly faster processing times than LTMLE, presenting a substantial computational benefit for analyses involving extensive time horizons and high-dimensional time-dependent covariates.
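
To make the size of the difference concrete, the ratios of the running times in Table 3 can be computed directly. The short sketch below uses only the reported numbers; the dictionary layout and variable names are ours. The ratios compare wall-clock time on the different hardware described above (GPU versus CPU), so they should not be read as a like-for-like algorithmic speedup.

```python
# Running times in seconds, copied from Table 3.
times = {
    "LTMLE (SL)": {10: 271, 20: 958, 30: 2122},
    "Deep LTMLE (star)": {10: 38, 20: 39, 30: 116},
}

# Ratio of LTMLE (SL) time to Deep LTMLE (star) time at each horizon.
speedup = {
    tau: times["LTMLE (SL)"][tau] / times["Deep LTMLE (star)"][tau]
    for tau in (10, 20, 30)
}

for tau, ratio in speedup.items():
    print(f"tau={tau}: LTMLE (SL) took {ratio:.1f}x as long")
```

The ratio is roughly 7 at $\tau = 10$, about 25 at $\tau = 20$, and about 18 at $\tau = 30$.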

6 Limitations

Our method relies on the sequential randomization assumption and the positivity assumption on the intervention mechanism to identify the counterfactual outcome from observational data. However, to our surprise, in the semi-synthetic data simulations we found that our method remained very robust, and could even support reliable inference, when unmeasured confounding violated the sequential randomization assumption on which we rely. Furthermore, our proposed model does not currently address several complexities often found in real-world data, such as visiting processes, competing risks, and continuous time horizons. These challenges will be the focus of our future research efforts.

7 Conclusion

In this paper, we proposed a variant of LTMLE that leverages the sequential learning capabilities of transformers. This approach enables simultaneous fitting of the entire LTMLE, allowing us to target the mean survival under dynamic interventions directly by weighting the loss function with cumulative inverse probabilities of intervention. The proposed method performs competitively with asymptotically efficient estimators in low-dimensional settings and exceeds the performance of existing models in high-dimensional scenarios. The results also suggest that our model scales to larger datasets and longer time horizons. We applied our method to real world data and demonstrated causal inference on the effect of sustained blood pressure management strategies on total mortality.

Acknowledgement

This research is funded by NIH and Berkeley School of Public Health, Interdisciplinary Collaborative Research Grant. TS is supported by the Fulbright scholarship program. The authors thank Dr. Ahmed Alaa at University of California, San Francisco and Berkeley for valuable discussions. The authors acknowledge the CIRCS investigators team for providing the real world data for the experiments: Dr. Akihiko Kitamura at Yao City, Dr. Masahiko Kiyama at Osaka Center for Prevention of Cardiovascular Diseases, Dr. Takeo Okada at Osaka Center for Prevention of Cardiovascular Diseases, Dr. Yuji Shimizu at Osaka Center for Prevention of Cardiovascular Diseases, Dr. Hironori Imano at Kinki University, Dr. Tetsuya Ohira at Fukushima Prefecture Medical University, Dr. Kazumasa Yamagishi at Tsukuba University, and Dr. Isao Muraki at Osaka University.

References

  • Bang & Robins (2005) Bang, H. and Robins, J. M. Doubly robust estimation in missing data and causal inference models. Biometrics, 61(4):962–973, 2005.
  • Bica et al. (2020) Bica, I., Alaa, A. M., Jordon, J., and van der Schaar, M. Estimating counterfactual treatment outcomes over time through adversarially balanced representations. In International Conference on Learning Representations, 2020.
  • Bickel et al. (1993) Bickel, P., Klaassen, C., Ritov, Y., and Wellner, J. Efficient and Adaptive Estimation for Semiparametric Models. Johns Hopkins Series in the Mathematical Sciences. Springer New York, 1993. ISBN 978-0-387-98473-5.
  • Brown et al. (2020) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.
  • Chen et al. (2022) Chen, T., He, T., Benesty, M., Khotilovich, V., Tang, Y., Cho, H., Chen, K., Mitchell, R., Cano, I., Zhou, T., Li, M., Xie, J., Lin, M., Geng, Y., Li, Y., and Yuan, J. xgboost: Extreme Gradient Boosting, 2022. URL https://CRAN.R-project.org/package=xgboost. R package version 1.7.6.1.
  • Chernozhukov et al. (2018) Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., and Robins, J. Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1):C1–C68, 2018.
  • Chernozhukov et al. (2022) Chernozhukov, V., Escanciano, J. C., Ichimura, H., Newey, W. K., and Robins, J. M. Locally robust semiparametric estimation. Econometrica, 90(4):1501–1535, 2022.
  • Dang et al. (2023) Dang, L. E., Gruber, S., Lee, H., Dahabreh, I. J., Stuart, E. A., Williamson, B. D., Wyss, R., Díaz, I., Ghosh, D., Kıcıman, E., Alemayehu, D., Hoffman, K. L., Vossen, C. Y., Huml, R. A., Ravn, H., Kvist, K., Pratley, R., Shih, M.-C., Pennello, G., Martin, D., Waddy, S. P., Barr, C. E., Akacha, M., Buse, J. B., van der Laan, M., and Petersen, M. A causal roadmap for generating high-quality real-world evidence. Journal of Clinical and Translational Science, 7(1):e212, 2023.
  • Farajtabar et al. (2018) Farajtabar, M., Chow, Y., and Ghavamzadeh, M. More robust doubly robust off-policy evaluation. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp.  1447–1456. PMLR, 10–15 Jul 2018.
  • Frauen et al. (2023) Frauen, D., Hatt, T., Melnychuk, V., and Feuerriegel, S. Estimating average causal effects from patient trajectories. Proceedings of the AAAI Conference on Artificial Intelligence, 37(6):7586–7594, 2023.
  • Gruber & van der Laan (2010) Gruber, S. and van der Laan, M. J. A targeted maximum likelihood estimator of a causal effect on a bounded continuous outcome. The International Journal of Biostatistics, 6(1):Article 26, 2010. ISSN 1557-4679. doi: 10.2202/1557-4679.1260.
  • Gruber & van der Laan (2012) Gruber, S. and van der Laan, M. J. Targeted minimum loss based estimation of a causal effect on an outcome with known conditional bounds. The International Journal of Biostatistics, 8(1):21–21, 2012. ISSN 1557-4679.
  • Jiang & Li (2016) Jiang, N. and Li, L. Doubly robust off-policy value evaluation for reinforcement learning. In Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pp.  652–661, New York, New York, USA, 20–22 Jun 2016. PMLR.
  • Kallus & Uehara (2020) Kallus, N. and Uehara, M. Double reinforcement learning for efficient and robust off-policy evaluation. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp.  5078–5088. PMLR, 13–18 Jul 2020.
  • Kennedy (2022) Kennedy, E. H. Semiparametric doubly robust targeted double machine learning: A review. arXiv preprint arXiv:2203.06469, 2022.
  • Klaassen (1987) Klaassen, C. A. J. Consistent estimation of the influence function of locally asymptotically linear estimators. The Annals of Statistics, 15(4):1548–1562, 1987.
  • Lendle et al. (2017) Lendle, S. D., Schwab, J., Petersen, M. L., and van der Laan, M. J. ltmle: An R package implementing targeted minimum loss-based estimation for longitudinal data. Journal of Statistical Software, 81(1):1–21, 2017. doi: 10.18637/jss.v081.i01.
  • Levine et al. (2020) Levine, S., Kumar, A., Tucker, G., and Fu, J. Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems. arXiv preprint arXiv:2005.01643, 2020.
  • Li et al. (2021) Li, R., Hu, S., Lu, M., Utsumi, Y., Chakraborty, P., Sow, D. M., Madan, P., Li, J., Ghalwash, M., Shahn, Z., and Lehman, L.-w. G-net: A recurrent network approach to G-computation for counterfactual prediction under a dynamic treatment regime. In Proceedings of Machine Learning for Health, volume 158 of Proceedings of Machine Learning Research, pp.  282–299. PMLR, 2021.
  • Melnychuk et al. (2022) Melnychuk, V., Frauen, D., and Feuerriegel, S. Causal transformer for estimating counterfactual outcomes. In Chaudhuri, K., Jegelka, S., Song, L., Szepesvari, C., Niu, G., and Sabato, S. (eds.), Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp.  15293–15329. PMLR, 2022.
  • Milborrow (2023) Milborrow, S. earth: Multivariate Adaptive Regression Splines, 2023. URL https://CRAN.R-project.org/package=earth. R package version 5.3.2.
  • Mnih et al. (2013) Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
  • Narita et al. (2021) Narita, Y., Yasui, S., and Yata, K. Debiased off-policy evaluation for recommendation systems. In Proceedings of the 15th ACM Conference on Recommender Systems, RecSys ’21, pp.  372–379, New York, NY, USA, 2021. Association for Computing Machinery. ISBN 9781450384582.
  • Petersen & van der Laan (2014) Petersen, M. L. and van der Laan, M. J. Causal models and learning from data: Integrating causal modeling and statistical estimation. Epidemiology (Cambridge, Mass.), 25(3):418–426, 2014.
  • Polley et al. (2021) Polley, E., LeDell, E., Kennedy, C., and van der Laan, M. SuperLearner: Super Learner Prediction, 2021. URL https://CRAN.R-project.org/package=SuperLearner. R package version 2.0-28.1.
  • Robins (1986) Robins, J. A new approach to causal inference in mortality studies with a sustained exposure period—application to control of the healthy worker survivor effect. Mathematical modelling, 7(9-12):1393–1512, 1986.
  • Robins et al. (1994) Robins, J. M., Rotnitzky, A., and Zhao, L. P. Estimation of Regression Coefficients When Some Regressors Are Not Always Observed. Journal of the American Statistical Association, 89(427):846–866, 1994.
  • Salerno & Li (2023) Salerno, S. and Li, Y. High-dimensional survival analysis: Methods and applications. Annual review of statistics and its application, 10:25–49, 2023.
  • Sutton (1988) Sutton, R. S. Learning to predict by the methods of temporal differences. Machine learning, 3:9–44, 1988.
  • van der Laan & Rubin (2006) van der Laan, M. and Rubin, D. Targeted Maximum Likelihood Learning. The International Journal of Biostatistics, 2(1), 2006.
  • van der Laan & Gruber (2012) van der Laan, M. J. and Gruber, S. Targeted Minimum Loss Based Estimation of Causal Effects of Multiple Time Point Interventions. The International Journal of Biostatistics, 8(1), 2012.
  • van der Laan & Robins (2003) van der Laan, M. J. and Robins, J. Unified Methods for Censored Longitudinal Data and Causality. Springer Series in Statistics. Springer New York, 2003. ISBN 978-0-387-21700-0.
  • van der Laan & Rose (2011) van der Laan, M. J. and Rose, S. Targeted Learning: Causal Inference for Observational and Experimental Data. Springer Series in Statistics. Springer, 2011. ISBN 978-1-4419-9781-4 978-1-4419-9782-1.
  • van der Laan & Rose (2018) van der Laan, M. J. and Rose, S. Targeted Learning in Data Science: Causal Inference for Complex Longitudinal Studies. Springer Series in Statistics. Springer International Publishing, 2018.
  • van der Laan et al. (2007) van der Laan, M. J., Polley, E. C., and Hubbard, A. E. Super learner. Statistical Applications in Genetics and Molecular Biology, 6(1):1309–1309, 2007. ISSN 1544-6115.
  • Vaswani et al. (2017) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
  • Wyss et al. (2022) Wyss, R., Yanover, C., El‐Hay, T., Bennett, D., Platt, R. W., Zullo, A. R., Sari, G., Wen, X., Ye, Y., Yuan, H., Gokhale, M., Patorno, E., and Lin, K. J. Machine learning for improving high‐dimensional proxy confounder adjustment in healthcare database studies: An overview of the current literature. Pharmacoepidemiology and drug safety, 31(9):932–943, 2022. ISSN 1053-8569.
  • Yamagishi et al. (2019) Yamagishi, K., Muraki, I., Kubota, Y., Hayama-Terada, M., Imano, H., Cui, R., Umesawa, M., Shimizu, Y., Sankai, T., Okada, T., Sato, S., Kitamura, A., Kiyama, M., and Iso, H. The Circulatory Risk in Communities Study (CIRCS): A Long-Term Epidemiological Study for Lifestyle-Related Disease Among Japanese Men and Women Living in Communities. Journal of Epidemiology, 29(3):83–91, 2019.

Appendix A Notation

Here we list the notation used in the article.

$O$ Observed variables $O = (W = W_0, L_1, A_1, Y_1, L_2, A_2, Y_2, \ldots, L_\tau, A_\tau, Y_\tau)$
$\tau$ Maximum length of time-horizon
$T$ Stopping time
$W$ Baseline covariates
$L_t$ Time-dependent covariates (states)
$A_t$ Time-dependent treatments (controls)
$Y_t$ Outcomes. In survival case, binary failure indicator defined as $Y_t = \mathbbm{1}\{T \leq t\}$
$Y$ Outcome at the end of the trajectory: $Y = Y_T$
$pa(X)$ Parent nodes of $X$. For example, $pa(L_t) = (W, L_{1:t-1}, A_{1:t-1}, Y_{1:t-1})$
$P_0$ The true distribution of the observed variable
$\hat{P}_n$ Estimator of $P_0$
$\pi$ Propensity scores $\pi = [\pi_t]_1^\tau$ with $\pi(da_t \mid pa(a_t)) = \mathbb{P}(da_t \mid pa(a_t))$
$g$ User-specified treatment policies $g = [g_t]_{t=1}^{\tau}$
$\psi$ Target functional $\psi(P) = \mathbb{E}_g Y$
$\psi_0$ True parameter $\psi_0 = \psi(P_0)$
$\hat{\psi}_n$ Estimator of $\psi_0$
$\hat{\sigma}_n$ Estimator of the standard error of the estimator $\hat{\psi}_n$
$Q^g$ State-action value functions $Q^g = [Q^g_t]_{t=1}^{\tau}$
$Q^g_t$ $= Q^g_t(pa(Y_t)) = \mathbb{E}_g[Y \mid pa(Y_t)]$
$V^g$ Value functions $V^g = [V^g_t]_{t=1}^{\tau+1}$
$V^g_t$ $= V^g_t(pa(A_t)) = \mathbb{E}_g[Y \mid pa(A_t)]$ for $t = 1, \ldots, \tau$ and $V^g_{\tau+1} = Y_\tau$
$G$ Propensity scores $G = [G_t]_1^\tau$ with $G(da_t \mid pa(a_t)) = \mathbb{P}(da_t \mid pa(a_t))$
$I_t$ Clever covariates (importance weights) $I_t = \prod_{s=1}^{t} dg_s / d\pi_s (O)$
$D^\star$ Efficient influence function of $\psi$: $D^\star(Q^g, V^g; G) = V^g_1 - \psi_0 + \sum_{t=1}^{\tau} (V^g_{t+1} - Q^g_t)$
$Q^g_\varepsilon$ Local least favorable submodel $Q^g_\varepsilon = [Q^g_{t,\varepsilon}]_{t=1}^{\tau}$
$Q^g_t$ $\operatorname{logit} Q^g_{t,\varepsilon} = \operatorname{logit} Q^g_t + \varepsilon$
$V^g_\varepsilon$ Local least favorable submodel $V^g_\varepsilon = [V^g_{t,\varepsilon}]_{t=1}^{\tau+1}$
$V^g_t$ $\operatorname{logit} V^g_{t,\varepsilon} = \operatorname{logit} V^g_t + \varepsilon$ for $t = 1, \ldots, \tau$ and $V^g_{\tau+1,\varepsilon} = V^g_{\tau+1}$
$\mathcal{L}(Q^g, V^g)$ Loss function for temporal difference learning
$\mathcal{L}^\star$ Loss function for targeting
$\alpha$ Weight for the propensity loss (hyperparameter)
$Pf$ Mean of a function $f$ under the distribution $P$: $Pf = \int f \, dP$
$E(\bullet_t)$ Embedding of a node $\bullet_t$
$e_\bullet(\bullet_t)$ Type embedding of a node $\bullet_t$
$E_t$ Positional encoding at time $t$
$f_X$ Production function of a node $X$

Appendix B Proof

Proof of Theorem 4.2.

A direct calculation shows

$$\partial_\varepsilon \mathcal{L}^\star_t(Q^g_{t,\varepsilon}, V^g_{t+1,\varepsilon^\star}) = I_t(G)\,[\partial_\varepsilon Q^g_{t,\varepsilon}]\,[\partial_{Q^g_{t,\varepsilon}} \mathcal{L}_{\mathrm{bce}}(Q^g_{t,\varepsilon}, V^g_{t+1,\varepsilon^\star})] = I_t(G)\,(V^g_{t+1,\varepsilon^\star} - Q^g_{t,\varepsilon}).$$

Substituting $\varepsilon^\star$ for $\varepsilon$ yields the $t$-th component of the efficient influence function (34) at $(Q^g_{\varepsilon^\star}, V^g_{\varepsilon^\star}, G)$. ∎
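The direct calculation can be checked numerically. The sketch below assumes the usual negative log-likelihood form of the binary cross-entropy, under which the derivative in $\varepsilon$ equals $I_t\,(Q^g_{t,\varepsilon} - V^g_{t+1,\varepsilon^\star})$, that is, the displayed expression up to the sign convention of the loss; the root in $\varepsilon$ is the same under either convention. The function names and the numeric values are illustrative assumptions, not taken from the paper.

```python
import math

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def logit(p):
    return math.log(p / (1.0 - p))

def weighted_bce(eps, q, v, weight):
    """weight * BCE(Q_eps, V) with logit(Q_eps) = logit(Q) + eps (negative log-likelihood)."""
    q_eps = expit(logit(q) + eps)
    return -weight * (v * math.log(q_eps) + (1.0 - v) * math.log(1.0 - q_eps))

def analytic_derivative(eps, q, v, weight):
    """Chain rule: dQ_eps/d(eps) = Q_eps (1 - Q_eps), dBCE/dQ_eps = (Q_eps - V) / (Q_eps (1 - Q_eps))."""
    q_eps = expit(logit(q) + eps)
    return weight * (q_eps - v)

def numeric_derivative(eps, q, v, weight, h=1e-6):
    """Central finite difference in eps."""
    upper = weighted_bce(eps + h, q, v, weight)
    lower = weighted_bce(eps - h, q, v, weight)
    return (upper - lower) / (2.0 * h)

# Hypothetical values for Q_t, the soft target V_{t+1}, and the weight I_t.
q, v, weight, eps = 0.3, 0.8, 1.7, 0.25
gap = abs(numeric_derivative(eps, q, v, weight) - analytic_derivative(eps, q, v, weight))
print(gap)
```

The two derivatives agree up to finite-difference error, which confirms that the fluctuation along the logit scale turns the weighted loss gradient into a weighted residual.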

Appendix C Review of TMLE

C.1 Structural Causal Model

We assume that each node depends on all the previous nodes in the trajectory; that is, we do not assume the Markov property. Each node $X$ is produced from its parent nodes $pa(X)$ and an independent noise random variable $U_X$ by a measurable function $f_X$: $X = f_X(pa(X), U_X)$. This production function $f_X$ induces a conditional distribution $P_{X|H}$ of $X$ given $H = pa(X)$ by pushing forward the distribution of the noise variable: $P_{X|H}(A \mid h) = (P_{U_X} \circ f_X^{-1}(h, \cdot))(A)$ for all measurable $A \subset \mathcal{X}$, where $\mathcal{X}$ is the domain of the random vector $X$. Starting from the nodes without parents, including the noise nodes and their distributions, the production functions and their causal structure, which can be described by a directed acyclic graph over the observables, generate the joint distribution of the observed random variables.
With our longitudinal data, we define the propensity score $\pi_t = P_{A_t \mid pa(A_t)}$, where $pa(A_t)$ is the patient history before the node $A_t$. We use the same symbol for the production function when the treatment assignment is deterministic, that is, when no noise variable enters the generation of the treatment node: $d\pi_t(A_t \mid pa(A_t)) = 1$ if $A_t = a_t$ for some specific $a_t$.
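The generative description above can be sketched as a small simulator in which every node is produced by a production function of its full history and an independent noise draw. The functional forms, the coefficients, and the names `simulate_trajectory`, `observed_policy`, and `static_treat_all` are hypothetical choices made only for illustration.

```python
import math
import random

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def observed_policy(history, u):
    """Stochastic treatment node: A_t = f_A(pa(A_t), U_A) with uniform noise u."""
    mean_l = sum(history["L"]) / len(history["L"])
    return 1 if u < expit(0.5 * mean_l) else 0

def static_treat_all(history, u):
    """Deterministic treatment node: the noise is ignored, so A_t = 1 with probability one."""
    return 1

def simulate_trajectory(tau, policy, rng):
    """Draw O = (W, L_1, A_1, Y_1, ..., L_tau, A_tau, Y_tau) node by node."""
    w = rng.gauss(0.0, 1.0)  # baseline covariate W = f_W(U_W)
    history = {"W": w, "L": [], "A": [], "Y": []}
    for _ in range(tau):
        # L_t depends on the whole treatment history (no Markov assumption).
        l_t = 0.5 * w - 0.3 * sum(history["A"]) + rng.gauss(0.0, 1.0)
        history["L"].append(l_t)
        a_t = policy(history, rng.random())
        history["A"].append(a_t)
        # Y_t is a failure indicator that stays at 1 once the event has occurred.
        u_y = rng.random()
        already_failed = bool(history["Y"]) and history["Y"][-1] == 1
        hazard = expit(-2.0 + 0.4 * l_t - 0.5 * a_t)
        history["Y"].append(1 if already_failed or u_y < hazard else 0)
    return history

rng = random.Random(0)
observed = simulate_trajectory(5, observed_policy, rng)
counterfactual = simulate_trajectory(5, static_treat_all, rng)
print(observed["A"], counterfactual["A"])
```

Swapping `observed_policy` for another policy while leaving the remaining production functions untouched is the simulation analog of intervening on the treatment nodes of the structural causal model.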

C.2 Causal Target Parameter and Identification

Our target parameter is the counterfactual mean of the final outcome $Y$ under the user-specified dynamic treatment policy $g = (g_t)$. This is the mean of the counterfactual outcome produced by replacing $\pi_t$ with $g_t$ in the structural causal model:

$$\psi^g(P) = E Y^g. \tag{24}$$

To identify this causal target parameter from observational data, we assume the following positivity condition:

$$g \ll \pi, \tag{25}$$

and the sequential randomization:

$$Y^g \perp A_t \mid pa(A_t) \text{ for } t = 1, \ldots, \tau. \tag{26}$$

Note that the consistency condition $Y = Y^\pi$ usually stated in the causal inference literature is a consequence of the definition of the counterfactual outcome in our structural causal model. Under these identifiability conditions, the parameter is identified through the g-formula, that is, as the mean of $Y$ under the counterfactual distribution obtained by replacing the distributions $\pi_t$ with $g_t$:

$$E Y^g = E_g Y. \tag{27}$$

The problem then reduces to the estimation of the statistical parameter:

$$\psi(P) = E_g Y. \tag{28}$$
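To make the identification concrete, the following sketch evaluates the g-formula by exact enumeration in a toy model with two time points, binary covariates and treatments, and a single final outcome (intermediate outcomes are omitted for brevity). It also checks that the same number is obtained as the mean of $Y$ under the observed distribution reweighted by $\prod_s dg_s / d\pi_s$, which is well defined because positivity holds in this toy model. All probabilities and function names are hypothetical.

```python
from itertools import product

def p_l1(l1):
    return 0.6 if l1 == 1 else 0.4

def p_l2(l2, l1, a1):
    p_one = 0.2 + 0.4 * l1 - 0.1 * a1
    return p_one if l2 == 1 else 1.0 - p_one

def mean_y(l1, a1, l2, a2):
    """E[Y | L1, A1, L2, A2], kept inside (0, 1)."""
    return 0.2 + 0.2 * l1 + 0.3 * l2 - 0.05 * a1 - 0.1 * a2

def pi1(a1, l1):
    """Observed treatment mechanism at t = 1 (positive on both arms)."""
    p_one = 0.3 + 0.4 * l1
    return p_one if a1 == 1 else 1.0 - p_one

def pi2(a2, l1, a1, l2):
    """Observed treatment mechanism at t = 2 (positive on both arms)."""
    p_one = 0.2 + 0.3 * l2 + 0.2 * a1
    return p_one if a2 == 1 else 1.0 - p_one

def g1(a1, l1):
    """Dynamic deterministic policy: treat exactly when L1 = 1."""
    return 1.0 if a1 == l1 else 0.0

def g2(a2, l1, a1, l2):
    """Dynamic deterministic policy: treat exactly when L2 = 1."""
    return 1.0 if a2 == l2 else 0.0

def g_formula():
    """E_g Y: replace pi_t by g_t in the factorized joint distribution."""
    return sum(
        p_l1(l1) * g1(a1, l1) * p_l2(l2, l1, a1) * g2(a2, l1, a1, l2) * mean_y(l1, a1, l2, a2)
        for l1, a1, l2, a2 in product((0, 1), repeat=4)
    )

def weighted_observed_mean():
    """E_pi[I_2 Y] with I_2 = (g1 / pi1) * (g2 / pi2)."""
    total = 0.0
    for l1, a1, l2, a2 in product((0, 1), repeat=4):
        prob = p_l1(l1) * pi1(a1, l1) * p_l2(l2, l1, a1) * pi2(a2, l1, a1, l2)
        weight = g1(a1, l1) / pi1(a1, l1) * g2(a2, l1, a1, l2) / pi2(a2, l1, a1, l2)
        total += prob * weight * mean_y(l1, a1, l2, a2)
    return total

def observed_mean():
    """E_pi Y, the mean under the observed treatment mechanism."""
    return sum(
        p_l1(l1) * pi1(a1, l1) * p_l2(l2, l1, a1) * pi2(a2, l1, a1, l2) * mean_y(l1, a1, l2, a2)
        for l1, a1, l2, a2 in product((0, 1), repeat=4)
    )

print(g_formula(), weighted_observed_mean(), observed_mean())
```

The first two numbers coincide, while the plain observed mean generally differs, which is why the treatment mechanism has to be replaced or reweighted rather than ignored.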

C.3 TMLE

Bias correction by TMLE is based on the following first-order approximation of the target functional around the true distribution $P_0$ (van der Laan & Rubin, 2006; van der Laan & Rose, 2011; Kennedy, 2022):

$$\psi(\hat{P}_n) - \psi(P_0) = -P_0 D^\star(\hat{P}_n) + R_2(\hat{P}_n, P_0), \tag{29}$$

where $D^\star$ is called the influence function and $R_2$ is the second-order remainder. This equation is the infinite-dimensional extension of the Taylor expansion.

The right hand side of this equation can be further written as:

$$-P_n D^\star(\hat{P}_n) + (P_n - P_0)\big[D^\star(\hat{P}_n) - D^\star(P_0)\big] + (P_n - P_0) D^\star(P_0) + R_2(\hat{P}_n, P_0), \tag{30}$$

whose second term, called the empirical process term, converges to zero faster than $n^{-1/2}$, that is, it is $o_{P_0}(n^{-1/2})$, if $D^\star(\hat{P}_n)$ and $D^\star(P_0)$ belong to a Donsker class and $D^\star(\hat{P}_n)$ converges to $D^\star(P_0)$ in $L^2(P_0)$. Given a good initial fit $\hat{P}_n$ of $P_0$, the above conditions are usually satisfied and, in addition, $R_2(\hat{P}_n, P_0) = o_{P_0}(n^{-1/2})$. Thus, by further using the fact that the influence function satisfies $P_0 D^\star(P_0) = 0$, the right-hand side reduces to

$$-P_n D^\star(\hat{P}_n) + P_n D^\star(P_0) + o_{P_0}(n^{-1/2}). \tag{31}$$
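For completeness, the passage from (29) to (30) and then to (31) only adds and subtracts empirical means. Written out, with the shorthand $\hat{D} = D^\star(\hat{P}_n)$ and $D_0 = D^\star(P_0)$ introduced here for brevity:

```latex
\begin{align*}
-P_0 \hat{D}
  &= -P_n \hat{D} + (P_n - P_0)\hat{D} \\
  &= -P_n \hat{D} + (P_n - P_0)\big[\hat{D} - D_0\big] + (P_n - P_0) D_0, \\
(P_n - P_0) D_0
  &= P_n D_0 \qquad \text{since } P_0 D_0 = 0 .
\end{align*}
```

Adding $R_2(\hat{P}_n, P_0)$ to the second line gives (30), and absorbing the empirical process term and the remainder into $o_{P_0}(n^{-1/2})$ gives (31).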

Now, the idea is to find $\hat{P}_n^\star$ in a close neighborhood of $\hat{P}_n$ that makes the first term vanish:

PnD(P^n)=0.subscript𝑃𝑛superscript𝐷superscriptsubscript^𝑃𝑛0P_{n}D^{\star}(\hat{P}_{n}^{\star})=0.italic_P start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT italic_D start_POSTSUPERSCRIPT ⋆ end_POSTSUPERSCRIPT ( over^ start_ARG italic_P end_ARG start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⋆ end_POSTSUPERSCRIPT ) = 0 . (32)

By doing so, and applying the same arguments to $\hat{P}_n^{\star}$ instead of $\hat{P}_n$, we obtain

$$\psi(\hat{P}_n^{\star}) - \psi(P_0) = P_n D^{\star}(P_0) + o_{P_0}(n^{-1/2}). \tag{33}$$

Thus, our estimator $\psi(\hat{P}_n^{\star})$ is a plug-in estimator that is asymptotically linear with influence function $D^{\star}(P_0)$, and it therefore attains the efficiency bound among regular asymptotically linear estimators.
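The asymptotic linearity in (33) is what justifies Wald-type confidence intervals: by the central limit theorem, $\sqrt{n}\,(\psi(\hat{P}_n^{\star}) - \psi(P_0))$ is approximately normal with variance $P_0 D^{\star}(P_0)^2$. The following is a minimal Python sketch of that step, assuming the per-subject influence-curve values have already been evaluated at the fitted $\hat{P}_n^{\star}$. The function name `wald_confidence_interval` and the toy inputs are illustrative assumptions, not part of the paper.

```python
import numpy as np

def wald_confidence_interval(psi_hat, ic_values, z=1.96):
    """Wald-type interval from estimated influence-curve values (one per subject).

    Hypothetical helper: psi_hat is the point estimate, ic_values the values of
    the estimated efficient influence curve at each observation.
    """
    ic_values = np.asarray(ic_values, dtype=float)
    n = ic_values.shape[0]
    # Standard error implied by asymptotic linearity: sd(D*) / sqrt(n).
    std_error = ic_values.std(ddof=1) / np.sqrt(n)
    return psi_hat - z * std_error, psi_hat + z * std_error, std_error

# Toy usage with made-up influence-curve values (not data from the paper).
rng = np.random.default_rng(0)
toy_ic = rng.normal(loc=0.0, scale=0.5, size=1000)
lower, upper, se = wald_confidence_interval(0.30, toy_ic)
```

With `z=1.96` the interval has nominal 95% coverage, provided the conditions behind (33) hold.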

C.4 Efficient influence curve

The efficient influence function of our target parameter is given by (van der Laan & Gruber, 2012)

$$
\begin{aligned}
D^{\star}(P) &= \sum_{t=1}^{\tau} D^{\star}_t(P) \\
&= (V^g_1 - \psi_0) + \sum_{t=1}^{\tau-1} I_t (V^g_{t+1} - Q^g_t) + I_\tau (Y_\tau - Q^g_\tau) \\
&= (V^g_1 - \psi_0) + \sum_{t=1}^{\tau-1} \mathbbm{1}\{Y_{t-1} = 0\} I_t (V^g_{t+1} - Q^g_t) + \mathbbm{1}\{Y_{\tau-1} = 0\} I_\tau (Y_\tau - Q^g_\tau) \\
&= (V^g_1 - \psi_0) + \sum_{t=1}^{T-1} I_t (V^g_{t+1} - Q^g_t) + I_T (Y_T - Q^g_T),
\end{aligned}
\tag{34}
$$

where $Y_0 = 0$ by definition.
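As an illustration of how the second line of (34) could be evaluated per subject, here is a minimal Python sketch. It assumes the fitted quantities are stored as arrays of shape `(n, tau)` whose column `t-1` holds $Q^g_t$, $V^g_t$, and $I_t$, and that any at-risk indicator is already folded into $I_t$. The function name `eic_values` and this array layout are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def eic_values(Q, V, I, Y_tau, psi):
    """Per-subject values of (V_1 - psi) + sum_t I_t * (target_t - Q_t).

    Q, V, I: arrays of shape (n, tau); column t-1 holds Q^g_t, V^g_t, I_t.
    Y_tau:   array of shape (n,), the outcome at the final time point.
    """
    Q, V, I = (np.asarray(x, dtype=float) for x in (Q, V, I))
    Y_tau = np.asarray(Y_tau, dtype=float)
    # Regression targets: V^g_{t+1} for t < tau, and Y_tau for t = tau.
    targets = np.concatenate([V[:, 1:], Y_tau[:, None]], axis=1)
    return (V[:, 0] - psi) + np.sum(I * (targets - Q), axis=1)

# Toy check with two subjects and tau = 2 (made-up numbers).
Q = [[0.2, 0.5], [0.1, 0.4]]
V = [[0.3, 0.6], [0.2, 0.3]]
I = [[1.0, 2.0], [0.0, 0.0]]
d = eic_values(Q, V, I, Y_tau=[1, 0], psi=0.25)
```

The empirical mean of these values is the quantity that the targeting step drives to zero, and their sample variance feeds the standard error.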

Appendix D Convergence of Temporal Difference Learning

First, consider a flexible model $Q_\theta$ and the corresponding $V_{t,\theta} = \mathbb{E}_{A_t \sim g_t} Q_{t,\theta}$. Initialize $\theta^0$ and then iteratively update $\theta^{k+1} = \operatorname*{arg\,min}_{\theta} P\mathcal{L}(Q_\theta, V_{\theta^k})$ for $k = 0, 1, \ldots$ until convergence. Our proof below shows that if we use a variation-independent parameter space for each $Q_{t,\theta}$ and the parameter spaces contain the true $Q_{t,P}$, then within $K+2$ steps this algorithm will have converged to the true solution $Q_P$.

Ignoring the parameterization and thinking only in terms of optimizing over parameter spaces, this algorithm corresponds to the following: initialize $V^0$, and then for $k = 0, 1, \ldots$, compute $Q^{k+1} = \arg\min_{Q} P\mathcal{L}(Q, V^k)$, set $V^{k+1}$ to the value function implied by the intervention $g$ and $Q^{k+1}$, and set $k = k+1$.

Firstly, we claim that in a nonparametric model the $t$-specific parameters $Q_t$ are variation independent across $t$. Consider a given (misspecified) $V$. This implies a parameter space $\left\{ \mathbb{E}_{L_t \mid pa(L_t) \sim \mu_t(\cdot \mid pa(L_t))} V_t : \mu_t \right\}$ for the regressions $Q$. The parameter space of the free parameter $\mu_t$ is even larger than the parameter space of functions of $pa(L_t)$. Therefore, this appears to be a reasonable condition. We can then state that the parameter space over which we optimize at step $k$ of the algorithm is the Cartesian product of the parameter spaces $\mathcal{Q}_t$ for $Q_t$ across $t = \tau, \ldots, 1$.
Consider step $k = 1$ of the algorithm, in which the outcomes are $V^0$ and we optimize over all $Q \in \prod_{t=\tau}^{1} \mathcal{Q}_t$. Then $Q^1$ is the minimizer of $Q \mapsto P\mathcal{L}(Q, V^0)$. This means that the derivative with respect to $\varepsilon_t$ along a path $Q_t^1 + \varepsilon_t h_t$ through $Q_t^1$ in any direction $h_t$, evaluated at $\varepsilon_t = 0$, must equal zero for all $t = \tau, \ldots, 1$. Thus, at $\varepsilon = 0$, we have

$$\frac{d}{d\varepsilon} \sum_{t=1}^{\tau} \mathcal{L}_t(Q_t^1 + \varepsilon_t h_t \mid V^0_{t+1}) = 0.$$

Consider the derivative with respect to $\varepsilon_Y$. This yields the score equation $P h_{Q_\tau}(V_{\tau+1} - Q^1_\tau) = 0$ for all $h_{Q_\tau}$, which implies that $Q^1_\tau = Q_{\tau,P}$. The other components are some optimizers and need not equal the truth yet. Now we go to step $k = 2$. We know that $V^1_\tau = V_{\tau,P}$ because $Q^1_\tau = Q_{\tau,P}$. Therefore, at this step, the derivative with respect to $\varepsilon_{\tau-1}$ implies that $Q^2_{\tau-1} = Q_{\tau-1,P}$, while again $Q^2_\tau = Q^1_\tau = Q_{\tau,P}$. Then, at step $k = 3$, we also obtain $Q^3_{\tau-2} = Q_{\tau-2,P}$. Continuing in this manner, after $K+2$ steps we have $Q^{K+2} = Q_P$.
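The backward propagation of correctness in this argument can be checked numerically in a toy tabular setting. The sketch below is an illustration under strong simplifying assumptions that are not in the paper: a finite state space, a known transition kernel `P` standing in for the population distribution, a deterministic policy `g`, and an outcome that is a function `r` of the final state. Each sweep regresses every $Q_t$ on the previous sweep's $V_{t+1}$ simultaneously, so one more time point becomes exact per sweep and $\tau$ sweeps suffice in this toy case.

```python
import numpy as np

def backward_recursion(P, r, g, tau):
    """Exact Q_t by dynamic programming. P: (S, A, S), r: (S,), g: (S,) int actions."""
    S, A, _ = P.shape
    Q = np.zeros((tau, S, A))
    v_next = r
    for t in reversed(range(tau)):
        Q[t] = P @ v_next                   # E[V_{t+1} | state, action]
        v_next = Q[t][np.arange(S), g]      # V_t implied by the policy g
    return Q

def td_sweeps(P, r, g, tau, n_sweeps, rng):
    """Update all Q_t at once against the value functions from the previous sweep."""
    S, A, _ = P.shape
    V = rng.normal(size=(tau + 1, S))       # arbitrary (misspecified) starting values
    V[tau] = r                              # the final target is the outcome itself
    Q = np.zeros((tau, S, A))
    for _ in range(n_sweeps):
        Q = np.stack([P @ V[t + 1] for t in range(tau)])
        V[:tau] = Q[:, np.arange(S), g]     # value functions implied by g and the new Q
    return Q

rng = np.random.default_rng(0)
S, A, tau = 4, 2, 5
P = rng.dirichlet(np.ones(S), size=(S, A))  # random transition kernel, shape (S, A, S)
r = rng.uniform(size=S)
g = rng.integers(0, A, size=S)
Q_true = backward_recursion(P, r, g, tau)
Q_td = td_sweeps(P, r, g, tau, n_sweeps=tau, rng=rng)
```

With fewer than `tau` sweeps, the earliest time points still depend on the arbitrary starting values, which mirrors the step-by-step induction in the proof.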

Appendix E Hyperparameter Tuning

We selected the hyperparameters shown in Table 4, which optimized the empirical loss $\mathcal{L}^Q + \mathcal{L}^G$ on the validation set, a 30% split of the entire dataset. The parameters $\alpha$ and $\beta$ for the censoring mechanism balance the learning rates of the $G$-parts and the $Q$-parts, because the $G$-parts are expected to be simpler than the $Q$-parts, which involve long-range prediction.

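Table 4: Selected hyperparameters.

| Data | Model | $\tau$ | Embedding dimension | Dropout rate | Hidden size | Number of layers | Number of heads | Learning rate | $\alpha$ | $\beta$ | Number of epochs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Simple Synthetic Data | Deep LTMLE | 10 | 32 | 0 | 64 | 8 | 8 | 1e-04 | 0.1 | 0.05 | 100 |
| Simple Synthetic Data | Deep LTMLE | 20 | 32 | 0 | 64 | 4 | 4 | 5e-04 | 0.01 | 0.05 | 200 |
| Simple Synthetic Data | Deep LTMLE | 30 | 32 | 0.1 | 16 | 4 | 4 | 5e-04 | 0.01 | 0.05 | 400 |
| Simple Synthetic Data | DeepACE | 10 | 32 | 0.3 | 8 | 1 | – | 5e-03 | 0.01 | – | 100 |
| Simple Synthetic Data | DeepACE | 20 | 32 | 0.2 | 4 | 1 | – | 1e-02 | 0.1 | – | 200 |
| Simple Synthetic Data | DeepACE | 30 | 32 | 0.1 | 4 | 2 | – | 5e-03 | 0.05 | – | 100 |
| Complex Synthetic Data | Deep LTMLE | 10 | 16 | 0 | 64 | 4 | 8 | 1e-03 | 0.01 | 0.05 | 100 |
| Complex Synthetic Data | Deep LTMLE | 20 | 32 | 0 | 32 | 4 | 8 | 5e-04 | 0.05 | 0.05 | 100 |
| Complex Synthetic Data | Deep LTMLE | 30 | 32 | 0 | 16 | 4 | 8 | 1e-04 | 0.05 | 0.05 | 400 |
| Complex Synthetic Data | DeepACE | 10 | 32 | 0.3 | 4 | 2 | – | 5e-04 | 0.01 | – | 100 |
| Complex Synthetic Data | DeepACE | 20 | 32 | 0.2 | 4 | 8 | – | 5e-04 | 0.1 | – | 100 |
| Complex Synthetic Data | DeepACE | 30 | 32 | 0.2 | 4 | 2 | – | 5e-04 | 0.1 | – | 100 |
| Real World | Deep LTMLE | 30 | 32 | 0 | 16 | 8 | 4 | 5e-04 | 0.1 | 0.01 | 100 |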

Appendix F Synthetic Data

F.1 Simple Synthetic Data with Continuous Outcome

The process iteratively generates variables $W$, $L_t$, $A_t$, and $Y_t$ over time steps $t = 0, \ldots, \tau-1$, with $W \sim \text{Normal}(0, 1)$. At $t = 0$, $L_0 \sim \text{Normal}(0.1W, 1)$, $A_0 \sim \text{Ber}(\sigma(-0.5W + L_0))$, and $Y_0 \sim \text{Ber}(\sigma(-3 + 0.2W + 0.2L_0 - 2A_0))$.
For $t > 0$, $L_t \sim \text{Normal}(0.1W - 0.1A_{t-1}, 1)$, $A_t \sim \text{Ber}(\sigma(-0.5 + 0.3W + 0.3L_t + 2A_{t-1}))$, and $Y_t = \sigma(-3 + 0.2W + 0.2L_t - 2A_t)$, where $\sigma(x) = (1 + e^{-x})^{-1}$ is the sigmoid function. We set the counterfactual treatment at all time points to 1 and evaluated the counterfactual mean of the outcome under this treatment policy.

F.2 Simple Synthetic Data with Survival Outcome

The process iteratively generates variables $W$, $L_t$, $A_t$, and $Y_t$ over time steps $t = 0, \ldots, \tau-1$, with $W \sim \text{Normal}(0, 1)$. At $t = 0$, $L_0 \sim \text{Normal}(0.1W, 1)$, $A_0 \sim \text{Ber}(\sigma(-0.5W + L_0))$, and $Y_0 \sim \text{Ber}(\sigma(-3 + 0.2W + 0.2L_0 - 2A_0))$.
For $t > 0$, $L_t \sim \text{Normal}(0.1W - 0.1A_{t-1}, 1)$, $A_t \sim \text{Ber}(\sigma(-0.5 + 0.3W + 0.3L_t + 2A_{t-1}))$, and $Y_t \sim \text{Ber}(\sigma(-3 + 0.2W + 0.2L_t - 2A_t))$, with $Y_{t-1} = 1$ implying $Y_t = 1$. Here $\sigma(x) = (1 + e^{-x})^{-1}$ is the sigmoid function. We set the counterfactual treatment at all time points to 1 and evaluated the counterfactual mean of survival under this treatment policy.
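The data-generating process of Section F.2 can be sketched in Python as follows. This is an illustrative reading of the description, not the authors' code: the function name `simulate_simple_survival`, the `always_treat` switch for the static "treat at all time points" policy, and the array layout are assumptions. The second argument of each Normal distribution is 1, so the variance-versus-standard-deviation convention does not matter here.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def simulate_simple_survival(n, tau, seed=0, always_treat=False):
    """Simulate W, L_t, A_t, Y_t for t = 0, ..., tau - 1 (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(0.0, 1.0, size=n)
    L = np.zeros((n, tau))
    A = np.zeros((n, tau), dtype=int)
    Y = np.zeros((n, tau), dtype=int)
    for t in range(tau):
        if t == 0:
            L[:, t] = rng.normal(0.1 * W, 1.0)
            p_a = sigmoid(-0.5 * W + L[:, t])
        else:
            L[:, t] = rng.normal(0.1 * W - 0.1 * A[:, t - 1], 1.0)
            p_a = sigmoid(-0.5 + 0.3 * W + 0.3 * L[:, t] + 2.0 * A[:, t - 1])
        # Either follow the observational mechanism or intervene with A_t = 1.
        A[:, t] = 1 if always_treat else rng.binomial(1, p_a)
        event = rng.binomial(1, sigmoid(-3.0 + 0.2 * W + 0.2 * L[:, t] - 2.0 * A[:, t]))
        # The event is absorbing: Y_{t-1} = 1 implies Y_t = 1.
        Y[:, t] = event if t == 0 else np.maximum(Y[:, t - 1], event)
    return W, L, A, Y

# Monte Carlo approximation of the counterfactual mean of the final outcome.
W, L, A, Y = simulate_simple_survival(n=2000, tau=10, seed=1, always_treat=True)
approx_counterfactual_mean = Y[:, -1].mean()
```

Simulating with `always_treat=True` and a large `n` gives a Monte Carlo reference value for the target parameter, against which estimators fitted on observational draws (`always_treat=False`) can be compared.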

F.3 Complex Synthetic Data with Survival Outcome

First, draw parameters $\alpha_i, \beta_i \sim \mathrm{Normal}((i+1)^{-1}, 0.02)$ and $\gamma_i \sim 2 \cdot \mathrm{Binom}(0.5) - 1$ for $i \in [ht]$, where $h$ is the length of time-dependency, with $h = 1$ corresponding to a Markovian process. Then, draw the errors in the time-dependent variables $\varepsilon^{\ell}_{tj} \sim \mathrm{Normal}(0, 0.1)$ for $t \in [\tau]$ and $j \in [p]$, and the errors in treatment $\varepsilon^{A}_{t1} \sim \mathrm{Normal}(0, 0.2)$ and $\varepsilon^{A}_{t2} \sim \mathrm{Normal}(0, 0.05)$ for $t \in [\tau]$.
For each $t \in [\tau]$, $L_t = \tanh\big(\sum_{k \in [ht]} \alpha_k L_{t-k} + \beta_k \gamma_k (2A_{t-k} - 1)\big) + \varepsilon^{\ell}_{tj}$; then $A_t$ is given by the indicator function $\mathbbm{1}\{(\sigma(s_t +) + \varepsilon^{A}_{t2}) > 0.5\}$, with $s_t = \tan\big(\prod_{j \in [p]} \bar{L}_j + \bar{A}\big) + \varepsilon^{A}_{t1}$.
The outcome $Y_t$ is drawn from a Bernoulli distribution with probability $\sigma(p_t)$, where $p_t = \tan\big(\prod_{j \in [p]} \bar{L}_j - 0.7 \cdot (\bar{A} - 0.5)\big) - 4.5$, and $Y_t = 1$ if $Y_{t-1} = 1$ for $t > 0$. We set the counterfactual treatment policy as $\mathbbm{1}\{\sigma(s_t) > 0.5\}$ for $t \in [\tau]$ and evaluated the counterfactual mean of survival under this policy.

Appendix G Results with Simple Synthetic Data with Survival Outcome

The results of an experiment with the simple synthetic data described in Section F.2 are shown in Table 5. Although LTMLE’s strong performance on simple synthetic data is anticipated, because Markovian dependencies reduce the burden of estimating nuisance parameters, Deep LTMLE remains highly competitive in this setting, closely matching LTMLE’s performance. Our two targeting approaches demonstrated a better bias-variance trade-off for the estimation of the target parameter than the untargeted approach: both bias and standard deviation improved for all values of $\tau$ considered. The targeting step also made a marked difference in coverage probability, bringing it much closer to the nominal 95% than the estimator without targeting.

Table 5: Results from simple synthetic data

| Model | Bias ($\tau=10$) | Bias ($\tau=20$) | Bias ($\tau=30$) | RMSE ($\tau=10$) | RMSE ($\tau=20$) | RMSE ($\tau=30$) | Coverage ($\tau=10$) | Coverage ($\tau=20$) | Coverage ($\tau=30$) | Mean $\hat{\sigma}_n$ ($\tau=10$) | Mean $\hat{\sigma}_n$ ($\tau=20$) | Mean $\hat{\sigma}_n$ ($\tau=30$) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LTMLE (GLM) | 0.0052 | 0.0045 | 0.0021 | 0.0202 | 0.0268 | 0.0308 | 0.95 | 0.94 | 0.93 | 0.02 | 0.03 | 0.03 |
| LTMLE (SL) | 0.0056 | 0.0058 | 0.0061 | 0.0203 | 0.0263 | 0.0311 | 0.91 | 0.93 | 0.91 | 0.02 | 0.02 | 0.03 |
| DeepACE | 0.0213 | 0.0462 | -0.1342 | 0.0266 | 0.0515 | 0.1397 | 1.00 | 1.00 | 1.00 | 0.19 | 0.70 | 0.70 |
| Deep LTMLE | 0.0080 | 0.0133 | 0.0090 | 0.0292 | 0.0569 | 0.0449 | 0.79 | 0.78 | 0.87 | 0.02 | 0.04 | 0.03 |
| Deep LTMLE $\dagger$ | 0.0054 | 0.0070 | 0.0080 | 0.0207 | 0.0350 | 0.0329 | 0.91 | 0.95 | 0.91 | 0.02 | 0.04 | 0.03 |
| Deep LTMLE $\star$ | 0.0053 | 0.0053 | 0.0080 | 0.0207 | 0.0361 | 0.0310 | 0.90 | 0.96 | 0.92 | 0.02 | 0.04 | 0.03 |
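For readers who want to reproduce summaries of the kind reported in Table 5 from their own simulation replicates, a minimal Python sketch follows. It assumes that bias is the signed mean error, that coverage refers to Wald intervals with half-width $1.96\,\hat{\sigma}_n$, and that $\hat{\sigma}_n$ is the estimated standard error of each replicate. These conventions and the function name `summarize_simulation` are assumptions, since the exact definitions are not restated in this appendix.

```python
import numpy as np

def summarize_simulation(estimates, std_errors, truth, z=1.96):
    """Bias, RMSE, interval coverage, and mean standard error over replicates."""
    estimates = np.asarray(estimates, dtype=float)
    std_errors = np.asarray(std_errors, dtype=float)
    errors = estimates - truth
    # A replicate "covers" if the truth lies within estimate +/- z * std_error.
    covered = np.abs(errors) <= z * std_errors
    return {
        "bias": errors.mean(),
        "rmse": np.sqrt(np.mean(errors ** 2)),
        "coverage": covered.mean(),
        "mean_se": std_errors.mean(),
    }

# Toy usage with two made-up replicates (not results from the paper).
summary = summarize_simulation([0.1, 0.3], [0.1, 0.02], truth=0.2)
```

Coverage far above the nominal level together with a large mean standard error would indicate conservative intervals rather than accurate ones.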

G.1 Semi-Synthetic Data

As a compromise, we conducted several additional experiments with semi-synthetic data derived from real-world data, as in previous studies (Bica et al., 2020; Frauen et al., 2023). For this experiment, we used covariates from the Circulatory Risk in Communities Study (CIRCS) and fit an outcome regression given the history through each time point using XGBoost with early stopping. Outcomes were then generated using this fitted regression model. We sampled 1000 observations from the empirical distribution of the covariates $W, L_t, A_t$ and generated $Y_t$ for $t = 1, \ldots, \tau$ with $\tau \in \{10, 20, 30\}$.

Appendix H Extension to Survival Analysis with Censoring

In this section, we describe the extended LTMLE model with censoring used for the real-world application in Section 5.4. We assume the following order of observed nodes: $O=(W=W_0, L_1, A_1, C_1, Y_1, L_2, A_2, C_2, Y_2, \ldots, L_\tau, A_\tau, C_\tau, Y_\tau=Y)$, where $C_t$ are binary censoring nodes with $C_t=1$ indicating that the individual is censored. Our interest is to estimate the risk of the outcome $Y_\tau$, the mortality of the individual. However, because our observation period is long, individuals are at risk of being censored. Censoring $C_t$ is loss to follow-up for administrative reasons, for example, moving to another area or declining to participate in the survey. We assume degeneration of nodes: when we observe a jump in a $Y$ or $C$ node, the process halts and all nodes after the jump remain constant at their last observed values. For example, if $Y_t=1$, then $Y_s=1$, $C_s=0$, $A_{s-1}=A_{t-1}$, and $L_s=L_t$ for all $s>t$.
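The degeneration convention can be sketched as a small preprocessing step. This is an illustrative assumption rather than the authors' code: the function name and the (n, tau) array layout are hypothetical, and for simplicity every node is frozen at its value at the jump time, so the index offset on the treatment node in the example above is not reproduced.

```python
import numpy as np

def degenerate_after_jump(L, A, C, Y):
    """Freeze all nodes after the first jump in Y or C.

    L, A, C, Y: arrays of shape (n, tau); column t - 1 holds the node at time t.
    Returns copies in which every node after the jump carries its value
    at the jump time (a simplifying assumption of this sketch).
    """
    L, A, C, Y = (np.array(x, copy=True) for x in (L, A, C, Y))
    n, tau = Y.shape
    jumped = (Y == 1) | (C == 1)
    for i in range(n):
        hits = np.flatnonzero(jumped[i])
        if hits.size == 0:
            continue  # no event and no censoring for this individual
        t = hits[0]  # first time point with a jump
        for arr in (L, A, C, Y):
            arr[i, t + 1:] = arr[i, t]
    return L, A, C, Y
```

For an individual with $Y = (0, 1, 0, 0)$ this returns $Y = (1$ from the second time point onward$)$ and holds $L$, $A$, and $C$ at their second-time-point values.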

We constructed a Deep LTMLE similar to the one described in Section 4 with this structure. The only difference is an additional component, the censoring mechanism $G^c$, which enters the clever covariate $I_t$ and the loss function:

\begin{align}
I_t(G) &= \prod_{s=1}^{t} \frac{d(g \otimes \mathbbm{1}(C_s = 0))}{d(G^a_t \otimes G^c_t)}(O), \text{ and} \tag{35} \\
\mathcal{L}(Q, V, G) &= \mathcal{L}^{Q}(Q, V) + \alpha \mathcal{L}^{G^a}(G^a, A) + \beta \mathcal{L}^{G^c}(G^c, C), \tag{36}
\end{align}

where $\beta$ is an additional hyperparameter weighting the binary logistic loss of the censoring mechanism. The counterfactual treatment on the censoring process is $\mathbbm{1}(C_t=0)$, meaning suppression of censoring. Estimates of the target parameter and the efficient influence curve (EIC) for different treatment strategies were computed using Deep LTMLE, and average treatment effects (ATEs) and their EICs were computed using the delta method. Based on the estimated EICs of the target parameters at each time point $t$, we constructed simultaneous confidence intervals.
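As a rough illustration of the two computations just described, the sketch below evaluates the clever covariate for a binary treatment and forms a delta-method interval for an ATE. It makes several assumptions not stated in the text: the function names and array layouts are hypothetical, the denominator is evaluated factor by factor at each $s$ in the product, positivity holds so that no denominator is zero, and the interval is a pointwise Wald interval rather than the simultaneous interval used in the paper.

```python
import numpy as np

def clever_covariate(A, C, g_star, G_a, G_c):
    """Cumulative ratio I_t for binary A under a target policy with no censoring.

    A, C: observed binary treatment and censoring nodes, shape (n, tau).
    g_star: P(A_s = 1 | history) under the target policy, shape (n, tau).
    G_a: estimated P(A_s = 1 | history); G_c: estimated P(C_s = 1 | history).
    """
    # Numerator: target policy density at the observed A_s, times 1(C_s = 0).
    num = np.where(A == 1, g_star, 1.0 - g_star) * (C == 0)
    # Denominator: estimated density of the observed A_s and of remaining uncensored.
    # When C_s = 1 the numerator is already zero, so the ratio is zero.
    den = np.where(A == 1, G_a, 1.0 - G_a) * (1.0 - G_c)
    return np.cumprod(num / den, axis=1)

def ate_wald_ci(psi1, eic1, psi0, eic0, z=1.96):
    """Pointwise Wald interval for psi1 - psi0 from per-subject EIC values."""
    ate = psi1 - psi0
    # Delta method: the EIC of a difference is the difference of the EICs.
    eic = eic1 - eic0
    se = np.sqrt(np.var(eic, ddof=1) / eic.shape[0])
    return ate, ate - z * se, ate + z * se
```

A simultaneous band over time points would replace the fixed `z` with a quantile of the maximum of correlated normal variables, estimated from the correlation of the EICs across $t$; that step is omitted here.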