
Learning Decision Policies with Instrumental Variables
through Double Machine Learning

Daqian Shao Department of Computer Science, University of Oxford Ashkan Soleymani Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology Francesco Quinzan Department of Computer Science, University of Oxford Marta Kwiatkowska Department of Computer Science, University of Oxford
Abstract

A common issue in learning decision-making policies in data-rich settings is spurious correlations in the offline dataset, which can be caused by hidden confounders. Instrumental variable (IV) regression, which utilises a key unconfounded variable known as the instrument, is a standard technique for learning causal relationships between confounded action, outcome, and context variables. Most recent IV regression algorithms use a two-stage approach, where a deep neural network (DNN) estimator learnt in the first stage is directly plugged into the second stage, in which another DNN is used to estimate the causal effect. Naively plugging the estimator can cause heavy bias in the second stage, especially when regularisation bias is present in the first stage estimator. We propose DML-IV, a non-linear IV regression method that reduces the bias in two-stage IV regressions and effectively learns high-performing policies. We derive a novel learning objective to reduce bias and design the DML-IV algorithm following the double/debiased machine learning (DML) framework. The learnt DML-IV estimator has a strong convergence rate and $O(N^{-1/2})$ suboptimality guarantees that match those when the dataset is unconfounded. DML-IV outperforms state-of-the-art IV regression methods on IV regression benchmarks and learns high-performing policies in the presence of instruments.

1 Introduction

Recent advances in deep learning (DL) have greatly facilitated the learning of decision-making policies in data-rich settings, but they often lack optimality guarantees. A common issue for learning from offline observational data is the existence of spurious correlations, which are relationships between variables that appear to be causal, but in fact are not. For example, suppose we have aeroplane ticket sales and pricing data in a ticket demand scenario Hartford et al. [2017], and we wish to learn a policy from this offline data that maximises revenue. During holiday season, observational data may contain evidence of a concurrent surge in both ticket sales and prices, which may lead the learning algorithm to learn an incorrect policy based on the belief that higher ticket prices drive higher sales.

Spurious correlations are often caused by hidden confounders Pearl [2000], which are unobserved variables that influence both the actions (or interventions) and the outcome. In the aeroplane ticket example, the occurrence of popular events and holidays serves as a hidden confounder that raises both ticket prices (actions) and sales (outcome). To properly account for these hidden confounders and understand the true causal effect of actions, we need to model the causal (or structural) relationship between the action and the outcome, which is expressed through a causal function. However, learning the causal function in the presence of hidden confounders is known to be challenging and sometimes infeasible Shpitser and Pearl [2008].

A popular approach to deal with hidden confounders is via instrumental variables (IVs) Wright [1928], which are heterogeneous random variables that affect the action directly, but affect the outcome only through the action. These IVs have been used extensively to identify the causal effect of actions in many applications, including econometrics Reiersöl [1945], Angrist and Pischke [2009], drug testing Angrist et al. [1996], and social sciences Angrist [1990]. In the aeroplane ticket example, we can employ supply cost-shifters (e.g., fuel price) as instrumental variables, as their variations are independent of the demand for aeroplane tickets and affect sales solely via ticket prices Blundell et al. [2012].

We focus on the problem of learning the causal function in the presence of hidden confounders using IVs (known as IV regression), in order to learn a decision policy that maximises the expected outcome in this setting (which we refer to as the offline IV bandit problem, described in Section 2.3) and comes with suboptimality guarantees. Two-stage least squares (2SLS) Angrist et al. [1996] is a classical IV regression algorithm, which has been extended to non-linear settings that utilise machine learning (ML) techniques, including deep neural networks (DNNs), to learn the causal function. The use of DNNs allows for greater flexibility in IV regression, as it does not impose strong assumptions on the functional form and can learn directly from data. However, regularisation is often employed to trade off overfitting against the induced regularisation bias, especially for high-dimensional inputs. Both regularisation bias and overfitting may cause heavy bias Chernozhukov et al. [2018] in estimating the causal function when the first stage estimator is naively plugged in, which causes slow convergence of the causal function estimator.

Double/Debiased Machine Learning Chernozhukov et al. [2018] (DML) is a statistical technique that provides an unbiased estimator with convergence rate guarantees for general two-stage regressions. DML relies on having a Neyman orthogonal Neyman and Scott [1965] score function to deal with regularisation bias, and uses cross-fitting, that is, an efficient form of (randomised) data splitting, to tackle overfitting bias. However, the use of DML for IV regression that utilises neural networks has not been explored.

In this work, we propose DML-IV, a novel IV regression algorithm that adopts the DML framework to provide an unbiased estimation of the causal function with fast convergence rate guarantees. We derive a novel Neyman orthogonal score for IV regression, and design a cross-fitting regime such that, under mild regularity conditions, our estimator is guaranteed to converge at the rate of $N^{-1/2}$, where $N$ is the sample size. We then extend DML-IV to solve the offline IV bandit problem, where we derive a policy from the DML-IV estimator and provide an $O(N^{-1/2})$ suboptimality bound with high probability that matches the suboptimality bounds of unconfounded offline bandit algorithms ** et al. [2021], Nguyen-Tang et al. [2022]. Finally, we evaluate DML-IV on multiple benchmarks for IV regression and offline IV bandits, where superior results are demonstrated compared to state-of-the-art (SOTA) methods.

Novel Contributions.

  • We propose DML-IV, a novel IV regression algorithm that leverages the DML framework to provide unbiased estimation of the causal function.

  • We derive a novel, Neyman orthogonal, score function for IV regression, and design a cross-fitting regime for the DML-IV estimator to mitigate the bias.

  • We provide the first convergence rate guarantees for IV regression algorithms that use DL. Namely, we show that DML-IV converges at an $N^{-1/2}$ rate, leading to $O(N^{-1/2})$ suboptimality for the derived policy.

  • On a range of IV regression and offline IV bandit benchmarks, including two real-world datasets, we experimentally demonstrate that DML-IV outperforms other SOTA methods.

1.1 Related Works

IV Regression. A number of approaches have been developed to extend the two-stage least squares (2SLS) algorithm Angrist et al. [1996] to non-linear settings. A common approach is to use non-linear basis functions, such as Sieve IV Newey and Powell [2003], Blundell et al. [2007], Chen and Christensen [2018], Kernel IV Singh et al. [2019] and Dual IV Muandet et al. [2020]. These methods enjoy theoretical benefits, but their flexibility is limited by the set of basis functions. More recently, DFIV Xu et al. [2020] proposed to use basis functions parameterised by DNNs, which remove the restrictions on the functional form. Another approach is to perform stage 1 regression through conditional density estimation Darolles et al. [2011], where DeepIV Hartford et al. [2017] adopts DNNs to perform these regressions. DeepGMM Bennett et al. [2019] is a DNN-based method that is inspired by the Generalised Method of Moments (GMM) to find a causal function that ensures the regression residual and the instrument are independent. The learning procedure of DeepGMM does not offer stability comparable to 2SLS approaches, as it is based on solving a smooth zero-sum game, similar to training Generative Adversarial Networks Goodfellow et al. [2014]. Our approach allows DNNs in both stages and compares favourably to Deep IV, DeepGMM, Kernel IV and DFIV.

Double Machine Learning (DML). DML was originally proposed for semiparametric regression Robinson [1988]; it relies on the derivation of a score function, which describes the regression problem that is Neyman orthogonal Neyman and Scott [1965]. DML was later extended by adopting DNNs for generalised linear regressions Chernozhukov et al. [2021]. Its strength is that it provides unbiased estimations for causal effects when the causal effect is identifiable Jung et al. [2021] or there are no hidden confounders Chernozhukov et al. [2022b]. DML offers strong ($N^{-1/2}$, where $N$ is the size of the dataset) guarantees on the convergence rate, even in the presence of high-dimensional input.

There are previous works on combining DML with IV regression, but they are mainly focused on linear and partially linear models. Belloni et al. [2012] propose a method to use Lasso and then Post-Lasso methods for the first stage estimation of linear IV to estimate the optimal instruments. To avoid selection biases, Belloni et al. [2012] leverage techniques from weak identification robust inference. In addition, Chernozhukov et al. [2015] propose a Neyman-orthogonalised score for the linear IV problem with control and instrument selection, to potentially be robust to regularisation and selection biases of Lasso as a model selection method. Neyman orthogonality for partially linear models with instruments was primarily discussed in the work of Chernozhukov et al. [2018]. Furthermore, DML techniques for identifying the local average treatment effects (LATE) for nonlinear models with a binary instrument and treatment (action) have been explored before [Chernozhukov et al., 2024]. For additional discussion, we refer to the book [Chernozhukov et al., 2024].

DML for semiparametric models Chernozhukov et al. [2022a], Ichimura and Newey [2022] has been previously applied to solve the nonparametric IV (NPIV) problem. However, their methods require that the average moment of the Neyman orthogonal score is affine (linear) in the nuisance parameters. Therefore, when applied to solve NPIV, functional assumptions regarding the IV set and the residual function were made. Such assumptions are not required in our work since we are considering a different problem setting and their Neyman orthogonal score is very different from ours. To the best of our knowledge, there is no work that adopts the DML framework for IV regression with DNNs.

Causality. Doubly robust estimation for causality problems has predominantly revolved around the estimation of average treatment effects (ATE) [Robins et al., 1994, Funk et al., 2011, Benkeser et al., 2017, Bang and Robins, 2005, Słoczyński and Wooldridge, 2018]. Recently, there has been a surge in doubly robust identification of causal structures beyond the ATE settings. Soleymani et al. [2022], Quinzan et al. [2023] focus on finding direct causes of the target variable by orthogonalised scores. Angelis et al. [2023] extend this line for testing Granger causality in the time-series domain. In this work, we focus on doubly robust estimation of the counterfactual prediction function, a central problem in the field of causal inference, which could be of independent interest beyond the IV settings.

Offline Bandit. Most bandit algorithms assume unconfoundedness (e.g., Nguyen-Tang et al. [2022], Subramanian and Ravindran [2022]). For bandit algorithms that consider hidden confounders, most of them work in the online setting, aiming to learn the best policy from scratch using the least amount of online interactions Zhang and Bareinboim [2020], Subramanian and Ravindran [2022], or with the help of a pre-collected dataset Lu et al. [2020]. Few works are dedicated to the offline confounded bandit, where only the offline dataset is provided, as it is essentially a causal inference problem. However, offline reinforcement learning (RL) with hidden confounders has been studied. Pace et al. [2023] develop a pessimistic algorithm based on the Delphic uncertainty due to hidden confounders, while other methods adopt IV regression in combination with value iteration Liao et al. [2021] and Actor-Critic methods Li et al. [2021] to learn policies in offline RL. Offline policy evaluation (OPE) under hidden confounders has also been studied. Using IVs, doubly robust estimators for policy values are derived through efficient influence functions Xu et al. [2023] and marginalised importance sampling Fu et al. [2022]. Bennett et al. [2021] solve OPE under an infinite-horizon ergodic MDP with hidden confounders using states and actions as proxies for the hidden confounders to identify policy values. Chen et al. [2021] consider the OPE problem in a standard unconfounded MDP, where they view the previous (action, state) pair as the instrument for the Bellman residual estimation problem of the current (action, state) pair and directly apply existing IV regression methods to estimate the Q value. We consider the setting of the offline confounded bandit with IVs, for which we leverage DML to obtain convergence and suboptimality guarantees.

2 Preliminaries

2.1 Notation

We use uppercase letters such as $C$ to denote random variables. An observed realisation of $C$ is denoted by a lowercase letter $c$. We abbreviate $\mathds{E}[R\lvert C=c]$, a realisation of the conditional expectation $\mathds{E}[R\lvert C]$, as $\mathds{E}[R\lvert c]$. $[N]$ denotes the set $\{1,...,N\}$ for $N\in\mathbb{N}$. We write $\mathds{E}[R\lvert do(A=a)]$ for the expectation of $R$ under do intervention Pearl [2000] of setting $A=a$. We use $\lVert\cdot\rVert_{p}$ to denote the functional norm, defined as $\lVert f\rVert_{p}\coloneqq\mathds{E}[\lvert f(C)\rvert^{p}]^{1/p}$, where the measure is implicit from the context. For a function $f$, we use $f_{0}$ to denote the true function and $\hat{f}$ an estimator of the true function. We use $O$ and $o$ to denote big-O and little-o notations Weisstein [2023] respectively.

2.2 Contextual IV Setting

We begin with a description of the contextual IV setting Hartford et al. [2017] that we use in this paper. We observe an action $A\in\mathcal{A}\subseteq\mathbb{R}^{d_{A}}$, a context $C\in\mathcal{C}\subseteq\mathbb{R}^{d_{C}}$, an instrumental variable (IV) $Z\in\mathcal{Z}\subseteq\mathbb{R}^{d_{Z}}$ and an outcome $R\in\mathbb{R}$, where there exist unobserved confounders that affect all of $A$, $C$ and $R$ through a hidden variable (or noise) $\epsilon$. The IV directly affects the action $A$, does not directly affect the outcome $R$ and is not correlated with the hidden confounder $\epsilon$. These causal relationships are illustrated in Fig. 1 and are represented by the following structural causal model Pearl [2000]:

$$R\coloneqq f_{r}(C,A)+\epsilon,\quad\mathds{E}[\epsilon]=0,\quad\mathds{E}[\epsilon\lvert A,C]\neq 0, \tag{1}$$

where $f_{r}$ is an unknown, continuous, and potentially non-linear causal function, and $\mathds{E}[\epsilon\lvert A,C]$ is not necessarily zero. Denote the set of observations $(c_{i},z_{i},a_{i},r_{i})$, where $i\in[N]$, generated from this model as the offline dataset $\mathcal{D}$. The goal of this paper is to learn the counterfactual prediction function Hartford et al. [2017],

$$h_{0}(C,A):=f_{r}(C,A)+\mathds{E}[\epsilon\lvert C]=\mathds{E}[R\lvert do(A),C],$$

which is the expected outcome under $do(A)$ intervention conditional on $C$, from the offline dataset $\mathcal{D}$. This task is also known as IV regression, and we aim to estimate $h_{0}$ using a DNN. The term $\mathds{E}[\epsilon\lvert C]$ is typically nonzero (in the setting where $\mathds{E}[\epsilon\lvert C]=0$ is assumed Bennett et al. [2019], Xu et al. [2020], $h_{0}=f_{r}$ and all our results apply), but learning $h_{0}$ still allows us to compare between different actions when given a context, as $h_{0}(C,a_{1})-h_{0}(C,a_{2})=f_{r}(C,a_{1})-f_{r}(C,a_{2})$ for all $a_{1},a_{2}\in\mathcal{A}$, and in particular, $\operatorname*{arg\,max}_{a}h_{0}(C,a)=\operatorname*{arg\,max}_{a}f_{r}(C,a)$.

Generally, $h_{0}$ is allowed to be infinite-dimensional, as commonly seen in nonparametric IV literature Newey and Powell [2003]. We also allow $h_{0}$ to be infinite-dimensional for the Neyman orthogonal score introduced in Section 3.1, but later, in Section 3.2, we restrict $h_{0}$ to be finite-dimensional and parameterised to obtain the theoretical results of the convergence rate and the suboptimality bound of $O(N^{-1/2})$.

The challenge of learning $h_{0}$ from $\mathcal{D}$ is that $\mathds{E}[\epsilon\lvert C,A]\neq 0$, which reflects the existence of hidden confounders that obscure the true causal effect. It has been shown Bareinboim and Pearl [2012] that we cannot learn the causal effect of actions in the presence of hidden confounders without structural assumptions. Fortunately, IVs enable the identification of $h_{0}$ if the following assumptions hold:

Assumption 2.1.

(a) $\epsilon$ is additive to $R$ and $\mathds{E}[\epsilon]=0$; (b) $Z\perp\!\!\!\perp\epsilon\mid C$; and (c) $\mathds{P}(A\lvert C,Z)$ is not constant in $Z$.

Intuitively, Assumption 2.1 (a) and (b), introduced by Newey and Powell [2003], are known as the exclusion restriction, and require that the instrument $Z$ is uncorrelated with the hidden confounder $\epsilon$. Assumption 2.1 (c), known as the relevance condition, ensures that $Z$ induces variation in the action and should be satisfied by the data generation policy. These assumptions are standard for the IV setting Newey and Powell [2003], Xu et al. [2020], Singh et al. [2019], and allow for the minimal condition to identify the causal effect.

Figure 1: The causal graph of the contextual IV setting, where $R=f_{r}(C,A)+\epsilon$ and $Z$ is an instrumental variable that affects $R$ only through $A$.
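To make the role of the hidden confounder concrete, the following is a minimal simulation sketch of a toy structural causal model in the spirit of Eq. 1. The specific model (no context, scalar Gaussian variables, the coefficients 2 and 3, and the function names `cov` and `simulate`) is an assumption chosen for illustration and is not taken from the paper. It checks empirically that the instrument is uncorrelated with the noise and correlated with the action, while a naive regression of the outcome on the action recovers the wrong sign of the causal effect.

```python
import random


def cov(u, v):
    """Sample covariance of two equally long sequences."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    return sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v)) / n


def simulate(n, seed=0):
    """Toy SCM (assumed for illustration): A = Z + 2*eps, R = 1 - A + 3*eps."""
    rng = random.Random(seed)
    z = [rng.gauss(0.0, 1.0) for _ in range(n)]    # instrument Z
    eps = [rng.gauss(0.0, 1.0) for _ in range(n)]  # hidden confounder, E[eps] = 0
    a = [zi + 2.0 * ei for zi, ei in zip(z, eps)]  # action, confounded by eps
    # Outcome: the true causal slope with respect to the action is -1.
    r = [1.0 - ai + 3.0 * ei for ai, ei in zip(a, eps)]
    return z, eps, a, r


z, eps, a, r = simulate(20000)

# Naive regression slope of R on A, which ignores the confounder.
naive_slope = cov(a, r) / cov(a, a)
# Empirical analogue of Z being independent of eps (should be close to 0).
instrument_noise_cov = cov(z, eps)
# Empirical analogue of the relevance condition (should be clearly non-zero).
instrument_action_cov = cov(z, a)

print("naive slope:", naive_slope)
print("cov(Z, eps):", instrument_noise_cov)
print("cov(Z, A):", instrument_action_cov)
```

In this toy model the population value of the naive slope is cov(A, R) / var(A) = 1/5 = 0.2, which is positive even though the causal slope is -1. This mirrors the ticket example, where confounded data suggest that higher prices increase sales.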

2.3 Offline IV Bandit

The learnt estimator of $h_{0}$ from the offline dataset $\mathcal{D}$ can be used to solve the offline bandit problem in the contextual IV setting Zhang et al. [2022], that is, to identify a (deterministic) policy $\pi:\mathcal{C}\rightarrow\mathcal{A}$ that maximises the value $V(\pi)\coloneqq\mathds{E}_{c\sim\mathds{P}_{\textrm{test}}}[R\lvert do(A=\pi(c)),c]=\mathds{E}_{c\sim\mathds{P}_{\textrm{test}}}[h_{0}(c,\pi(c))]$, which is the expected outcome when performing actions following $\pi$. Here, $\mathds{P}_{\textrm{test}}$ is a test context distribution that can potentially differ from the distribution of $\mathcal{D}$. The optimal policy $\pi^{*}$ should satisfy $V(\pi^{*})=\max_{\pi}V(\pi)$, and suboptimality is defined as $\textrm{subopt}(\pi)\coloneqq V(\pi^{*})-V(\pi)$. We see that the optimal policy $\pi^{*}$ can be retrieved from $h_{0}$ by selecting $\pi^{*}(c)=\operatorname*{arg\,max}_{a\in\mathcal{A}}h_{0}(c,a)$.
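The definitions of the greedy policy, the value and the suboptimality can be sketched in a few lines. The example below is only an illustration under stated assumptions: the action set is replaced by a finite grid, the test distribution by a fixed list of contexts, and both the "true" function `h0` and the biased estimate `h_hat` are hypothetical quadratic functions invented for this sketch.

```python
def greedy_policy(h, actions):
    """Return pi(c) = argmax over a finite candidate set of h(c, a)."""
    def pi(c):
        return max(actions, key=lambda a: h(c, a))
    return pi


def value(h0, pi, contexts):
    """Monte Carlo estimate of V(pi) = E_c[h0(c, pi(c))] over test contexts."""
    return sum(h0(c, pi(c)) for c in contexts) / len(contexts)


# Hypothetical ground truth: the best action equals the context.
def h0(c, a):
    return -(a - c) ** 2


# Hypothetical biased estimate: its maximiser is shifted by 0.1.
def h_hat(c, a):
    return -(a - c - 0.1) ** 2


actions = [i / 20 for i in range(-40, 41)]  # grid on [-2, 2] with step 0.05
contexts = [-1.0, -0.5, 0.0, 0.5, 1.0]      # stand-in for samples from P_test

pi_star = greedy_policy(h0, actions)        # optimal policy derived from h0
pi_hat = greedy_policy(h_hat, actions)      # policy derived from the estimate

# subopt(pi_hat) = V(pi_star) - V(pi_hat)
subopt = value(h0, pi_star, contexts) - value(h0, pi_hat, contexts)
print("suboptimality of the estimated policy:", subopt)
```

Because the estimated maximiser is off by 0.1 in every context and the hypothetical `h0` is quadratic, the suboptimality here is approximately 0.1 squared, which shows how estimation error in the causal function translates into a loss in policy value.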

2.4 Two-Stage IV Regression

In order to identify $h_{0}$, a key observation Newey and Powell [2003] is that, by taking the expectation on both sides of Eq. 1 conditional on $(C,Z)$, we have

$$\begin{aligned}
\mathds{E}[R\lvert C,Z] &= \mathds{E}\Big[f_{r}(C,A)+\mathds{E}[\epsilon\lvert C]\,\Big\lvert\,C,Z\Big] \\
&= \mathds{E}[h_{0}(C,A)\lvert C,Z] \\
&= \int h_{0}(C,A)\,\mathds{P}(A\lvert C,Z)\,dA,
\end{aligned} \tag{2}$$

where the expectation $\mathds{E}[R\lvert C,Z]$ and the distribution $\mathds{P}(A\lvert C,Z)$ are both observable. However, solving this equation analytically is ill-posed Nashed and Wahba [1974]. This is an inverse problem for definite integrals that requires the derivation of a function inside the definite integral based on numerical integration values, which is thus not solvable analytically. Recent IV regression methods instead estimate $\hat{h}$ in some space of continuous functions $\mathcal{H}$ by solving the following optimisation problem with a two-stage approach:

$$\min_{h\in\mathcal{H}}\mathds{E}\big[(R-\mathds{E}[h(C,A)\lvert C,Z])^{2}\big]. \tag{3}$$

In the first stage, the conditional expectation $\mathds{E}[h(C,A)\lvert c,z]$ is learnt as a function of $(c,z)$ using observations, and in the second stage, the loss in Eq. 3 is minimised using the estimator obtained in stage 1. In both stages, linear regression or parametric ML methods, such as DNNs, can be used to learn the true functions.
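As a minimal sketch of the two-stage procedure, consider the linear special case without context, where $h(a)=\theta_{0}+\theta_{1}a$. Then $\mathds{E}[h(A)\lvert Z]=\theta_{0}+\theta_{1}\mathds{E}[A\lvert Z]$, so the first stage reduces to regressing the action on the instrument and the second stage to regressing the outcome on the first-stage prediction, which is classical 2SLS. The data-generating model, the coefficients and the helper names `fit_line` and `two_stage_least_squares` below are assumptions made for illustration; this is not the DML-IV algorithm.

```python
import random


def fit_line(x, y):
    """Ordinary least squares for y ~ intercept + slope * x with scalar x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def two_stage_least_squares(z, a, r):
    """Linear two-stage IV regression; returns (intercept, slope) of h."""
    # Stage 1: learn E[A | Z] as a linear function of the instrument.
    b0, b1 = fit_line(z, a)
    a_hat = [b0 + b1 * zi for zi in z]
    # Stage 2: minimise the squared loss of Eq. 3 with the stage 1 fit plugged in.
    return fit_line(a_hat, r)


# Toy confounded data (assumed): the true causal function is h0(a) = 1 - a.
rng = random.Random(0)
n = 20000
z = [rng.gauss(0.0, 1.0) for _ in range(n)]
eps = [rng.gauss(0.0, 1.0) for _ in range(n)]  # hidden confounder
a = [zi + 2.0 * ei for zi, ei in zip(z, eps)]
r = [1.0 - ai + 3.0 * ei for ai, ei in zip(a, eps)]

intercept, slope = two_stage_least_squares(z, a, r)
_, naive_slope = fit_line(a, r)  # direct regression, biased by confounding
print("2SLS estimate of h:", intercept, slope)
print("naive slope:", naive_slope)
```

With flexible learners such as DNNs in place of `fit_line`, the same plug-in structure applies, and any bias of the first-stage fit is passed directly into the second-stage loss.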

2.5 Double Machine Learning

DML is a parameter estimation method that can mitigate certain biases in the learning process [Chernozhukov et al., 2018, 2021, 2022b], which has been extended to work with ML methods, including DL. DML considers the problem of estimating a function of interest $h$ as a solution to an equation of the form

$$\mathds{E}[\psi(\mathcal{D};h,\eta)]=0, \tag{4}$$

where $\psi$ is referred to as a score function. Here, $\eta$ is a nuisance parameter, which is of no direct interest, but must be estimated to obtain $h$. DML provides a set of tools to derive an unbiased estimator of $h$ with convergence rate guarantees, even when the estimator of the nuisance parameter $\eta$ suffers from regularisation, overfitting and other types of bias present in the training of ML models, which typically cause slow convergence when learning $h$.

In order to estimate $h$, DML reduces biases by using score functions $\psi$ that are Neyman orthogonal Neyman and Scott [1965] in $\eta$, which requires the Gateaux derivative to vanish, that is,

$$\frac{\partial}{\partial r}\Big\lvert_{r=0}\mathds{E}[\psi(\mathcal{D};h_{0},\eta_{0}+r\eta)]=0, \tag{5}$$

for all $\eta$. Here, $h_{0}$ and $\eta_{0}$ are the true parameters that minimise the expected score, that is, $\mathds{E}[\psi(\mathcal{D};h_{0},\eta_{0})]=0$. Intuitively, the condition in Eq. 5 is met if small changes of the nuisance parameter do not significantly affect the score function around the true parameter $h_{0}$. Neyman orthogonality is key in DML, as it allows fast convergence guarantees for $h$, even if the estimator for the nuisance parameter $\eta$ is biased. For score functions that are Neyman orthogonal, we define DML with $K$-fold cross-fitting as follows.

Definition 2.2 (DML, Definition 3.2 [Chernozhukov et al., 2018]).

Given a dataset $\mathcal{D}$ of $N$ observations, consider a score function $\psi$ as in Eq. 4, and suppose that $\psi$ is Neyman orthogonal, that is, it satisfies Eq. 5. Take a $K$-fold random partition $\{I_{k}\}^{K}_{k=1}$ of observation indices $[N]$, each with size $n=N/K$, and let $\mathcal{D}_{I_{k}}$ be the set of observations $\{\mathcal{D}_{i}:i\in I_{k}\}$. Furthermore, define $I^{c}_{k}\coloneqq[N]\setminus I_{k}$ for each fold $k$, and construct estimators $\hat{\eta}_{k}$ of the nuisance parameter using $\mathcal{D}_{I^{c}_{k}}$. Then, construct an estimator $\hat{h}$ as a solution to the equation

1Kk=1K𝔼^k[ψ(𝒟Ik;h^,η^k)]=0,1𝐾superscriptsubscript𝑘1𝐾subscript^𝔼𝑘delimited-[]𝜓subscript𝒟subscript𝐼𝑘^subscript^𝜂𝑘0\displaystyle\frac{1}{K}\sum_{k=1}^{K}\hat{\mathds{E}}_{k}[\psi(\mathcal{D}_{I% _{k}};\hat{h},\hat{\eta}_{k})]=0,divide start_ARG 1 end_ARG start_ARG italic_K end_ARG ∑ start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT over^ start_ARG blackboard_E end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT [ italic_ψ ( caligraphic_D start_POSTSUBSCRIPT italic_I start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_POSTSUBSCRIPT ; over^ start_ARG italic_h end_ARG , over^ start_ARG italic_η end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) ] = 0 , (6)

where 𝔼^ksubscript^𝔼𝑘\hat{\mathds{E}}_{k}over^ start_ARG blackboard_E end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT is the empirical expectation over 𝒟Iksubscript𝒟subscript𝐼𝑘\mathcal{D}_{I_{k}}caligraphic_D start_POSTSUBSCRIPT italic_I start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_POSTSUBSCRIPT.

In the definition above, $\hat{h}$ is defined as a solution to Eq. 6. In practice, however, finding an exact solution may not be feasible. To circumvent this problem, we can also define the estimator of interest $\hat{h}$ as an $\epsilon_N$-approximate solution to Eq. 6, where $\epsilon_N=O(N^{-1/2})$, which allows for a small optimisation error.
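The index bookkeeping in Definition 2.2 can be sketched as follows. This is a minimal illustration assuming NumPy; the function name `kfold_partition` and the `seed` argument are hypothetical and do not come from the paper. When $N$ is not divisible by $K$, the fold sizes here differ by at most one, which departs slightly from the exact $n=N/K$ in the definition.

```python
import numpy as np

def kfold_partition(n_obs, n_folds, seed=0):
    """Return a list of (I_k, I_k^c) index arrays for a random K-fold partition.

    I_k   : indices of fold k (used to evaluate the score)
    I_k^c : all remaining indices (used to fit the nuisance estimator eta_hat_k)
    """
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_obs)            # random order of 0..n_obs-1
    folds = np.array_split(perm, n_folds)    # fold sizes differ by at most 1
    all_idx = np.arange(n_obs)
    return [(np.sort(f), np.setdiff1d(all_idx, f)) for f in folds]
```

Each pair keeps the fold used to evaluate the score disjoint from the data used to fit the corresponding nuisance estimator, which is the property that cross-fitting relies on.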

3 DML-IV Algorithm

We now present the main contributions of this paper. The key to our results is the DML-IV algorithm, a novel two-stage IV regression algorithm utilising DNNs in both stages that provides guarantees on the convergence rate by leveraging the DML framework (see Section 2.5). The DML-IV estimator is then utilised to solve an offline IV bandit (see Section 2.3) by retrieving a deterministic policy with suboptimality guarantees that match those of the unconfounded bandit.

Firstly, we remark that, in order to estimate the counterfactual prediction function $h_0$ with convergence rate guarantees, we need a Neyman orthogonal score. We let $g_0(h,c,z)\coloneqq\mathbb{E}[h(C,A)\mid c,z]$ and let $\mathcal{G}$ be some function space that includes $g_0$ and its potential estimators $\hat{g}$. Unfortunately, the standard score (or loss) function for two-stage IV regression, $\ell=(R-g(h,c,z))^2$ in Eq. 3, is not Neyman orthogonal (details in Section B). This means that small misspecifications or bias in $g$ may lead to significant changes to this loss function, and there are no guarantees on the convergence rate if the first-stage estimator $\hat{g}$ is naively plugged into the loss to estimate $h_0$. To address this, we first derive a novel Neyman orthogonal score function for the IV regression problem and then design a DML algorithm with K-fold cross-fitting adapted to the IV regression problem.

3.1 Neyman Orthogonal Score

We first derive a novel Neyman orthogonal score for learning $h_0$ in the contextual IV setting. Constructing a Neyman orthogonal score usually involves estimating additional nuisance parameters Chernozhukov et al. [2018] and adding terms to the original score function to debias it, so we first select the relevant quantities that should be estimated as nuisance parameters. Following two-stage IV regression approaches Hartford et al. [2017], estimating $g_0$ is essential for identifying $h_0$, so we will estimate it as a nuisance parameter. We found that, by additionally estimating $s_0(c,z)\coloneqq\mathbb{E}[R\mid c,z]$ inside some function space $\mathcal{S}$, we can construct a new score function

$$\psi(\mathcal{D};h,(s,g))=(s(c,z)-g(h,c,z))^2, \tag{7}$$

by replacing $R$ in the standard score with $s(c,z)$. Here, the nuisance parameters are $\eta=(s,g)$. We see that $\psi$ is a valid score function since $\mathbb{E}[\psi(\mathcal{D};h_0,(s_0,g_0))]=0$ with the true functions $(s_0,g_0)$ by Eq. 2. The next theorem shows that our score function is in fact Neyman orthogonal, i.e., that its Gateaux derivative with respect to the nuisance parameters vanishes at $(h_0,(s_0,g_0))$; the proof is deferred to Section C.1.

Theorem 3.1.

The score function $\psi(\mathcal{D};h,(s,g))=(s(c,z)-g(h,c,z))^2$ obeys the Neyman orthogonality conditions at $(h_0,(s_0,g_0))$.

This Neyman orthogonal score function is abstract, in the sense that it allows for general estimation methods for $g_0$ and $s_0$, as long as they satisfy certain convergence conditions, which are introduced in the next section.
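The following is an informal sketch of why the Gateaux derivative vanishes; it is not the proof from Section C.1. It assumes that differentiation and expectation can be exchanged, and it uses the identity $s_0(c,z)=g_0(h_0,c,z)$, which is what the validity condition $\mathbb{E}[\psi(\mathcal{D};h_0,(s_0,g_0))]=0$ amounts to for a squared score. The path $\eta_r$ and the direction $(s-s_0,\,g-g_0)$ are notation introduced here for illustration.

```latex
% Perturb the nuisance parameters along a path through the truth:
%   \eta_r = (s_0 + r(s - s_0),\; g_0 + r(g - g_0)), \quad r \in [0,1).
\begin{align*}
\frac{\partial}{\partial r}\,
  \mathbb{E}\big[\psi(\mathcal{D}; h_0, \eta_r)\big]\Big|_{r=0}
&= \frac{\partial}{\partial r}\,
  \mathbb{E}\Big[\big(s_0(c,z) - g_0(h_0,c,z)
     + r\,[(s-s_0)(c,z) - (g-g_0)(h_0,c,z)]\big)^2\Big]\Big|_{r=0} \\
&= 2\,\mathbb{E}\Big[\underbrace{\big(s_0(c,z) - g_0(h_0,c,z)\big)}_{=\,0}
     \big((s-s_0)(c,z) - (g-g_0)(h_0,c,z)\big)\Big] \\
&= 0 .
\end{align*}
```

By contrast, the standard score $(R-g(h,c,z))^2$ has residual $R-g_0(h_0,c,z)$ at the truth, which is not identically zero, so the same argument does not go through for it.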

3.2 Learning Causal Effects through DML

  Input: Dataset $\mathcal{D}$ of size $N$, number of folds $K$ for cross-fitting, mini-batch size $n_b$
  Output: The DML-IV estimator $h_{\hat{\theta}}$
  Get a partition $(I_k)_{k=1}^{K}$ of dataset indices $[N]$
  for $k=1$ to $K$ do
     $I^c_k\coloneqq[N]\setminus I_k$
     Learn $\hat{s}_k$ and $\hat{g}_k$ using $\{\mathcal{D}_i : i\in I^c_k\}$
  end for
  Initialise $h_{\hat{\theta}}$
  repeat
     for $k=1$ to $K$ do
        Sample $n_b$ data $(c_i^k,z_i^k)$ from $\{\mathcal{D}_i : i\in I_k\}$
        $\mathcal{L}=\hat{\mathbb{E}}_{(c_i^k,z_i^k)}\left[(\hat{s}_k(c,z)-\hat{g}_k(h_\theta,c,z))^2\right]$
        Update $\hat{\theta}$ to minimise loss $\mathcal{L}$
     end for
  until convergence
Algorithm 1 DML-IV with K-fold cross-fitting

With the Neyman orthogonal score, we now introduce DML-IV. While the DML-IV algorithm does not require any assumptions on $h$, we assume that $h$ is finite-dimensional and parameterised for the theoretical analysis of DML-IV. Let $h_0=h_{\theta_0}$ and let $\Theta\subseteq\mathbb{R}^{d_\theta}$ be a compact space of parameters of $h$, where the true parameter $\theta_0\in\Theta$ is in the interior of $\Theta$, and $\mathcal{H}\coloneqq\{h_\theta:\theta\in\Theta\}$ is the function space of $h$. The procedure of the DML-IV algorithm for estimating $h_0$ is described in Algorithm 1. Given a dataset $\mathcal{D}$ of size $N$, we split the dataset using a random partition $\{I_k\}_{k=1}^{K}$ of dataset indices $[N]$ such that the size of each fold $I_k$ is $N/K$.

In the first stage of DML-IV, for each fold $k\in[K]$, we learn $\hat{s}_k$ and $\hat{g}_k$ using data $\mathcal{D}_{I^c_k}$ with indices $I^c_k\coloneqq[N]\setminus I_k$. The estimator $\hat{s}_k\approx\mathbb{E}[R\mid C,Z]$ can be learnt through standard supervised learning using a neural network with inputs $(C,Z)$ and label $R$. For $\hat{g}_k$, we follow Hartford et al. [2017] to estimate $F_0(A\mid C,Z)$, the conditional distribution of $A$ given $(C,Z)$, with $\hat{F}$, and then estimate $\hat{g}$ via

$$\hat{g}(h,c,z)=\sum_{\dot{A}\sim\hat{F}(A\mid C,Z)}h(C,\dot{A})\approx\int h(C,A)\hat{F}(A\mid C,Z)\,dA\approx\mathbb{E}[h(C,A)\mid c,z].$$

If the action space is discrete, $\hat{F}$ is a categorical model, e.g., a DNN with softmax output. For a continuous action space, a mixture of Gaussian models is adopted to estimate the distribution $F_0(A\mid C,Z)$, where a DNN is used to predict the means and standard deviations of the Gaussian distributions.
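For the discrete-action case, the expectation defining $\hat{g}$ reduces to a probability-weighted sum over actions. The sketch below illustrates this, assuming NumPy and assuming that the categorical model $\hat{F}$ has already been evaluated into a matrix of action probabilities. The name `g_hat_discrete` and its argument layout are hypothetical, not the authors' implementation.

```python
import numpy as np

def g_hat_discrete(h, c, action_probs, actions):
    """Estimate E[h(C, A) | c, z] as sum_a F_hat(a | c, z) * h(c, a).

    h            : callable (c, a) -> array of shape (n,)
    c            : array of shape (n, d_c), contexts
    action_probs : array of shape (n, n_actions); row i is F_hat(. | c_i, z_i)
    actions      : array of shape (n_actions,), the discrete action values
    """
    # Evaluate h at every candidate action for every context: shape (n, n_actions).
    values = np.stack([h(c, np.full(len(c), a)) for a in actions], axis=1)
    # Weight each action's value by its estimated conditional probability.
    return np.sum(action_probs * values, axis=1)
```

For a continuous action space, the analogous step would average $h$ over samples drawn from the fitted mixture model instead of summing over a finite action set.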

In the second stage of DML-IV, we estimate $\hat{\theta}$ using our Neyman orthogonal score function $\psi$ in Eq. 7. The key here is to optimise $\hat{\theta}$ with data from the $k$-th fold using nuisance parameters $\hat{s}_k$, $\hat{g}_k$ that are trained with data $\mathcal{D}_{I^c_k}$, the complement of the data from the $k$-th fold. This is important to fully debias the estimator $\hat{\theta}$. We alternate between the $K$ folds while sampling a mini-batch $(c_i^k,z_i^k)$ of size $n_b$ from each fold $k$ of the dataset to update $\hat{\theta}$ by minimising the empirical loss on the mini-batch following our Neyman orthogonal score $\psi$,

$$\hat{\mathbb{E}}_{(c_i^k,z_i^k)}\left[(\hat{s}_k(c,z)-\hat{g}_k(h_\theta,c,z))^2\right]=\sum_{(c_i^k,z_i^k)}\frac{1}{n_b}\left((\hat{s}_k(c,z)-\hat{g}_k(h_\theta,c,z))^2\right).$$

When the second stage converges, we return the DML-IV estimator $h_{\hat{\theta}}$.
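The second-stage loop can be sketched in a simplified setting. The sketch assumes a linear model $h_\theta(c,a)=\theta^\top\phi(c,a)$, so that $\hat{g}_k(h_\theta,c,z)=\theta^\top\bar{\phi}_k(c,z)$ with $\bar{\phi}_k$ the expected features under $\hat{F}_k$, and it uses plain mini-batch gradient descent. The paper uses DNNs; the linear model, the precomputed inputs, and all names below are assumptions made for illustration.

```python
import numpy as np

def dml_iv_second_stage(s_by_fold, phi_by_fold, lr=0.1, batch_size=32,
                        n_epochs=500, seed=0):
    """Toy second stage of DML-IV for a linear h_theta.

    s_by_fold[k]   : array (n_k,), s_hat_k evaluated on fold k, where s_hat_k
                     was fitted on the complement of fold k
    phi_by_fold[k] : array (n_k, d), expected features under F_hat_k on fold k,
                     so that g_hat_k(h_theta, c, z) = phi @ theta
    """
    rng = np.random.default_rng(seed)
    theta = np.zeros(phi_by_fold[0].shape[1])
    for _ in range(n_epochs):
        # Alternate between the K folds, one mini-batch update per fold.
        for s_k, phi_k in zip(s_by_fold, phi_by_fold):
            size = min(batch_size, len(s_k))
            batch = rng.choice(len(s_k), size=size, replace=False)
            # Residual of the score in Eq. 7: s_hat_k - g_hat_k(h_theta).
            resid = s_k[batch] - phi_k[batch] @ theta
            # Gradient of the mean squared residual with respect to theta.
            grad = -2.0 * phi_k[batch].T @ resid / size
            theta -= lr * grad
    return theta
```

The function takes nuisance outputs that were already computed per fold, which keeps the cross-fitting structure explicit: fold $k$ is only ever paired with estimators trained on its complement.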

To obtain the DML convergence rate guarantees Chernozhukov et al. [2018] for $h_{\hat{\theta}}$, i.e., for $\hat{\theta}$ to converge to the true parameters $\theta_0$ at the rate of $O(N^{-1/2})$ with high probability, there are two key conditions: i) Neyman orthogonality of the score function, and ii) the nuisance parameters should converge to their true values at the crude rate of $o(N^{-1/4})$. The Neyman orthogonal score is given in Theorem 3.1, so it remains to prove the convergence rate of the nuisance parameters. Define $\mathcal{G}_N$ to be the realisation set such that $\hat{g}_N$, the estimator of $g_0$ using a dataset of size $N$, takes values in this set. Similarly, define $\mathcal{S}_N$ to be the realisation set of $\hat{s}_N$. These realisation sets are properly shrinking neighbourhoods of the true functions $g_0$ and $s_0$, and we later provide Lemma 3.3, which describes the rate of shrinkage of these realisation sets, for which we require boundedness of the functions $g,s,h$ and the outcome variable $R$, as stated in Assumption 3.2.

Assumption 3.2.

We assume that (a): $g_0,s_0,h_0\in\mathcal{G},\mathcal{S},\mathcal{H}$ are all bounded, i.e., $\lVert g_0\rVert_\infty,\lVert s_0\rVert_\infty,\lVert h_0\rVert_\infty\leq B$; and (b): the outcome satisfies $\lVert R\rVert_\infty\leq B$, where $B\in\mathbb{R}^{+}$.

To improve readability, we provide here an informal statement of the lemma, which expresses the relationship between the critical radius Wainwright [2019], Bartlett et al. [2005] of the realisation sets and the convergence rate of the nuisance parameters. We defer the formal statement and the proof to Section C.1.

Lemma 3.3 (Informal: nuisance parameters convergence).
Footnote 2: See Lemma C.2 for the formal statement.

Suppose Assumption 3.2 holds, and let $\delta_N$ be an upper bound on the critical radius of the function spaces related to the realisation sets $\mathcal{S}_N$ and $\mathcal{G}_N$. Then, with probability $1-\zeta$:

$$\begin{aligned}
\lVert\hat{s}-s_0\rVert_2^2 &= O\left(\delta_N^2+\frac{\ln(1/\zeta)}{N}\right);\\
\lVert\hat{g}-g_0\rVert_2^2 &= O\left(\delta_N^2+\frac{\ln(1/\zeta)}{N}\right).
\end{aligned}$$

The critical radius is a quantity that describes the complexity of estimation, and it is typically shown that $\delta_N=O(d_N N^{-1/2})$ Chernozhukov et al. [2022b, 2021], where $d_N$ is the effective dimension of the hypothesis space (see Section C.3 for the derivation and formal definitions). This, together with Lemma 3.3, implies that $\lVert\hat{s}-s_0\rVert_2=O(d_N N^{-1/2})$. Therefore, for function classes with $d_N=o(N^{1/4})$, we have $\lVert\hat{s}-s_0\rVert_2=o(N^{-1/4})$ (and similarly for $\hat{g}$). This is a broad class of functions that covers many machine learning methods such as deep ReLU networks and shallow regression trees Chernozhukov et al. [2021]. It has also been shown that conditional density and expectation estimation used for $\hat{g}$ satisfies $d_N=o(N^{1/4})$ under mild assumptions Grünewälder [2018], Bilodeau et al. [2021]. We refer to Chernozhukov et al. [2021] for additional discussion and concrete convergence rates of nuisance estimators.
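The rate arithmetic behind this claim can be spelled out as follows. This is an illustrative calculation that treats the confidence level $\zeta$ as fixed and uses $\sqrt{a+b}\leq\sqrt{a}+\sqrt{b}$; it is not part of the formal statement in Lemma C.2.

```latex
\begin{align*}
\lVert \hat{s} - s_0 \rVert_2
  &= O\!\left(\sqrt{\delta_N^2 + \tfrac{\ln(1/\zeta)}{N}}\right)
   = O\!\left(\delta_N + \sqrt{\tfrac{\ln(1/\zeta)}{N}}\right)
   = O\!\left(d_N N^{-1/2} + N^{-1/2}\right), \\
d_N = o(N^{1/4})
  &\;\Longrightarrow\;
  d_N N^{-1/2} = o\!\left(N^{1/4} \cdot N^{-1/2}\right) = o\!\left(N^{-1/4}\right), \\
N^{-1/2} &= o\!\left(N^{-1/4}\right)
  \;\Longrightarrow\;
  \lVert \hat{s} - s_0 \rVert_2 = o\!\left(N^{-1/4}\right).
\end{align*}
```

The same steps apply to $\lVert\hat{g}-g_0\rVert_2$.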

Lemma 3.3 shows that the nuisance parameters converge to their true values at the rate of $o(N^{-1/4})$ if $d_N=o(N^{1/4})$, thus satisfying the second key condition for the DML convergence rate guarantees. This allows us, after checking some mild regularity and continuity conditions, to obtain the following theorem regarding the convergence of the DML-IV estimator by applying Theorem 3.3 of Chernozhukov et al. [2018], with proof deferred to Section C.1.

Theorem 3.4 (Convergence of the DML-IV estimator).

If the effective dimension satisfies $d_N=o(N^{1/4})$ for $\hat{s}$ and $\hat{g}$, and Assumptions 2.1 and 3.2 hold, then the DML-IV estimator $\hat{\theta}$ is concentrated in a $1/\sqrt{N}$ neighbourhood of $\theta_0$, and is approximately linear and centred Gaussian:

$$\sqrt{N}(\hat{\theta}-\theta_0)\rightarrow\mathcal{N}(0,\sigma^2)\text{ in distribution},$$

where the estimator variance is given by

$$\sigma^2\coloneqq J_0^{-1}\mathbb{E}[\psi(\mathcal{D},\theta_0,\eta_0)\psi(\mathcal{D},\theta_0,\eta_0)^T](J_0^{-1})^T,$$

which is constant w.r.t. $N$, and $J_0$ denotes the Jacobian matrix of $\mathbb{E}[\psi]$ w.r.t. $\theta$.

Theorem 3.4 states that, with adequately trained nuisance parameter estimators, the estimator error $\hat{\theta}-\theta_0$ is asymptotically normally distributed and its standard deviation shrinks at the rate of $N^{-1/2}$. This implies that $\hat{\theta}$ converges to $\theta_0$ at the rate $O(N^{-1/2})$ with high probability, which allows us to deduce suboptimality bounds for the policy induced by $h_{\hat{\theta}}$ in the next section.
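The sandwich form of the variance in Theorem 3.4 can be evaluated with a plug-in estimate, sketched below assuming NumPy. The sketch treats the score as a $d_\theta$-dimensional vector per observation, as the outer product in the formula suggests. How that vector and the Jacobian estimate are obtained is left to the caller, and the function name is hypothetical.

```python
import numpy as np

def sandwich_variance(jacobian, scores):
    """Plug-in estimate of J^{-1} E[psi psi^T] (J^{-1})^T.

    jacobian : array (d, d), an estimate of J_0
    scores   : array (n, d), one score vector per observation
    """
    j_inv = np.linalg.inv(jacobian)
    # Empirical second moment of the score: (1/n) * sum_i psi_i psi_i^T.
    second_moment = scores.T @ scores / scores.shape[0]
    return j_inv @ second_moment @ j_inv.T
```

Under the asymptotic normality stated in the theorem, approximate standard errors for $\hat{\theta}$ would then be `np.sqrt(np.diag(sigma2) / N)`.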

3.3 Suboptimality Bounds

From the DML-IV estimator $h_{\hat{\theta}}$, we retrieve (an estimate of) the induced optimal policy as $\hat{\pi}(c)\coloneqq\operatorname*{arg\,max}_a h_{\hat{\theta}}(c,a)$. Recall that the suboptimality of a policy is $\textrm{subopt}(\hat{\pi})\coloneqq V(\pi^*)-V(\hat{\pi})$. Next, we show a suboptimality bound for the DML-IV policy in terms of the sample size $N$.

Theorem 3.5 (Suboptimality Bounds).

Let the learnt policy from a dataset of size $N$ be $\hat{\pi}(c)\coloneqq\operatorname*{arg\,max}_a h_{\hat{\theta}}(c,a)$, where $\hat{\theta}$ is the DML-IV estimator. Let $L$ be a constant such that $\lvert h_\theta(C,A)-h_{\theta'}(C,A)\rvert\leq L\lVert\theta-\theta'\rVert$ for all $C$ in the support of $\mathbb{P}_{\textrm{test}}$, $A\in\mathcal{A}$, and $\theta,\theta'\in\Theta$. Then, for all $\zeta\in(0,1]$, we have that the suboptimality of $\hat{\pi}$ satisfies

$$\textrm{subopt}(\hat{\pi})=O\left(L\sqrt{\frac{\ln(1/\zeta)}{N}}\right),$$

with probability $1-\zeta$.

The proof is deferred to Section C.2. To the best of our knowledge, this is the first time that the convergence rate and suboptimality bounds of $O(N^{-1/2})$ have been proved for IV regression methods that use DL, matching the suboptimality bounds of the unconfounded bandit. On the other hand, most other DL-based IV regression methods only demonstrate that their estimators converge in the limit.
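To make the rate concrete, the following worked comparison treats $L$ and the constant hidden in the $O(\cdot)$ notation as fixed. It only illustrates how the bound of Theorem 3.5 scales and is not an additional result.

\[
\frac{\sqrt{\ln(1/\zeta)/(4N)}}{\sqrt{\ln(1/\zeta)/N}} = \frac{1}{2},
\qquad
\frac{\sqrt{\ln(1/0.01)/N}}{\sqrt{\ln(1/0.1)/N}} = \sqrt{\frac{\ln 100}{\ln 10}} = \sqrt{2}.
\]

Under these assumptions, quadrupling the dataset size halves the bound, whereas tightening the failure probability $\zeta$ from $0.1$ to $0.01$ inflates it only by a factor of $\sqrt{2} \approx 1.41$.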

4 Experimental Results

Figure 2: Results on the aeroplane ticket demand dataset with low-dimensional context. (a) The mean squared error of $\hat{h}$. (b) The average reward following the policy $\hat{\pi}$ derived from $\hat{h}$. (c) The average reward following $\hat{\pi}$ with out of training distribution context.

In this section, we empirically evaluate DML-IV for IV regression and offline IV bandit problems. In addition, we evaluate a computationally efficient version of DML-IV, referred to as CE-DML-IV, which does not apply $K$-fold cross-fitting. It trains $\hat{s}$ and $\hat{g}$ only once (instead of $K$ times) using the entire dataset, and can also be considered an ablation study on $K$-fold cross-fitting. Without $K$-fold cross-fitting, it lacks the theoretical convergence rate guarantees, but it still enjoys the partial debiasing effect Mackey et al. [2018] of the Neyman orthogonal score, accepting some additional bias in exchange for lower computational cost. We found that CE-DML-IV empirically performs as well as standard DML-IV on low-dimensional datasets. We provide details and discussion regarding CE-DML-IV in Section A.
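The difference between the two variants comes down to which samples are used to fit the nuisance estimators and which are used to evaluate the score. The sketch below contrasts the usual DML cross-fitting recipe (fit on the complement of a fold, evaluate on the fold) with a single fit on all data. The function names `k_fold_splits` and `single_fit_split` are hypothetical, and the exact fold construction used in the released code may differ.

```python
import random


def k_fold_splits(n, k, seed=0):
    """Return k (train_indices, held_out_indices) pairs for cross-fitting.

    In the usual DML recipe the nuisance estimators are fitted on
    train_indices and the score is evaluated on held_out_indices.
    """
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k roughly equal folds
    splits = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        splits.append((train, held_out))
    return splits


def single_fit_split(n):
    """No cross-fitting: nuisances are fitted and evaluated on all samples."""
    all_indices = list(range(n))
    return [(all_indices, all_indices)]


splits = k_fold_splits(n=10, k=5)
```

With `k` folds the nuisance estimators are trained `k` times, which is the cost that the single-fit variant avoids.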

Our evaluation considers both low- and high-dimensional contexts, as well as semi-synthetic real-world datasets. We compare our methods with leading modern IV regression methods: Deep IV Hartford et al. [2017], DeepGMM Bennett et al. [2019], KIV Singh et al. [2019] and DFIV Xu et al. [2020]. In this section, we use DNN estimators for both stages, with network architecture and hyper-parameters provided in Section F. Additional results of DML-IV using tree-based estimators such as Random Forests and Gradient Boosting are provided in Section G.2, where SOTA performance is also demonstrated. The algorithms are implemented using PyTorch Paszke et al. [2019], and the code is available on GitHub (https://github.com/shaodaqian/DML-IV).

4.1 Aeroplane Ticket Demand Dataset

Figure 3: Results on the aeroplane ticket demand dataset with high-dimensional context. (a) The mean squared error of $\hat{h}$. (b) The average reward following the policy $\hat{\pi}$ derived from $\hat{h}$. (c) The average reward following $\hat{\pi}$ with out of training distribution context.

We first conduct experiments for IV regression on the aeroplane ticket demand dataset, which is a synthetic dataset introduced by Hartford et al. [2017] that is now a standard benchmark for nonlinear IV methods. In this dataset, we aim to understand how ticket prices $p$ affect ticket sales $r$. We observe two context variables: the time of year $t \in [0,10]$ and the customer type $s \in [7]$, the latter categorised by the level of price sensitivity. Price and context affect sales through $h_0((t,s),p) = 100 + (10+p) \cdot s \cdot \psi(t) - 2p$, where $\psi(t)$ is a complex nonlinear function. However, the noise of $r$ and $p$ is correlated, which indicates the existence of unobserved confounders. The fuel price $z$ is introduced as an instrumental variable. Details of this dataset are included in Section D.1.
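The structural function above can be written down directly. In the sketch below, $\psi$ is passed in as an argument because its form is given in Section D.1 rather than here; `flat_psi` is a placeholder chosen only to make the example runnable and is not the benchmark's $\psi$. The helper `price_effect` is a hypothetical name for the partial derivative $\partial h_0 / \partial p = s \cdot \psi(t) - 2$, which follows from the formula.

```python
def h0(t, s, p, psi):
    """Structural demand function h0((t, s), p) = 100 + (10 + p) * s * psi(t) - 2p."""
    return 100.0 + (10.0 + p) * s * psi(t) - 2.0 * p


def price_effect(t, s, psi):
    """Marginal effect of price on sales: d h0 / d p = s * psi(t) - 2."""
    return s * psi(t) - 2.0


def flat_psi(t):
    """Placeholder for psi (assumption): constant 1, NOT the benchmark's psi."""
    return 1.0


sales_at_10 = h0(t=0.0, s=1, p=10.0, psi=flat_psi)
sales_at_11 = h0(t=0.0, s=1, p=11.0, psi=flat_psi)
```

Because $h_0$ is linear in $p$ for a fixed context, the difference `sales_at_11 - sales_at_10` equals `price_effect(0.0, 1, flat_psi)`; the dependence of this effect on $(t, s)$ is what makes the causal effect of price context-dependent.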

The results for learning $h_0$ with this dataset of various sizes are provided in Fig. 2(a). We ran each method 20 times and report the mean squared error (MSE) between the estimators $\hat{h}$ and $h_0$, showing the median and the 25th and 75th percentiles. DML-IV performs better than the other IV regression methods for all dataset sizes. CE-DML-IV, which requires significantly less computation, matches the performance of DML-IV in this case.
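The reporting protocol above (MSE per run, then median and quartiles across runs) can be sketched as follows. The helper names `mse` and `summarise_runs` are hypothetical, and the `"inclusive"` interpolation rule for the percentiles is an assumption, since the text does not state which convention was used.

```python
import statistics


def mse(h_hat, h0, test_points):
    """Mean squared error between an estimator and the true function on test points."""
    return sum((h_hat(x) - h0(x)) ** 2 for x in test_points) / len(test_points)


def summarise_runs(mse_per_run):
    """Median and 25th/75th percentiles of the MSE over repeated runs."""
    q25, median, q75 = statistics.quantiles(mse_per_run, n=4, method="inclusive")
    return {"q25": q25, "median": median, "q75": q75}


# Illustrative values only (not results from the paper): 20 made-up MSE values.
summary = summarise_runs([float(i) for i in range(1, 21)])
```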

High-Dimensional Feature Space

In real applications, we typically do not observe variables such as the customer type as explicit categories. Therefore, we follow Hartford et al. [2017] and consider the case where the customer type $s \in [7]$ is replaced by images of the corresponding handwritten digits from the MNIST dataset LeCun and Cortes [2010] to evaluate our methods with high-dimensional ($28^2 = 784$ dimensions) inputs. The task remains to learn $h_0$, but the algorithms are no longer explicitly given the 7 customer types, and instead have to infer the relationship between the image data and the outcome. Results for IV regression are plotted in Fig. 3(a), where DML-IV and CE-DML-IV outperform all other methods. In these high-dimensional settings, regularisation is heavily used to avoid overfitting. DML-IV demonstrates the benefits of using DML to reduce both the regularisation and overfitting bias caused by learning the nuisance parameters.

To demonstrate the robustness of DML-IV, we first provide a sensitivity analysis against hyperparameter changes in Section G.3. We evaluate DML-IV and CE-DML-IV on the aeroplane ticket demand datasets under a range of hyperparameters, where stable performance is observed. In addition, we consider the case when the IV is weakly correlated with the action in Section G.1, where we empirically demonstrate that DML-IV and CE-DML-IV perform significantly better than SOTA methods under weak instruments.

4.2 Offline IV Bandit

We also evaluate DML-IV's ability to learn good decision policies in the offline IV bandit problem. We reuse the aeroplane ticket demand dataset and aim to find the best pricing policy that maximises sales. From the learnt $\hat{h}$, for each context sampled from the test distribution, we retrieve the best action by uniformly sampling actions from the action space $\mathcal{A}$ and selecting the action for which $\hat{h}$ returns the highest value. Using this induced policy $\hat{\pi}$, we compare the expected reward following $\hat{\pi}$ over the test distribution.
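The action-selection step described above can be sketched as follows. This is a minimal illustration rather than the released implementation: the name `induced_policy`, the number of sampled actions, the one-dimensional interval action space, and the toy `toy_h_hat` are all assumptions introduced here.

```python
import random


def induced_policy(h_hat, context, action_low, action_high, n_samples=1000, rng=None):
    """Approximate argmax_a h_hat(context, a) by uniform sampling of actions."""
    rng = rng or random.Random(0)
    candidates = [rng.uniform(action_low, action_high) for _ in range(n_samples)]
    # Keep the sampled action with the highest predicted outcome.
    return max(candidates, key=lambda a: h_hat(context, a))


def toy_h_hat(c, a):
    """Toy stand-in for a learnt estimator: predicted outcome peaks at a == c."""
    return -(a - c) ** 2


best_action = induced_policy(toy_h_hat, context=3.0, action_low=0.0, action_high=10.0)
```

For this toy estimator the returned action lies close to the context value 3.0; the approximation error shrinks as `n_samples` grows.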

For the low-dimensional ticket demand dataset, we first set the test distribution to be the same as the training distribution and plot the average rewards in Fig. 2(b). In Fig. 2(c), we shift the test distribution out of the training distribution by incrementing the distribution of $t$ by $1$. For the high-dimensional setting, Fig. 3(b) and Fig. 3(c) demonstrate the expected rewards for test distributions in and out of the training distribution, respectively. There is a clear trend that a better fitted (low MSE) $\hat{h}$ leads to an induced policy with higher expected reward. In all cases, DML-IV outperforms all other methods, especially in the high-dimensional setting, where DML-IV consistently learns the near-optimal policy with only 2000 samples. CE-DML-IV, on the other hand, only matches the performance of DML-IV for the low-dimensional setting, but still outperforms the other methods in the high-dimensional setting.
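The in-distribution and shifted evaluations can be sketched as below. The sketch assumes that the shift is applied by adding 1 to each sampled $t$, and it uses a toy reward function and a toy policy (clipped to the training range of $t$) purely for illustration; `average_reward`, `toy_reward` and `clipped_policy` are hypothetical names.

```python
import random


def average_reward(policy, reward_fn, contexts):
    """Average true reward obtained by following `policy` over test contexts."""
    return sum(reward_fn(c, policy(c)) for c in contexts) / len(contexts)


def toy_reward(t, a):
    """Toy reward (assumption): highest when the action equals the context."""
    return -(a - t) ** 2


def clipped_policy(t):
    """Toy policy that is only correct on the training range t in [0, 10]."""
    return min(t, 10.0)


rng = random.Random(0)
in_distribution = [rng.uniform(0.0, 10.0) for _ in range(1000)]
shifted = [t + 1.0 for t in in_distribution]  # test contexts moved by +1

reward_in = average_reward(clipped_policy, toy_reward, in_distribution)
reward_shifted = average_reward(clipped_policy, toy_reward, shifted)
```

In this toy setting the policy is optimal on the training range, so any loss in `reward_shifted` comes entirely from contexts that fall outside it.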

We only compare with other IV regression methods because there are no offline bandit methods that consider the IV setting, and standard offline bandit algorithms (e.g., Valko et al. [2013], Jin et al. [2021], Nguyen-Tang et al. [2022]) fail to learn meaningful policies when the dataset is confounded, as demonstrated in Section E.

4.3 Real-World Decision Problem

Lastly, we test the performance of DML-IV on real-world datasets. The true counterfactual prediction function is rarely available for real-world data. Therefore, in line with previous approaches Shalit et al. [2017], Wu et al. [2023], Schwab et al. [2019], Bica et al. [2020], we instead consider two semi-synthetic real-world datasets: IHDP (https://www.fredjo.com/) Hill [2011] and PM-CMR (https://doi.org/10.23719/1506014) Wyatt et al. [2020]. We directly use the continuous variables from IHDP and PM-CMR as context variables, and generate the outcome variable with a nonlinear synthetic function following Wu et al. [2023]. There are 470 and 1350 training samples in IHDP and PM-CMR, respectively (for details see Section D.2). We again run each method 20 times; the MSE of $\hat{h}$ and the expected reward of the induced policy $\hat{\pi}$ on the test dataset are plotted in Fig. 4. DML-IV and CE-DML-IV demonstrate comparable, if not lower, MSE of fitting $\hat{h}$ than the other methods, while outperforming all other methods in average reward. This shows that our algorithm can reliably learn the counterfactual prediction function and policies with the highest average reward from real-world data.

Figure 4: The mean squared error of $\hat{h}$ and average reward following $\hat{\pi}$ for the real-world datasets.

5 Conclusion

We have proposed a novel method for instrumental variable regression, DML-IV. By leveraging IVs and DML on offline data, DML-IV can learn counterfactual predictions and effective decision policies with fast convergence rate and suboptimality guarantees by mitigating the regularisation and overfitting biases of DL. We evaluated DML-IV on IV regression benchmarks and IV bandit problems, including semi-synthetic real-world data, and showed experimentally that it outperforms SOTA IV regression methods.

Future work includes considering other estimation methods for the nuisance parameters following our Neyman-orthogonal score, and extending the method to sequential decision problems and reinforcement learning in the presence of hidden confounders Namkoong et al. [2020].

Acknowledgments

This work was supported by the EPSRC Prosperity Partnership FAIR (grant number EP/V056883/1). DS acknowledges funding from the Turing Institute and Accenture collaboration. AS was partially supported by AI Singapore, grant AISG2-RP-2020-018. FQ acknowledges funding from ELSA: European Lighthouse on Secure and Safe AI project (grant agreement No. 101070617 under UK guarantee). MK receives funding from the ERC under the European Union’s Horizon 2020 research and innovation programme (FUN2MODEL, grant agreement No. 834115).

Impact Statement

The goal of the paper is to develop a methodology to learn high-performing decision policies from offline data. There are many applications of our work in automated decision making, for example, in planning, healthcare, and finance. The theoretical guarantees that we provide support the reliability of the learnt policies and bound their suboptimality. We do not foresee negative implications of our methodology, but would caution against deploying it without human input and recommend additional validation in any new setting to reduce the risk of misapplication.

References

  • Andrews et al. [2019] I. Andrews, J. H. Stock, and L. Sun. Weak instruments in instrumental variables regression: Theory and practice. Annual Review of Economics, 11:727–753, 8 2019. ISSN 19411391. doi: 10.1146/ANNUREV-ECONOMICS-080218-025643/1.
  • Angelis et al. [2023] E. Angelis, F. Quinzan, A. Soleymani, P. Jaillet, and S. Bauer. Doubly robust structure identification from temporal data. arXiv preprint arXiv:2311.06012, 2023.
  • Angrist [1990] J. D. Angrist. Lifetime earnings and the vietnam era draft lottery: Evidence from social security administrative records. The American Economic Review, 80:1284–1286, 1990. ISSN 00028282.
  • Angrist and Pischke [2009] J. D. Angrist and J.-S. Pischke. Mostly Harmless Econometrics. Princeton University Press, 2 2009. doi: 10.2307/J.CTVCM4J72.
  • Angrist et al. [1996] J. D. Angrist, G. W. Imbens, and D. B. Rubin. Identification of causal effects using instrumental variables. Journal of the American Statistical Association, 91:444–455, 6 1996. ISSN 1537274X. doi: 10.1080/01621459.1996.10476902.
  • Bang and Robins [2005] H. Bang and J. M. Robins. Doubly robust estimation in missing data and causal inference models. Biometrics, 61(4):962–973, 2005.
  • Bareinboim and Pearl [2012] E. Bareinboim and J. Pearl. Causal inference by surrogate experiments: z-identifiability. Proceedings of the Twenty-Eighth Conference on Uncertainty in Artificial Intelligence, 2012.
  • Bartlett et al. [2005] P. L. Bartlett, O. Bousquet, and S. Mendelson. Local rademacher complexities. The Annals of Statistics, 33:1497-1537, 2005.
  • Belloni et al. [2012] A. Belloni, D. Chen, V. Chernozhukov, and C. Hansen. Sparse models and methods for optimal instruments with an application to eminent domain. Econometrica, 80(6):2369–2429, 2012.
  • Benkeser et al. [2017] D. Benkeser, M. Carone, M. V. D. Laan, and P. B. Gilbert. Doubly robust nonparametric inference on the average treatment effect. Biometrika, 104(4):863–880, 2017.
  • Bennett et al. [2019] A. Bennett, N. Kallus, and T. Schnabel. Deep generalized method of moments for instrumental variable analysis. Advances in Neural Information Processing Systems, 32, 2019. ISSN 10495258.
  • Bennett et al. [2021] A. Bennett, N. Kallus, L. Li, and A. Mousavi. Off-policy evaluation in infinite-horizon reinforcement learning with latent confounders. Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, pages 1999–2007, 3 2021. ISSN 2640-3498.
  • Bica et al. [2020] I. Bica, J. Jordon, and M. van der Schaar. Estimating the effects of continuous-valued interventions using generative adversarial networks. Advances in Neural Information Processing Systems, 2020-December, 2 2020. ISSN 10495258.
  • Bilodeau et al. [2021] B. Bilodeau, D. J. Foster, and D. M. Roy. Minimax rates for conditional density estimation via empirical entropy. Annals of Statistics, 51:762–790, 9 2021. doi: 10.1214/23-AOS2270. URL http://arxiv.longhoe.net/abs/2109.10461.
  • Blair et al. [1976] J. M. Blair, C. A. Edwards, and J. H. Johnson. Rational chebyshev approximations for the inverse of the error function. Mathematics of Computation, 30(136):827, 10 1976. ISSN 00255718. doi: 10.2307/2005402.
  • Blundell et al. [2007] R. Blundell, X. Chen, and D. Kristensen. Semi-nonparametric iv estimation of shape-invariant engel curves. Econometrica, 75:1613–1669, 11 2007. ISSN 1468-0262. doi: 10.1111/J.1468-0262.2007.00808.X.
  • Blundell et al. [2012] R. Blundell, J. L. Horowitz, and M. Parey. Measuring the price responsiveness of gasoline demand: Economic shape restrictions and nonparametric demand estimation. Quantitative Economics, 3:29–51, 3 2012. ISSN 1759-7331. doi: 10.3982/QE91.
  • Bound et al. [1995] J. Bound, D. A. Jaeger, and R. M. Baker. Problems with instrumental variables estimation when the correlation between the instruments and the endogeneous explanatory variable is weak. Journal of the American Statistical Association, 90:443, 6 1995. ISSN 01621459. doi: 10.2307/2291055.
  • Chen and Christensen [2018] X. Chen and T. M. Christensen. Optimal sup-norm rates and uniform inference on nonlinear functionals of nonparametric iv regression. Quantitative Economics, 9:39–84, 3 2018. ISSN 17597331. doi: 10.3982/qe722.
  • Chen et al. [2021] Y. Chen, L. Xu, C. Gulcehre, T. L. Paine, A. Gretton, N. de Freitas, and A. Doucet. On instrumental variable regression for deep offline policy evaluation. Journal of Machine Learning Research, 23, 5 2021. ISSN 15337928.
  • Chernozhukov et al. [2015] V. Chernozhukov, C. Hansen, and M. Spindler. Post-selection and post-regularization inference in linear models with many controls and instruments. American Economic Review, 105(5):486–490, 2015.
  • Chernozhukov et al. [2018] V. Chernozhukov, D. Chetverikov, M. Demirer, E. Duflo, C. Hansen, W. Newey, and J. Robins. Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1):C1–C68, 2018. ISSN 1368-4221. doi: 10.1111/ECTJ.12097.
  • Chernozhukov et al. [2021] V. Chernozhukov, W. K. Newey, V. Quintas-Martinez, and V. Syrgkanis. Automatic debiased machine learning via neural nets for generalized linear regression. 4 2021. URL https://arxiv.longhoe.net/abs/2104.14737v1.
  • Chernozhukov et al. [2022a] V. Chernozhukov, J. C. Escanciano, H. Ichimura, W. K. Newey, and J. M. Robins. Locally robust semiparametric estimation. Econometrica, 90(4):1501–1535, 7 2022a. ISSN 0012-9682. doi: 10.3982/ecta16294.
  • Chernozhukov et al. [2022b] V. Chernozhukov, W. Newey, V. Quintas-Martínez, and V. Syrgkanis. RieszNet and ForestRiesz: Automatic debiased machine learning with neural nets and random forests. Proceedings of Machine Learning Research, 162:3901–3914, 10 2022b. ISSN 26403498.
  • Chernozhukov et al. [2024] V. Chernozhukov, C. Hansen, N. Kallus, M. Spindler, and V. Syrgkanis. Applied causal inference powered by ml and ai. rem, 12(1):338, 2024.
  • Darolles et al. [2011] S. Darolles, Y. Fan, J. P. Florens, and E. Renault. Nonparametric instrumental regression. Econometrica, 79:1541–1565, 9 2011. ISSN 1468-0262. doi: 10.3982/ECTA6539.
  • Fu et al. [2022] Z. Fu, Z. Qi, Z. Wang, Z. Yang, Y. Xu, and M. R. Kosorok. Offline reinforcement learning with instrumental variables in confounded markov decision processes. 2022.
  • Funk et al. [2011] M. J. Funk, D. Westreich, C. Wiesen, T. Stürmer, M. A. Brookhart, and M. Davidian. Doubly robust estimation of causal effects. American journal of epidemiology, 173(7):761–767, 2011.
  • Goodfellow et al. [2014] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial networks. Communications of the ACM, 63:139–144, 6 2014. ISSN 15577317. doi: 10.1145/3422622.
  • Grünewälder [2018] S. Grünewälder. Plug-in estimators for conditional expectations and probabilities. Proceedings of the 21 International Conference on Artificial Intelligence and Statistics, pages 1513–1521, 3 2018. ISSN 2640-3498.
  • Hartford et al. [2017] J. Hartford, G. Lewis, K. Leyton-Brown, and M. Taddy. Deep IV: A flexible approach for counterfactual prediction. Proceedings of the 34th International Conference on Machine Learning, 2017. doi: 10.5555/3305381.3305527.
  • Hill [2011] J. L. Hill. Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics, 20:217–240, 3 2011. ISSN 10618600. doi: 10.1198/JCGS.2010.08162.
  • Ichimura and Newey [2022] H. Ichimura and W. K. Newey. The influence function of semiparametric estimators. Quantitative Economics, 13:29–61, 1 2022. ISSN 1759-7331. doi: 10.3982/QE826.
  • Jin et al. [2021] Y. Jin, Z. Yang, and Z. Wang. Is pessimism provably efficient for offline rl? International Conference on Machine Learning, 2021.
  • Jung et al. [2021] Y. Jung, J. Tian, and E. Bareinboim. Estimating identifiable causal effects through double machine learning. AAAI Conference on Artificial Intelligence, 2021.
  • LeCun and Cortes [2010] Y. LeCun and C. Cortes. Mnist handwritten digit database, 2010. URL http://yann.lecun.com/exdb/mnist/.
  • Li et al. [2021] J. Li, Y. Luo, X. Zhang, C. Ai, V. Chernozhukov, J. Dai, I. Fernandez-Val, J.-J. Forneron, W. Jiang, and H. Kaido. Causal reinforcement learning: An instrumental variable approach. SSRN Electronic Journal, 3 2021. doi: 10.2139/ssrn.3792824.
  • Liao et al. [2021] L. Liao, Z. Fu, Z. Yang, Y. Wang, M. Kolar, and Z. Wang. Instrumental variable value iteration for causal offline reinforcement learning. 2021. doi: CoRRabs/2102.09907.
  • Loshchilov and Hutter [2017] I. Loshchilov and F. Hutter. Decoupled weight decay regularization. 7th International Conference on Learning Representations, ICLR 2019, 11 2017.
  • Lu et al. [2020] Y. Lu, A. Meisami, A. Tewari, and Z. Yan. Regret analysis of bandit problems with causal background knowledge. Proceedings of the 36th Conference on Uncertainty in Artificial Intelligence, 10 2020.
  • Mackey et al. [2018] L. Mackey, V. Syrgkanis, and D. Zadik. Orthogonal machine learning: Power and limitations. 35th International Conference on Machine Learning, ICML 2018, 13:9112–9124, 11 2018.
  • Muandet et al. [2020] K. Muandet, A. Mehrjou, S. K. Lee, and A. Raj. Dual instrumental variable regression. Advances in Neural Information Processing Systems, 2020-December, 10 2020. ISSN 10495258.
  • Namkoong et al. [2020] H. Namkoong, R. Keramati, S. Yadlowsky, and E. Brunskill. Off-policy policy evaluation for sequential decisions under unobserved confounding. Advances in Neural Information Processing Systems, 33:18819–18831, 2020.
  • Nashed and Wahba [1974] M. Z. Nashed and G. Wahba. Generalized inverses in reproducing kernel spaces: An approach to regularization of linear operator equations. SIAM Journal on Mathematical Analysis, 5, 1974.
  • Newey and Powell [2003] W. K. Newey and J. L. Powell. Instrumental variable estimation of nonparametric models. Econometrica, 71:1565–1578, 9 2003. ISSN 1468-0262. doi: 10.1111/1468-0262.00459.
  • Neyman and Scott [1965] J. Neyman and E. L. Scott. Asymptotically optimal tests of composite hypotheses for randomized experiments with noncontrolled predictor variables. Journal of the American Statistical Association, 60:699–721, 1965. ISSN 1537274X. doi: 10.1080/01621459.1965.10480822.
  • Nguyen-Tang et al. [2022] T. Nguyen-Tang, S. Gupta, A. T. Nguyen, and S. Venkatesh. Offline neural contextual bandits: Pessimism, optimization and generalization. Proceeding of the International Conference on Learning Representations, 2022.
  • Pace et al. [2023] A. Pace, H. Y. Eche, B. Schölkopf, G. Rätsch, and G. Tennenholtz. Delphic offline reinforcement learning under nonidentifiable hidden confounding. Workshop on New Frontiers in Learning, Control, and Dynamical Systems at ICML, 6 2023.
  • Paszke et al. [2019] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Köpf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala. Pytorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32, 12 2019. ISSN 10495258.
  • Pearl [2000] J. Pearl. Causality: models, reasoning, and inference. Econometric Theory, 2000.
  • Quinzan et al. [2023] F. Quinzan, A. Soleymani, P. Jaillet, C. R. Rojas, and S. Bauer. Drcfs: Doubly robust causal feature selection. In International Conference on Machine Learning, pages 28468–28491, 2023.
  • Reiersöl [1945] O. Reiersöl. Confluence analysis by means of instrumental sets of variables. astronomi och fysik, 1945.
  • Robins et al. [1994] J. M. Robins, A. Rotnitzky, and L. P. Zhao. Estimation of regression coefficients when some regressors are not always observed. Journal of the American statistical Association, 89(427):846–866, 1994.
  • Robinson [1988] P. M. Robinson. Root-n-consistent semiparametric regression. Econometrica, 56:931, 7 1988. ISSN 00129682. doi: 10.2307/1912705.
  • Schwab et al. [2019] P. Schwab, L. Linhardt, S. Bauer, J. M. Buhmann, and W. Karlen. Learning counterfactual representations for estimating individual dose-response curves. AAAI 2020 - 34th AAAI Conference on Artificial Intelligence, pages 5612–5619, 2 2019. doi: 10.1609/aaai.v34i04.6014.
  • Shalit et al. [2017] U. Shalit, F. D. Johansson, and D. Sontag. Estimating individual treatment effect: generalization bounds and algorithms. 34th International Conference on Machine Learning, ICML 2017, 6:4709–4718, 6 2017.
  • Shpitser and Pearl [2008] I. Shpitser and J. Pearl. Complete identification methods for the causal hierarchy. Journal of Machine Learning Research, 9(64):1941–1979, 2008. ISSN 1533-7928.
  • Singh et al. [2019] R. Singh, M. Sahani, and A. Gretton. Kernel instrumental variable regression. Advances in Neural Information Processing Systems, 32, 6 2019. ISSN 10495258.
  • Słoczyński and Wooldridge [2018] T. Słoczyński and J. M. Wooldridge. A general double robustness result for estimating average treatment effects. Econometric Theory, 34(1):112–133, 2018.
  • Soleymani et al. [2022] A. Soleymani, A. Raj, S. Bauer, B. Schölkopf, and M. Besserve. Causal feature selection via orthogonal search. Transactions on Machine Learning Research, 2022.
  • Subramanian and Ravindran [2022] C. Subramanian and B. Ravindran. Causal contextual bandits with targeted interventions. In International Conference on Learning Representations, 1 2022.
  • Valko et al. [2013] M. Valko, N. Korda, R. Munos, I. Flaounas, and N. Cristianini. Finite-time analysis of kernelised contextual bandits. Uncertainty in Artificial Intelligence - Proceedings of the 29th Conference, UAI 2013, pages 654–663, 9 2013.
  • Van Handel [2014] R. Van Handel. Probability in high dimension. Lecture Notes (Princeton University), 2014.
  • Wainwright [2019] M. J. Wainwright. High-dimensional statistics: A non-asymptotic viewpoint. Cambridge University Press, pages 1–552, 1 2019. doi: 10.1017/9781108627771.
  • Weisstein [2023] E. W. Weisstein. Asymptotic notation, 2023. URL https://mathworld.wolfram.com/AsymptoticNotation.html.
  • Wright [1928] P. G. Wright. The tariff on animal and vegetable oils. https://doi.org/10.1086/254144, 38:619–620, 10 1928. ISSN 0022-3808. doi: 10.1086/254144.
  • Wu et al. [2023] A. Wu, K. Kuang, R. Xiong, M. Zhu, Y. Liu, B. Li, F. Liu, Z. Wang, and F. Wu. Learning instrumental variable from data fusion for treatment effect estimation. Proceedings of the AAAI Conference on Artificial Intelligence, 8 2023.
  • Wyatt et al. [2020] L. H. Wyatt, G. C. L. Peterson, T. J. Wade, L. M. Neas, and A. G. Rappold. Annual pm2.5 and cardiovascular mortality rate data: Trends modified by county socioeconomic status in 2,132 us counties. Data in brief, 30, 6 2020. ISSN 2352-3409. doi: 10.1016/J.DIB.2020.105318.
  • Xu et al. [2020] L. Xu, Y. Chen, S. Srinivasan, N. de Freitas, A. Doucet, and A. Gretton. Learning deep features in instrumental variable regression. ICLR 2021 - 9th International Conference on Learning Representations, 10 2020.
  • Xu et al. [2023] Y. Xu, J. Zhu, C. Shi, S. Luo, and R. Song. An instrumental variable approach to confounded off-policy evaluation. Proceedings of the 40th International Conference on Machine Learning, 2023.
  • Yang and Barron [1999] Y. Yang and A. Barron. Information-theoretic determination of minimax rates of convergence. Annals of Statistics, pages 1564–1599, 1999.
  • Zhang and Bareinboim [2020] J. Zhang and E. Bareinboim. Designing optimal dynamic treatment regimes: A causal reinforcement learning approach. Proceedings of the 37th International Conference on Machine Learning, page 119, 2020.
  • Zhang et al. [2022] J. Zhang, Y. Chen, P. G. Allen, and A. Singh. Causal bandits: Online decision-making in endogenous settings. A causal view on dynamical systems workshop at NeurIPS 2022, 2022.

Appendix

A Computationally Efficient CE-DML-IV

  Input: Dataset $\mathcal{D}$ with size $N$, mini-batch size $n_b$
  Output: The CE-DML-IV estimator $h_{\hat{\theta}}$
  Learn $\hat{s}$ and $\hat{g}$ using $\mathcal{D}$
  Initialise $h_{\hat{\theta}}$
  repeat
     Sample $n_b$ data $(c_i, z_i)$ from $\mathcal{D}$
     $\mathcal{L} = \hat{\mathds{E}}_{(c_i,z_i)}\left[(\hat{s}(c,z) - \hat{g}(h_{\theta},c,z))^2\right]$
     Update $\hat{\theta}$ to minimise loss $\mathcal{L}$
  until convergence
Algorithm 2 Computationally Efficient CE-DML-IV

The standard DML-IV with $K$-fold cross-fitting trains $\hat{s}$ and $\hat{g}$ $K$ times on different subsets of the dataset to tackle overfitting bias, but it is computationally expensive. Therefore, as mentioned in Section 4, we also evaluate CE-DML-IV, a computationally efficient version of DML-IV that does not apply $K$-fold cross-fitting and trains $\hat{s}$ and $\hat{g}$ only once using the entire dataset. It uses the same Neyman orthogonal score as the standard DML-IV, so it still enjoys the partial debiasing effect Mackey et al. [2018] from the Neyman orthogonal score. However, without $K$-fold cross-fitting, it lacks the theoretical convergence rate guarantees provided by Theorem 3.4 and Theorem 3.5. CE-DML-IV can be viewed as a trade-off between computational complexity and theoretical guarantees, and we found that CE-DML-IV empirically performs as well as standard DML-IV on low-dimensional datasets, where overfitting bias is not prevalent.
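The loop in Algorithm 2 can be illustrated on a toy problem. The sketch below is not the released PyTorch implementation: it assumes no context variable, zero-mean synthetic data, linear nuisance models fitted by least squares through the origin in place of DNNs, and a linear $h_\theta(a) = \theta a$, so that $\hat{g}(h_\theta, z) = \theta \cdot \hat{\mathds{E}}[A \mid z]$. All names, the data-generating process, and the step size are assumptions made for illustration.

```python
import random

rng = random.Random(0)
N = 5000
TRUE_EFFECT = 2.0

# Synthetic confounded data (assumption): e is a hidden confounder that
# affects both the action a and the reward r; z is the instrument.
z = [rng.gauss(0, 1) for _ in range(N)]
e = [rng.gauss(0, 1) for _ in range(N)]
a = [z[i] + e[i] + 0.1 * rng.gauss(0, 1) for i in range(N)]
r = [TRUE_EFFECT * a[i] + 2.0 * e[i] + 0.1 * rng.gauss(0, 1) for i in range(N)]


def slope(x, y):
    """Least-squares slope of y on x through the origin."""
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)


# Learn the nuisances once on the whole dataset (no cross-fitting).
b_r = slope(z, r)                  # s_hat(z) = b_r * z estimates E[R | Z = z]
b_a = slope(z, a)                  # b_a * z estimates E[A | Z = z]
s_hat = [b_r * zi for zi in z]
a_mean = [b_a * zi for zi in z]    # g_hat(h_theta, z) = theta * a_mean

# Mini-batch gradient descent on L = mean((s_hat - g_hat(h_theta))^2).
theta, learning_rate, n_b = 0.0, 0.05, 64
for _ in range(500):
    batch = [rng.randrange(N) for _ in range(n_b)]
    grad = sum(-2.0 * (s_hat[i] - theta * a_mean[i]) * a_mean[i] for i in batch) / n_b
    theta -= learning_rate * grad

# Naive regression of r on a ignores the confounder and is biased.
theta_naive = slope(a, r)
```

In this linear toy setting the minimiser of the loss is `b_r / b_a`, so `theta` should end up close to `TRUE_EFFECT`, while `theta_naive` absorbs the confounding through `e`. Cross-fitting would repeat the nuisance-fitting step on each fold's complement instead of fitting once on all data.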

B Standard Loss Function for IV Regression

The standard score (or loss) function for two-stage IV regression is $\ell = (R - g(h,c,z))^2$, as described in Eq. 3. This score is not Neyman orthogonal because, first of all, $\mathds{E}[(R - g_0(h_0,c,z))^2] = \mathds{E}[(R - \mathds{E}[R \mid C,Z])^2] \neq 0$, since $\mathds{E}[h_0 \mid C,Z] = \mathds{E}[R \mid C,Z]$ and $R - \mathds{E}[R \mid C,Z] \neq 0$ due to the noise on $R$.

Secondly, the derivative of the score $\mathds{E}[(R-g_0(h_0,C,Z))^2]$ with respect to small changes in $g$ is

\[
\begin{aligned}
&\frac{\partial}{\partial r}\mathds{E}\Bigl[(R-g_0(h_0,C,Z)-r\cdot g(h_0,C,Z))^2\Bigr]\\
&\quad=\frac{\partial}{\partial r}\mathds{E}\Bigl[(R-g_0(h_0,C,Z))^2-2r\cdot(R-g_0(h_0,C,Z))g(h_0,C,Z)+r^2\cdot g(h_0,C,Z)^2\Bigr]\\
&\quad=\mathds{E}\Bigl[-2(R-g_0(h_0,C,Z))g(h_0,C,Z)+2r\cdot g(h_0,C,Z)^2\Bigr],
\end{aligned}
\]

and, when $r=0$, this derivative evaluates to

\[
\mathds{E}[-2(R-g_0(h_0,C,Z))g(h_0,C,Z)]=\mathds{E}[-2(R-\mathds{E}[R\mid C,Z])g(h_0,C,Z)],
\]

which is not equal to 0 for general $g\in\mathcal{G}$, since $g(h_0,C,Z)$ and the residual $(R-\mathds{E}[R\mid C,Z])$ are generally correlated. Therefore, the standard score function for two-stage IV regression cannot be used to create a DML estimator.
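The first observation in this section, that the standard score does not vanish at the true nuisance because of the noise on $R$, can be illustrated with a small simulation. This is a toy sketch under stated assumptions, not an experiment from the paper: the outcome model $R=2Z+\text{noise}$ with $Z\sim\mathrm{Uniform}(0,1)$ and the function name `mean_squared_residual` are invented for illustration.

```python
import random


def mean_squared_residual(n_samples=20000, noise_std=1.0, seed=0):
    """Monte Carlo estimate of E[(R - E[R | Z])^2] in a toy model.

    Toy assumption: R = 2 * Z + noise with Z ~ Uniform(0, 1), so the true
    conditional mean is E[R | Z] = 2 * Z and the residual is pure noise.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        z = rng.random()
        r = 2.0 * z + rng.gauss(0.0, noise_std)
        total += (r - 2.0 * z) ** 2
    return total / n_samples


noisy = mean_squared_residual(noise_std=1.0)      # close to the noise variance
noiseless = mean_squared_residual(noise_std=0.0)  # vanishes only without noise
```

In this toy model the expected score at the truth equals the noise variance, so it is zero only in the degenerate noiseless case.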

C Omitted Proofs

In this section, we state all the conditions required to prove the $N^{-1/2}$ convergence rate guarantees for the DML-IV estimator, and provide the proofs omitted from the main paper for Theorem 3.1, Footnote 2, Theorem 3.4 and Theorem 3.5.

C.1 DML-IV $N^{-1/2}$ Convergence Rate Guarantees

To obtain $N^{-1/2}$ convergence rate guarantees for the DML-IV estimator, the following conditions must be satisfied.

Condition C.1 (Conditions for $N^{-1/2}$ convergence of DML, Assumptions 3.3 and 3.4 in Chernozhukov et al. [2018]).

For $N\geq 3$, all the following conditions hold. (a): The true parameter $\theta_0$ obeys $\mathds{E}[\psi(\mathcal{D};h_0,(s_0,g_0))]=0$ and $\Theta$ contains a ball of radius $c_1N^{-1/2}\log N$ centered at $\theta_0$. (b): The map $(\theta,(s,g))\mapsto\mathds{E}[\psi(\mathcal{D};h_\theta,(s,g))]$ is twice continuously Gateaux-differentiable. (c): For all $\theta\in\Theta$, the identification relationship

\[
\lVert\mathds{E}[\psi(\mathcal{D};h_\theta,(s_0,g_0))]\rVert\gtrsim\lVert J_0(\theta-\theta_0)\rVert \tag{8}
\]

is satisfied, where $J_0\coloneqq\partial_{\theta'}\{\mathds{E}[\psi(\mathcal{D};h_{\theta'},(s_0,g_0))]\}|_{\theta'=\theta_0}$ is the Jacobian matrix, with singular values strictly positive (bounded away from zero). (d): The score $\psi$ obeys Neyman orthogonality. (e): Let $K$ be a fixed integer. Given a random partition $\{I_k\}_{k=1}^{K}$ of indices $[N]$, each of size $n=N/K$, the nuisance parameter estimator $\hat{\eta}$ learnt using data with indices $I^c_k$ belongs to a shrinking realisation set $\mathcal{T}_N$, and the nuisance parameters are estimated at the $o(N^{-1/4})$ rate, i.e., $\lVert\hat{\eta}-\eta_0\rVert_2=o(N^{-1/4})$. (f): All eigenvalues of the matrix $\mathds{E}[\psi(\mathcal{D};h_{\theta_0},(s_0,g_0))\psi(\mathcal{D};h_{\theta_0},(s_0,g_0))^T]$ are strictly positive (bounded away from zero).

We will check all these conditions in Theorem 3.1, Lemma C.2 and Theorem 3.4.
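As a brief illustration of why the $o(N^{-1/4})$ rate in condition (e) is a natural requirement, consider the expected squared difference between the two estimated nuisances, evaluated at $h_0$. This is a sketch rather than part of the formal proof. It assumes Equation 2, i.e. $s_0(C,Z)=g_0(h_0,C,Z)$, treats the estimated nuisances as fixed functions (as they are on a held-out fold), and writes $\lVert\cdot\rVert_2$ for the $L_2$ norm over $(C,Z)$ with $\hat{g}$ and $g_0$ evaluated at $h_0$.

```latex
\begin{aligned}
\mathds{E}\bigl[(\hat{s}(C,Z)-\hat{g}(h_0,C,Z))^2\bigr]
 &= \mathds{E}\bigl[\bigl((\hat{s}-s_0)(C,Z)-(\hat{g}-g_0)(h_0,C,Z)\bigr)^2\bigr] \\
 &\le 2\lVert\hat{s}-s_0\rVert_2^2 + 2\lVert\hat{g}-g_0\rVert_2^2 \\
 &= o(N^{-1/2}).
\end{aligned}
```

The first equality uses $s_0=g_0(h_0,\cdot,\cdot)$, the inequality is $(a-b)^2\le 2a^2+2b^2$, and the last step squares the $o(N^{-1/4})$ rates. The nuisance error therefore enters only at second order, below the $N^{-1/2}$ target rate.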

Proof of Theorem 3.1.

Firstly, by Equation 2, we have $s_0(C,Z)=g_0(h_0,C,Z)$, thus

\[
\psi(\mathcal{D};h_0,(s_0,g_0))=\mathds{E}\Bigl[(s_0(C,Z)-g_0(h_0,C,Z))^2\Bigr]=0.
\]

Then we compute the derivative w.r.t. small changes in the nuisance parameters. For all $s,g\in\mathcal{S},\mathcal{G}$,

\[
\begin{aligned}
&\frac{\partial}{\partial r}\mathds{E}\Bigl[(s_0(C,Z)+r\cdot s(C,Z)-g_0(h_0,C,Z)-r\cdot g(h_0,C,Z))^2\Bigr]\\
&\quad=\frac{\partial}{\partial r}\mathds{E}\Bigl[2r(s_0(C,Z)-g_0(h_0,C,Z))(s(C,Z)-g(h_0,C,Z))+r^2(s(C,Z)-g(h_0,C,Z))^2\Bigr]\\
&\quad=\mathds{E}\Bigl[2(s_0(C,Z)-g_0(h_0,C,Z))(s(C,Z)-g(h_0,C,Z))+2r(s(C,Z)-g(h_0,C,Z))^2\Bigr],
\end{aligned}
\]

and, at $r=0$, the derivative evaluates to

\[
\begin{aligned}
&\mathds{E}\Bigl[2(s_0(C,Z)-g_0(h_0,C,Z))(s(C,Z)-g(h_0,C,Z))\Bigr]\\
&\quad=\mathds{E}\Bigl[0\times(s(C,Z)-g(h_0,C,Z))\Bigr]\\
&\quad=0\quad\forall s,g\in\mathcal{S},\mathcal{G},
\end{aligned}
\]

since $s_0(C,Z)=\mathds{E}[R\mid C,Z]=\mathds{E}[h_0\mid C,Z]=g_0(h_0,C,Z)$. Therefore, our moment function $\psi$ is Neyman orthogonal at $(h_0,(s_0,g_0))$. ∎
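The orthogonality argument in this proof can be checked numerically with a central finite difference in $r$. The sketch below is illustrative only: the functions `truth`, `shifted`, `s_dir` and `g_dir` are arbitrary toy choices, with `truth` playing the role of $s_0=g_0(h_0,\cdot,\cdot)$, and none of them come from the paper.

```python
import math
import random


def score_derivative_at_zero(s0, g0, s, g, points, eps=1e-4):
    """Central finite difference of r -> mean[((s0 - g0) + r * (s - g))^2] at r = 0."""

    def objective(r):
        return sum(
            ((s0(c, z) - g0(c, z)) + r * (s(c, z) - g(c, z))) ** 2
            for c, z in points
        ) / len(points)

    return (objective(eps) - objective(-eps)) / (2.0 * eps)


def truth(c, z):
    return c + 2.0 * z  # plays the role of s0 = g0(h0, ., .)


def shifted(c, z):
    return c + 2.0 * z + 0.5  # a nuisance that violates s0 = g0


def s_dir(c, z):
    return math.sin(c)  # arbitrary perturbation direction for s


def g_dir(c, z):
    return math.cos(z)  # arbitrary perturbation direction for g


rng = random.Random(0)
points = [(rng.random(), rng.random()) for _ in range(1000)]

orthogonal = score_derivative_at_zero(truth, truth, s_dir, g_dir, points)
violated = score_derivative_at_zero(shifted, truth, s_dir, g_dir, points)
# Analytic value 2 * mean[(s0 - g0) * (s - g)] when s0 - g0 = 0.5 everywhere.
expected = sum(2.0 * 0.5 * (math.sin(c) - math.cos(z)) for c, z in points) / len(points)
```

When the two nuisances agree, the derivative vanishes for any perturbation direction; once they differ, the first-order term $2\,\mathds{E}[(s_0-g_0)(s-g)]$ reappears.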

Lemma C.2 (Formal version of Footnote 2: Nuisance parameter convergence).

If Assumption 3.2 holds, let $\delta_N$ be an upper bound on the critical radius of the two following function spaces:

\[
\{(C,Z)\mapsto\gamma(s(C,Z)-s_0(C,Z)):s\in\mathcal{S}_N,\gamma\in[0,1]\}; \tag{9}
\]
\[
\{(C,Z)\mapsto\gamma(g(C,Z,h_0)-g_0(C,Z,h_0)):g\in\mathcal{G}_N,\gamma\in[0,1]\}, \tag{10}
\]

and suppose that all functions $f$ in the two spaces above satisfy $\lVert f\rVert_\infty\leq B$ for some $B\in\mathbb{R}^{+}$. Then, for some universal constants $c_1$ and $c_2$, we have that with probability $1-\zeta$:

\[
\begin{aligned}
\lVert\hat{s}-s_0\rVert_2^2&\leq c_1\left(\delta_N^2+\frac{B^2\log(1/\zeta)}{N}+\inf_{s_*\in\mathcal{S}_N}\lVert s_*-s_0\rVert_2^2\right);\\
\lVert\hat{g}-g_0\rVert_2^2&\leq c_2\left(\delta_N^2+\frac{B^2\log(1/\zeta)}{N}+\inf_{g_*\in\mathcal{G}_N}\lVert g_*-g_0\rVert_2^2\right).
\end{aligned}
\]

Proof of Lemma C.2.

We will mainly use the result from Theorem 1 of Chernozhukov et al. [2021], which states the following. For a function $\alpha$ that is the minimizer of a loss function that can be represented as $\mathds{E}[-2m(\mathcal{D},\alpha)+\alpha(x)^2]$, where $\mathcal{D}$ is the offline dataset and $m$ is some moment function that satisfies

\[
\mathds{E}[(m(W,\alpha)-m(W,\alpha'))^2]\leq M\lVert\alpha-\alpha'\rVert_2^2\quad\forall\alpha,\alpha'\in\mathcal{A}_N.
\]

Let $\delta_N$ be an upper bound on the critical radius of the two function spaces:

\[
\begin{aligned}
&\{W\mapsto\gamma(\alpha(W)-\alpha_0(W)):\alpha\in\mathcal{A}_N,\gamma\in[0,1]\};\\
&\{W\mapsto\gamma(m(W,\alpha)-m(W,\alpha_0)):\alpha\in\mathcal{A}_N,\gamma\in[0,1]\}.
\end{aligned}
\]

Then, if $\lVert\alpha\rVert_\infty\leq B$ for some $B\in\mathbb{R}^{+}$, there exists a universal constant $c$ such that with probability $1-\zeta$,

\[
\lVert\hat{\alpha}-\alpha_0\rVert_2^2\leq c\left(\delta_N^2+\frac{M\log(1/\zeta)}{N}+\inf_{\alpha_*\in\mathcal{A}_N}\lVert\alpha_*-\alpha_0\rVert_2^2\right).
\]

In our case, we show that the loss function for both $s$ and $g$ satisfies the above conditions, and thus Theorem 1 of Chernozhukov et al. [2021] is applicable to provide an upper bound on the convergence rate of our nuisance parameters.
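To see how the three terms of this bound interact with the $o(N^{-1/4})$ requirement on the nuisance parameters, the bound can be evaluated for a hypothetical critical radius. The sketch below is an assumption-laden illustration: the universal constant is unknown and set to the placeholder `c=1.0`, the critical radius is taken to be $\delta_N=N^{-0.3}$ purely for demonstration, and the approximation error is set to zero. The helper names are invented here.

```python
import math


def squared_error_bound(delta_n, m_const, zeta, n_samples, approx_error_sq, c=1.0):
    """Right-hand side c * (delta_N^2 + M log(1/zeta) / N + approximation error).

    `c` stands in for the unspecified universal constant; c = 1.0 is a placeholder.
    """
    return c * (delta_n ** 2 + m_const * math.log(1.0 / zeta) / n_samples + approx_error_sq)


def ratio_to_target(n_samples, radius_exponent=0.3, m_const=1.0, zeta=0.05):
    """Bound divided by N^{-1/2}, with hypothetical delta_N = N^{-radius_exponent}."""
    delta_n = n_samples ** (-radius_exponent)
    bound = squared_error_bound(delta_n, m_const, zeta, n_samples, approx_error_sq=0.0)
    return bound / n_samples ** (-0.5)


# The ratio shrinks as N grows, so the squared error is o(N^{-1/2}) in this setting.
ratios = [ratio_to_target(n) for n in (10**3, 10**4, 10**5, 10**6)]
```

A ratio that tends to zero means the squared $L_2$ error is $o(N^{-1/2})$, i.e. the error itself is $o(N^{-1/4})$. Actual critical radii depend on the function class and must be established separately.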

The loss function for $s\in\mathcal{S}_N$ is

\[
\begin{aligned}
s_0&=\operatorname*{arg\,min}_{s\in\mathcal{S}}\mathds{E}\left[(R-s(C,Z))^2\right]\\
&=\operatorname*{arg\,min}_{s\in\mathcal{S}}\mathds{E}\left[R^2-2Rs(C,Z)+s(C,Z)^2\right]\\
&=\operatorname*{arg\,min}_{s\in\mathcal{S}}\mathds{E}\left[-2Rs(C,Z)+s(C,Z)^2\right],
\end{aligned}
\]

where we can set $m(W,s)=Rs(C,Z)$ and check that

\[
\begin{aligned}
\mathds{E}[(Rs(C,Z)-Rs_0(C,Z))^2]&\leq\mathds{E}[R^2(s(C,Z)-s_0(C,Z))^2]\\
&\leq\lVert R^2\rVert_\infty\,\mathds{E}[(s(C,Z)-s_0(C,Z))^2]\\
&\leq B^2\lVert s(C,Z)-s_0(C,Z)\rVert_2^2,
\end{aligned}
\]

by Hölder's inequality and the assumption that $\lVert R\rVert_\infty\leq B$. Therefore, by Theorem 1 of Chernozhukov et al. [2021], there exists a universal constant $c_1$ such that with probability $1-\zeta$,

\[
\lVert\hat{s}-s_0\rVert_2^2\leq c_1\left(\delta_N^2+\frac{B^2\log(1/\zeta)}{N}+\inf_{s_*\in\mathcal{S}_N}\lVert s_*-s_0\rVert_2^2\right),
\]

where, as before, $\delta_N$ is an upper bound on the critical radius of the function spaces defined in Eq. (9).
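The step that rewrites the squared loss for $s$ as $\mathds{E}[-2Rs(C,Z)+s(C,Z)^2]$ relies on the two objectives differing only by $\mathds{E}[R^2]$, which does not depend on $s$. A short empirical check is sketched below. The data-generating process and the candidate functions are toy assumptions made for illustration and are unrelated to the paper's benchmarks.

```python
import random


def squared_loss(samples, s):
    """Empirical version of E[(R - s(C, Z))^2]."""
    return sum((r - s(c, z)) ** 2 for c, z, r in samples) / len(samples)


def reduced_loss(samples, s):
    """Empirical version of E[-2 R s(C, Z) + s(C, Z)^2], i.e. m(W, s) = R s(C, Z)."""
    return sum(-2.0 * r * s(c, z) + s(c, z) ** 2 for c, z, r in samples) / len(samples)


def candidate_a(c, z):
    return c + z


def candidate_b(c, z):
    return 3.0 * c - z


rng = random.Random(0)
samples = []
for _ in range(500):
    c, z = rng.random(), rng.random()
    samples.append((c, z, c + z + rng.gauss(0.0, 0.1)))  # toy outcome model

mean_r_sq = sum(r ** 2 for _, _, r in samples) / len(samples)

# The gap between the two losses is the same constant for every candidate s.
gap_a = squared_loss(samples, candidate_a) - reduced_loss(samples, candidate_a)
gap_b = squared_loss(samples, candidate_b) - reduced_loss(samples, candidate_b)
```

Because the gap equals the sample mean of $R^2$ for every candidate, both objectives share the same minimiser, which is what allows the loss to be cast in the form required by Theorem 1 of Chernozhukov et al. [2021].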

For the second part of the proof, recall that

g(h,c,z)𝑔𝑐𝑧\displaystyle g(h,c,z)italic_g ( italic_h , italic_c , italic_z ) =h(C,A)F(AC,Z)𝑑A,absent𝐶𝐴𝐹conditional𝐴𝐶𝑍differential-d𝐴\displaystyle=\int h(C,A)F(A\mid C,Z)dA,= ∫ italic_h ( italic_C , italic_A ) italic_F ( italic_A ∣ italic_C , italic_Z ) italic_d italic_A ,

where F(AC,Z)𝐹conditional𝐴𝐶𝑍F(A\mid C,Z)italic_F ( italic_A ∣ italic_C , italic_Z ) is some distribution over A𝐴Aitalic_A and F0(AC,Z)=(AC,Z)subscript𝐹0conditional𝐴𝐶𝑍conditional𝐴𝐶𝑍F_{0}(A\mid C,Z)=\mathds{P}(A\mid C,Z)italic_F start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ( italic_A ∣ italic_C , italic_Z ) = blackboard_P ( italic_A ∣ italic_C , italic_Z ) is the distribution of A𝐴Aitalic_A conditional on (C,Z)𝐶𝑍(C,Z)( italic_C , italic_Z ). Therefore, g0subscript𝑔0g_{0}italic_g start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT should minimise the following loss:

g0subscript𝑔0\displaystyle g_{0}italic_g start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT =argming𝒢𝔼[(h0(C,A)(AC,Z)𝑑Ag(C,Z,h0))2]absentsubscriptargmin𝑔𝒢𝔼delimited-[]superscriptsubscript0𝐶𝐴conditional𝐴𝐶𝑍differential-d𝐴𝑔𝐶𝑍subscript02\displaystyle=\operatorname*{arg\,min}_{g\in\mathcal{G}}\mathds{E}\left[\left(% \int h_{0}(C,A)\mathds{P}(A\mid C,Z)dA-g(C,Z,h_{0})\right)^{2}\right]= start_OPERATOR roman_arg roman_min end_OPERATOR start_POSTSUBSCRIPT italic_g ∈ caligraphic_G end_POSTSUBSCRIPT blackboard_E [ ( ∫ italic_h start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ( italic_C , italic_A ) blackboard_P ( italic_A ∣ italic_C , italic_Z ) italic_d italic_A - italic_g ( italic_C , italic_Z , italic_h start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ]
=argming𝒢𝔼[2h0(C,A)(AC,Z)𝑑Ag(C,Z,h0)+g(C,Z,h0)2],absentsubscriptargmin𝑔𝒢𝔼delimited-[]2subscript0𝐶𝐴conditional𝐴𝐶𝑍differential-d𝐴𝑔𝐶𝑍subscript0𝑔superscript𝐶𝑍subscript02\displaystyle=\operatorname*{arg\,min}_{g\in\mathcal{G}}\mathds{E}\left[-2\int h% _{0}(C,A)\mathds{P}(A\mid C,Z)dA\cdot g(C,Z,h_{0})+g(C,Z,h_{0})^{2}\right],= start_OPERATOR roman_arg roman_min end_OPERATOR start_POSTSUBSCRIPT italic_g ∈ caligraphic_G end_POSTSUBSCRIPT blackboard_E [ - 2 ∫ italic_h start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ( italic_C , italic_A ) blackboard_P ( italic_A ∣ italic_C , italic_Z ) italic_d italic_A ⋅ italic_g ( italic_C , italic_Z , italic_h start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) + italic_g ( italic_C , italic_Z , italic_h start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ] ,

where we can set m(𝒟,g)=h0(C,A)(AC,Z)𝑑Ag(C,Z,h0)𝑚𝒟𝑔subscript0𝐶𝐴conditional𝐴𝐶𝑍differential-d𝐴𝑔𝐶𝑍subscript0m(\mathcal{D},g)=\int h_{0}(C,A)\mathds{P}(A\mid C,Z)dA\cdot g(C,Z,h_{0})italic_m ( caligraphic_D , italic_g ) = ∫ italic_h start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ( italic_C , italic_A ) blackboard_P ( italic_A ∣ italic_C , italic_Z ) italic_d italic_A ⋅ italic_g ( italic_C , italic_Z , italic_h start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) and check that

\begin{align*}
& \mathds{E}\left[\left(\int h_{0}(C,A)\,\mathds{P}(A\mid C,Z)\,dA\cdot g(C,Z,h_{0})-\int h_{0}(C,A)\,\mathds{P}(A\mid C,Z)\,dA\cdot g_{0}(C,Z,h_{0})\right)^{2}\right] \\
={}& \mathds{E}\left[\left(\int h_{0}(C,A)\,\mathds{P}(A\mid C,Z)\,dA\right)^{2}\cdot\left(g(C,Z,h_{0})-g_{0}(C,Z,h_{0})\right)^{2}\right] \\
={}& \mathds{E}\left[g_{0}(C,Z,h_{0})^{2}\cdot\left(g(C,Z,h_{0})-g_{0}(C,Z,h_{0})\right)^{2}\right] \\
\leq{}& \lVert g_{0}^{2}\rVert_{\infty}\,\lVert g(C,Z,h_{0})-g_{0}(C,Z,h_{0})\rVert_{2}^{2} \\
\leq{}& B^{2}\,\lVert g(C,Z,h_{0})-g_{0}(C,Z,h_{0})\rVert_{2}^{2},
\end{align*}

by Hölder's inequality, where $B$ is a constant since $g$ is bounded. Therefore, by Theorem 1 of Chernozhukov et al. [2021], there exists a universal constant $c_{2}$ such that with probability $1-\zeta$,

\[ \lVert g-g_{0}\rVert_{2}^{2}\leq c_{2}\left(\delta_{N}^{2}+\frac{B^{2}\log(1/\zeta)}{N}+\inf_{g_{*}\in\mathcal{G}_{N}}\lVert g_{*}-g_{0}\rVert_{2}^{2}\right), \]

where again $\delta_{N}$ is an upper bound on the critical radius of the function spaces defined in Equation 10, which completes the proof. ∎

Now, we are ready to prove Theorem 3.4, which is our main theorem that states the $N^{-1/2}$ convergence rate guarantees for the DML-IV estimator.

Proof of Theorem 3.4.

We mainly use Theorem 3.3 from Chernozhukov et al. [2018], where properties of the DML estimator for non-linear scores are demonstrated. It states that, if Condition C.1 holds, the DML estimator $\hat{\theta}$ is concentrated in a $1/\sqrt{N}$ neighbourhood of $\theta_{0}$:

\[ \frac{\sqrt{N}}{\sigma}(\hat{\theta}-\theta_{0})=\frac{1}{\sqrt{N}}\sum\bar{\psi}(\mathcal{D}_{i})+O(\rho_{N})\rightarrow\mathcal{N}(0,1)\text{ in distribution}, \]

where $\bar{\psi}(\cdot)\coloneqq-\sigma^{-1}J_{0}^{-1}\psi(\cdot,\theta_{0},\eta_{0})$ is the influence function, $J_{0}$ is the Jacobian of $\psi$, the approximate variance is $\sigma^{2}\coloneqq J_{0}^{-1}\mathds{E}[\psi(\mathcal{D},\theta_{0},\eta_{0})\psi(\mathcal{D},\theta_{0},\eta_{0})^{T}](J_{0}^{-1})^{T}$, and the size of the remainder $\rho_{N}$ converges to 0. Therefore, we only need to check whether, under Assumptions 2.1 and 3.2, all parts of Condition C.1 for the DML $N^{-1/2}$ convergence rate are satisfied. Conditions (a) and (d) are satisfied by Theorem 3.1.
Condition (b) is satisfied since $(s-g)^{2}$ is twice continuously differentiable with respect to $s$ and $g$.

Condition (c) is a sufficient identifiability condition, which states that closeness of the loss function at the point $\theta$ to zero implies closeness of $\theta$ to $\theta_{0}$. This assumption is standard in conditional moment problems. To check Condition (c), we first point out that, under analytical assumptions for $s$, $g$, and $h$, we can write down the first-order Taylor expansion of the expected score function $\mathds{E}[\psi(\mathcal{D};h_{\theta},(s_{0},g_{0}))]$ around the point $\theta_{0}$,

\[ \mathds{E}[\psi(\mathcal{D};h_{\theta},(s_{0},g_{0}))]=\mathds{E}[\psi(\mathcal{D};h_{\theta_{0}},(s_{0},g_{0}))]+J_{0}(\theta-\theta_{0})+O(\lVert\theta-\theta_{0}\rVert^{2}). \]

Plugging in the validity of the score function $\psi(\mathcal{D};h_{\theta},(s_{0},g_{0}))$, i.e., $\mathds{E}[\psi(\mathcal{D};h_{\theta_{0}},(s_{0},g_{0}))]=0$, we infer that

\[ \lVert\mathds{E}[\psi(\mathcal{D};h_{\theta},(s_{0},g_{0}))]\rVert\gtrsim\lVert J_{0}(\theta-\theta_{0})\rVert. \]

Now for identifiability, we only need to assume that $J_{0}J_{0}^{T}$ is non-singular, which is a common technical assumption.

Condition (e) is satisfied since we have that the effective dimension $d_{N}=o(N^{1/4})$, and together with Lemma C.2 and the fact that the upper bound of the critical radius $\delta_{N}=O(d_{N}N^{-1/2})$ (see Section C.3), the nuisance parameters converge sufficiently quickly to ensure $\lVert\hat{s}-s_{0}\rVert_{2}\leq O(\delta_{N}+N^{-1/2})=O(d_{N}N^{-1/2})=o(N^{-1/4})$ and $\lVert\hat{g}-g_{0}\rVert_{2}\leq O(\delta_{N}+N^{-1/2})=O(d_{N}N^{-1/2})=o(N^{-1/4})$.
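
To make the rate arithmetic in Condition (e) concrete, consider as an illustration (an assumption made here, not a setting taken from the paper) an effective dimension with polynomial growth $d_{N}=N^{a}$ for some $0\leq a<1/4$. Then

\[ \delta_{N}=O(d_{N}N^{-1/2})=O(N^{a-1/2}),\qquad \frac{N^{a-1/2}}{N^{-1/4}}=N^{a-1/4}\rightarrow 0\quad\text{as } N\rightarrow\infty, \]

so both nuisance errors are $o(N^{-1/4})$ in this example. For instance, $a=1/8$ gives a nuisance rate of $O(N^{-3/8})$, which is faster than $N^{-1/4}$, whereas the boundary case $a=1/4$ would only give $O(N^{-1/4})$ and would not meet the requirement.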

Condition (f) is the non-degeneracy assumption for the covariance of the score function $\psi(\mathcal{D};h_{\theta},(s_{0},g_{0}))$. By definition,

\[ \mathds{E}[\psi(\mathcal{D};h_{\theta},(s_{0},g_{0}))\psi(\mathcal{D};h_{\theta},(s_{0},g_{0}))^{T}]=\int\psi(\mathcal{D};h_{\theta},(s_{0},g_{0}))\psi(\mathcal{D};h_{\theta},(s_{0},g_{0}))^{T}\,d\mathds{P}(\mathcal{D}). \]

By the trace trick, for each datapoint $\mathcal{D}$, the only eigenvalue of $\psi(\mathcal{D};h_{\theta},(s_{0},g_{0}))\psi(\mathcal{D};h_{\theta},(s_{0},g_{0}))^{T}$ is $\lVert\psi(\mathcal{D};h_{\theta},(s_{0},g_{0}))\rVert^{2}\geq 0$, with $\psi(\mathcal{D};h_{\theta},(s_{0},g_{0}))$ as the corresponding eigenvector.
Therefore, $\mathds{E}[\psi(\mathcal{D};h_{\theta},(s_{0},g_{0}))\psi(\mathcal{D};h_{\theta},(s_{0},g_{0}))^{T}]$ is positive-definite if, for each member $d$ of the support of $\mathds{P}$, which is the distribution of $\mathcal{D}$, there are at least as many eigenvectors of $d$ as the number of dimensions of $\psi(\mathcal{D};h_{\theta},(s_{0},g_{0}))$. This is true in our setting, as the co-domain of $\psi(\mathcal{D};h_{\theta},(s_{0},g_{0}))$ is $\mathbb{R}$.

Therefore, all conditions for Theorem 3.3 of Chernozhukov et al. [2018] to hold are satisfied, which concludes the proof. ∎

C.2 Suboptimality

Proof of Theorem 3.5.

From Theorem 3.4, we have that the parameters $\hat{\theta}$ for $h_{\hat{\theta}}$ learned from a dataset of size $N$ using DML-IV satisfy $(\hat{\theta}-\theta_{0})\xrightarrow{d}\mathcal{N}(0,\sigma^{2}/N)$, where $\sigma^{2}$ is the DML-IV estimator variance. This means that, for all $\epsilon>0$ and $\zeta>0$, there exists an integer $K>0$ such that for all $N\geq K$,

\[ \mathds{P}(\lVert\hat{\theta}-\theta_{0}\rVert>\epsilon)\leq 1-\Phi\left(\epsilon\cdot\sqrt{N}/\sigma\right)+\zeta/2, \]

where $\Phi$ is the CDF of a standard Gaussian distribution. If we assume $L$ to be a constant such that $\lvert h_{\theta}(C,A)-h_{\theta^{\prime}}(C,A)\rvert\leq L\lVert\theta-\theta^{\prime}\rVert$ for all $C,A\in\textrm{supp}^{M}(C,A)$ and $\theta,\theta^{\prime}\in\Theta$, we have that for all $\epsilon>0$ and $\zeta>0$, there exists an integer $K>0$ such that for all $N\geq K$,

\[ \mathds{P}(\lvert h_{\hat{\theta}}(C,A)-h_{\theta_{0}}(C,A)\rvert>L\cdot\epsilon)\leq 1-\Phi(\epsilon\cdot\sqrt{N}/\sigma)+\zeta/2\quad\forall C,A\in\textrm{supp}^{M}(C,A). \tag{11} \]

Next, we can show that the suboptimality of $\hat{\pi}$ satisfies

\begin{align*}
\textrm{subopt}(\hat{\pi}) &= V(\pi^{*})-V(\hat{\pi}) \\
&= \mathds{E}_{C\sim\mathds{P}_{\textrm{test}}}[R\mid C,do(A=\pi^{*}(c))]-\mathds{E}_{C\sim\mathds{P}_{\textrm{test}}}[R\mid C,do(A=\hat{\pi}(c))] \\
&= \mathds{E}_{C\sim\mathds{P}_{\textrm{test}}}[f_{r}(C,\pi^{*}(C))-f_{r}(C,\hat{\pi}(C))] \\
&= \mathds{E}_{C\sim\mathds{P}_{\textrm{test}}}[h(C,\pi^{*}(C))-h(C,\hat{\pi}(C))] \\
&\leq \max_{c\in\textrm{supp}(\mathds{P}_{\textrm{test}})}\left(h(c,\pi^{*}(c))-h(c,\hat{\pi}(c))\right) \\
&\leq \max_{c\in\textrm{supp}(\mathds{P}_{\textrm{test}})}\lvert h(c,\pi^{*}(c))-h_{\hat{\theta}}(c,\pi^{*}(c))\rvert+(h_{\hat{\theta}}(c,\pi^{*}(c))-h_{\hat{\theta}}(c,\hat{\pi}(c))) \\
&\quad\quad+\lvert h_{\hat{\theta}}(c,\hat{\pi}(c))-h(c,\hat{\pi}(c))\rvert \\
&\leq 2L\cdot\epsilon\quad\text{ with probability }\left(\Phi(\epsilon\cdot\sqrt{N}/\sigma)-\zeta/2\right) \tag{12}
\end{align*}

where $\textrm{supp}(\mathds{P}_{\textrm{test}})$ is the support of $\mathds{P}_{\textrm{test}}$, and the last inequality follows from Equation 11 and the fact that $h_{\hat{\theta}}(C,\pi^{*}(C))-h_{\hat{\theta}}(C,\hat{\pi}(C))\leq 0$. Setting $\Phi(\epsilon\cdot\sqrt{N}/\sigma)=1-\zeta/2$ in Equation 12 and substituting $\epsilon$ yields

\[ \textrm{subopt}(\hat{\pi})\leq 2L\,\Phi^{-1}(1-\zeta/2)\,\sigma/\sqrt{N}\quad\text{ with probability }1-\zeta. \]

From the approximation for the inverse of the error function (erf) by Blair et al. [1976], we have that for all $y\in(0,1]$, $\Phi^{-1}(1-y)\leq\sqrt{-2\ln(y)}$. Thus, we conclude that there exists $K>0$ such that for all $N>K$

\[ \textrm{subopt}(\hat{\pi}_{N})\leq 2\sqrt{2}\,L\sigma\sqrt{\frac{\ln(2/\zeta)}{N}}\quad\text{ with probability }1-\zeta, \]

which completes the proof. ∎
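
As a numerical sanity check of the two forms of the suboptimality bound in this proof, the following minimal Python sketch evaluates the quantile form $2L\,\Phi^{-1}(1-\zeta/2)\,\sigma/\sqrt{N}$ and the closed form $2\sqrt{2}L\sigma\sqrt{\ln(2/\zeta)/N}$. The function names and the values chosen for $L$, $\sigma$ and $\zeta$ are hypothetical placeholders introduced here for illustration; they are not taken from the paper or its experiments.

```python
import math
from statistics import NormalDist


def subopt_bound_quantile(L, sigma, N, zeta):
    """Quantile form of the bound: 2 * L * Phi^{-1}(1 - zeta/2) * sigma / sqrt(N)."""
    z = NormalDist().inv_cdf(1.0 - zeta / 2.0)  # standard Gaussian quantile
    return 2.0 * L * z * sigma / math.sqrt(N)


def subopt_bound_log(L, sigma, N, zeta):
    """Closed form of the bound: 2 * sqrt(2) * L * sigma * sqrt(ln(2/zeta) / N)."""
    return 2.0 * math.sqrt(2.0) * L * sigma * math.sqrt(math.log(2.0 / zeta) / N)


# Hypothetical constants, chosen only to illustrate the N^{-1/2} scaling.
L_CONST, SIGMA, ZETA = 1.0, 1.0, 0.05

for n in (100, 10_000, 1_000_000):
    q = subopt_bound_quantile(L_CONST, SIGMA, n, ZETA)
    c = subopt_bound_log(L_CONST, SIGMA, n, ZETA)
    # The closed form is the looser of the two, since
    # Phi^{-1}(1 - y) <= sqrt(-2 ln y) with y = zeta / 2.
    print(f"N={n:>9}: quantile form={q:.5f}, closed form={c:.5f}")
```

Both quantities shrink by a factor of 10 whenever $N$ grows by a factor of 100, which is the $O(N^{-1/2})$ behaviour stated in Theorem 3.5; the closed form trades some tightness for an expression that does not involve $\Phi^{-1}$.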

C.3 Critical Radius and Effective Dimension

Definition C.3 (Wainwright [2019]).

The critical radius, denoted by $\delta_{N}$, is defined as the minimum $\delta$ that satisfies the following upper bound on the local Gaussian complexity of a star-shaped function class $\mathcal{F}^{*}$\footnote{A function class $\mathcal{F}$ is star-shaped if for every $f\in\mathcal{F}$ and $\alpha\in[0,1]$, we have $\alpha f\in\mathcal{F}$.}, $\mathcal{G}(\mathcal{F}^{*},\delta)\leq\delta^{2}/2$, where the local Gaussian complexity is defined as

\[ \mathcal{G}(\mathcal{F}^{*},\delta)=\mathds{E}_{\epsilon}\left[\sup_{g\in\mathcal{F}^{*}:\lVert g\rVert_{N}\leq\delta}\langle\epsilon,g\rangle\right], \]

with $\epsilon$ being a random i.i.d. zero-mean Gaussian vector.

The critical radius is a standard notion used to bound the estimation error in regression problems. Since the local Gaussian complexity can be viewed as the expected value of the supremum of a stochastic process indexed by $g$, we can apply tools from empirical process theory, namely Dudley's entropy integral [Wainwright, 2019, Van Handel, 2014], to provide a bound on the critical radius,

\[ \mathcal{G}(\mathcal{F}^{*},\delta)\leq\inf_{\alpha\geq 0}\left\{\alpha+\frac{1}{\sqrt{N}}\int_{\alpha/4}^{\delta}\sqrt{\log\mathcal{N}(\mathcal{F}^{*},L^{2}(P_{N}),\epsilon)}\,d\epsilon\right\}, \]

where $\mathcal{N}(\mathcal{F}^{*},L^{2}(P_{N}),\epsilon)$ is the $\epsilon$-covering number of the function class $\mathcal{F}^{*}$ in the $L^{2}(P_{N})$ norm. Now, by setting $\alpha=0$, when the integrand takes a single-scale value of $\sqrt{\log\mathcal{N}(\mathcal{F}^{*},L^{2}(P_{N}),\epsilon)}$, we infer that

\[
\mathcal{G}(\mathcal{F}^{*},\delta)\leq\frac{\delta}{\sqrt{N}}\sqrt{\log\mathcal{N}(\mathcal{F}^{*},L^{2}(P_{N}),\epsilon)}.
\]

Thus, the critical radius will be upper bounded by

\[
\delta_{N}\lesssim\frac{\sqrt{\log\mathcal{N}(\mathcal{F}^{*},L^{2}(P_{N}),\epsilon)}}{\sqrt{N}}=O(d_{N}N^{-1/2}).
\]

Chernozhukov et al. [2022b, 2021] referred to $d_{N}=\sqrt{\log\mathcal{N}(\mathcal{F}^{*},L^{2}(P_{N}),\epsilon)}$ as the effective dimension of the hypothesis space. Note that this matches the minimax lower bound of fixed design estimation for this setting [Yang and Barron, 1999].

D Dataset Details

In this section, we provide details of the datasets considered in this paper.

D.1 Aeroplane Ticket Demand Dataset

Here, we describe the aeroplane ticket demand dataset, first introduced by Hartford et al. [2017]. The observable variables are generated by the following model:

\begin{align*}
r&=h_{0}((t,s),p)+\epsilon,\quad\mathds{E}[\epsilon\mid t,s,p]=0;\\
p&=25+(z+3)\psi(t)+\omega,
\end{align*}

where $r$ is the ticket sales (the outcome variable) and $p$ is the ticket price (the action variable). $(t,s)$ are observed context variables, where $t$ is the time of year and $s$ is the customer type. The fuel price $z$ is introduced as an instrumental variable, which only affects the ticket price $p$. The noises $\epsilon$ and $\omega$ are correlated with correlation $\rho\in[0,1]$, where in our experiments we set $\rho=0.9$. $h_{0}$ is the true counterfactual prediction function, defined as

\begin{align*}
h_{0}((t,s),p)&=100+(10+p)\cdot s\cdot\psi(t)-2p,\\
\psi(t)&=2\left(\frac{(t-5)^{4}}{600}+\exp(-4(t-5)^{2})+\frac{t}{10}-2\right),
\end{align*}

where $\psi(t)$ is a complex non-linear function of $t$ plotted in Fig. 5. The offline dataset is sampled with the following distributions:

\begin{align*}
s&\sim\text{Unif}\{1,\ldots,7\}\\
t&\sim\text{Unif}(0,10)\\
z&\sim\mathcal{N}(0,1)\\
\omega&\sim\mathcal{N}(0,1)\\
\epsilon&\sim\mathcal{N}(\rho\omega,1-\rho^{2}).
\end{align*}

From the observations $(r,p,t,s,z)$, we estimate $\hat{h}$ using IV regression methods, and the mean squared error between $\hat{h}$ and the true causal function $h_{0}$ is computed on 10000 random samples from the above model. For the out-of-distribution test samples, we sample $t\sim\text{Unif}(1,11)$ instead.

Following Hartford et al. [2017], we standardise the action and outcome variables $p$ and $r$ so that each has a mean of zero and a standard deviation of one. This is standard practice for DNN training, which improves training stability and optimisation efficiency.
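
The generative model above can be sketched in a few lines of standard-library Python. This is an illustrative sketch, not the code used for the experiments: the function names (`psi`, `h0`, `sample_demand`, `standardise`), the seed handling and the use of the population standard deviation for standardisation are assumptions made here.

```python
import math
import random
import statistics


def psi(t):
    """Non-linear function psi(t) from Section D.1."""
    return 2.0 * ((t - 5.0) ** 4 / 600.0
                  + math.exp(-4.0 * (t - 5.0) ** 2)
                  + t / 10.0 - 2.0)


def h0(t, s, p):
    """True counterfactual prediction function h_0((t, s), p)."""
    return 100.0 + (10.0 + p) * s * psi(t) - 2.0 * p


def sample_demand(n, rho=0.9, t_range=(0.0, 10.0), seed=0):
    """Draw n observations (r, p, t, s, z) from the model above."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        s = rng.randint(1, 7)          # customer type, Unif{1,...,7}
        t = rng.uniform(*t_range)      # time of year
        z = rng.gauss(0.0, 1.0)        # instrument: fuel price
        omega = rng.gauss(0.0, 1.0)    # noise in the price
        # Outcome noise with mean rho * omega and variance 1 - rho^2.
        eps = rng.gauss(rho * omega, math.sqrt(1.0 - rho ** 2))
        p = 25.0 + (z + 3.0) * psi(t) + omega
        r = h0(t, s, p) + eps
        rows.append((r, p, t, s, z))
    return rows


def standardise(values):
    """Rescale to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


data = sample_demand(5000)
r_std = standardise([row[0] for row in data])
p_std = standardise([row[1] for row in data])
```

Out-of-distribution test samples would correspond to calling `sample_demand(n, t_range=(1.0, 11.0))` under the same assumptions.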

Figure 5: A graph of the nonlinear function $\psi(t)$ in the aeroplane ticket demand dataset.

High-Dimensional Setting

For the high-dimensional setting, we again follow Hartford et al. [2017] and replace the customer type $s\in[7]$ in the low-dimensional setting with images of the corresponding handwritten digits from the MNIST dataset LeCun and Cortes [2010]. For each digit $d\in[7]$, we select a random MNIST image from the digit class $d$ as the new customer type variable $s$. The images are $28\times 28=784$ dimensional.

D.2 Real-World Datasets

Following previously studied causal inference methods Shalit et al. [2017], Wu et al. [2023], Schwab et al. [2019], Bica et al. [2020], we consider two semi-synthetic real-world datasets, IHDP (https://www.fredjo.com/) Hill [2011] and PM-CMR (https://doi.org/10.23719/1506014) Wyatt et al. [2020], for experiments, since the true counterfactual prediction function is rarely available for real-world datasets.

IHDP, the Infant Health and Development Program, comprises 747 units with 6 pre-treatment continuous variables, one action variable and 19 discrete variables related to the children and their mothers. It aims to evaluate the effect of specialist home visits on the future cognitive test scores of premature infants. From the original data, we select all 6 continuous covariates as our context variable $C$.

PM-CMR studies the impact of PM2.5 particle level on the cardiovascular mortality rate (CMR) in 2132 counties in the United States using data provided by the National Studies on Air Pollution and Health Wyatt et al. [2020]. We use 6 continuous variables about CMR in each city as our context variable $C$.

Following Wu et al. [2023], from the context variables $C$ obtained from real-world datasets, we generate the instrument $Z$, the action $A$ and the outcome $R$ using the following model:

\begin{align*}
&Z\sim\mathds{P}(Z=z)=1/K,\quad z\in[1..K];\\
&A=\sum_{z=1}^{K}1_{Z=z}\sum_{i=1}^{d_{C}}w_{iz}(C_{i}+0.2\epsilon+f_{z}(z))+\delta_{A},\quad w_{iz}\sim\text{Unif}(-1,1);\\
&R=9A^{2}-1.5A+\sum_{i=1}^{d_{C}}\frac{C_{i}}{d_{C}}+\lvert C_{1}C_{2}\rvert-\sin(10+C_{2}C_{3})+2\epsilon+\delta_{R},
\end{align*}

where $C_{i}$ denotes the $i$-th variable in $C$, $f_{z}$ is a function that returns different constants depending on the input $z$, $\delta_{R},\delta_{A}\sim\mathcal{N}(0,1)$ and $\epsilon\sim\mathcal{N}(0,0.1)$ is the unobserved confounder. The fully generated semi-synthetic datasets IHDP and PM-CMR have 747 and 2132 samples respectively, and we randomly split them into training (63%), validation (27%), and testing (10%) following Wu et al. [2023].
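
As a rough illustration, the semi-synthetic model can be written as follows. Several details are not fixed by the description above and are assumptions of this sketch only: $K=3$, the hypothetical choice $f_{z}(z)=z$, reading $0.1$ in $\mathcal{N}(0,0.1)$ as a variance, drawing the weights $w_{iz}$ once per dataset, and the random placeholder contexts standing in for the real IHDP covariates.

```python
import math
import random


def generate_semi_synthetic(contexts, K=3, seed=0, f_z=lambda z: float(z)):
    """Generate (Z, A, R) for each context row C (a list of d_C floats).

    K and f_z are hypothetical choices made for this sketch.
    """
    rng = random.Random(seed)
    d_c = len(contexts[0])
    # w[i][z-1] ~ Unif(-1, 1); assumed to be drawn once and shared by all units.
    w = [[rng.uniform(-1.0, 1.0) for _ in range(K)] for _ in range(d_c)]
    rows = []
    for c in contexts:
        z = rng.randint(1, K)                  # P(Z = z) = 1/K
        eps = rng.gauss(0.0, math.sqrt(0.1))   # confounder; 0.1 read as variance
        delta_a = rng.gauss(0.0, 1.0)
        delta_r = rng.gauss(0.0, 1.0)
        # The indicator 1_{Z=z} keeps only the weights of the realised instrument.
        a = sum(w[i][z - 1] * (c[i] + 0.2 * eps + f_z(z))
                for i in range(d_c)) + delta_a
        r = (9.0 * a ** 2 - 1.5 * a + sum(c) / d_c + abs(c[0] * c[1])
             - math.sin(10.0 + c[1] * c[2]) + 2.0 * eps + delta_r)
        rows.append((z, a, r))
    return rows


def split(rows, fractions=(0.63, 0.27, 0.10), seed=0):
    """Random train/validation/test split with the stated proportions."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_train = round(fractions[0] * len(shuffled))
    n_val = round(fractions[1] * len(shuffled))
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])


# Placeholder contexts: 747 units with 6 continuous covariates each.
_rng = random.Random(1)
contexts = [[_rng.gauss(0.0, 1.0) for _ in range(6)] for _ in range(747)]
rows = generate_semi_synthetic(contexts)
train, val, test = split(rows)
```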

E Failure of Standard Offline Bandit Algorithms

Figure 6: Comparing the average reward obtained by policies learned using offline bandit algorithms that do not take IVs into account with a random policy on the aeroplane ticket demand dataset with low-dimensional context.

It has been demonstrated that standard supervised learning that does not take IVs into account fails to learn the causal function or the counterfactual prediction function from a confounded offline dataset Hartford et al. [2017]. Similarly, we demonstrate here that standard offline bandit algorithms also fail to learn meaningful policies from confounded offline datasets. We evaluate the PEVI (also called LinLCB) ** et al. [2021], NeuraLCB Nguyen-Tang et al. [2022], KernLCB Valko et al. [2013], NeuralLinLCB Nguyen-Tang et al. [2022] and NeuralLinGreedy Nguyen-Tang et al. [2022] algorithms. For each of them, we combine the context $C$ and instrument $Z$ variables as the new context input. For algorithms that only support discrete actions, we discretise the action space $\mathcal{A}$ into 20 discrete actions.

For all methods, we follow the network architecture and hyper-parameters from the original papers, and we adopt the implementation of Nguyen-Tang et al. [2022] (https://github.com/thanhnguyentang/offline_neural_bandits). We evaluate these methods on the aeroplane ticket demand dataset described in Section D.1 and compare the average reward obtained by the learned policies with a random policy in Fig. 6. None of the offline bandit algorithms outperforms a random policy, while DML-IV achieves an average reward higher than 1, as shown in Fig. 2(b). This is unsurprising because these bandit methods do not exploit IVs explicitly and are unable to learn the true causal effect of actions.

F Network Structures and Hyper-Parameters

Here, we describe the network architecture and hyper-parameters of all experiments. Unless otherwise specified, all neural network algorithms are optimised using AdamW Loshchilov and Hutter [2017] with learning rate $=0.001$, $\beta=(0.9,0.999)$ and $\epsilon=10^{-8}$. In addition, we set $K=10$ for $K$-fold cross-fitting in DML-IV.
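
For readers unfamiliar with cross-fitting, the following is a generic sketch of a $K$-fold split with $K=10$, in which nuisance estimators would be fitted on $K-1$ folds and evaluated on the held-out fold, as in standard DML. The function name `cross_fit_folds` and the shuffling scheme are assumptions of this sketch; how DML-IV uses the folds is specified in the main text, not here.

```python
import random


def cross_fit_folds(n, k=10, seed=0):
    """Yield (train_idx, held_out_idx) pairs that partition range(n) into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    # Fold j takes every k-th shuffled index, so fold sizes differ by at most one.
    folds = [idx[j::k] for j in range(k)]
    for j in range(k):
        train = [i for m, fold in enumerate(folds) if m != j for i in fold]
        yield train, folds[j]


for train_idx, held_out_idx in cross_fit_folds(5000, k=10):
    # Fit the nuisance estimators on train_idx and evaluate them on
    # held_out_idx (model fitting omitted in this sketch).
    pass
```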

F.1 Aeroplane Ticket Demand Dataset

For DML-IV and CE-DML-IV, we use the network architecture described in Table 1. We use a learning rate of $0.0002$ with a weight decay of $0.001$ (L2 regularisation) and a dropout rate of $\frac{1000}{5000+N}$ that depends on the data size $N$. For DeepGMM, we use the same structure as the outcome network of DML-IV with dropout $=0.1$ and the same learning rate as DML-IV. For DFIV, we follow the original structure proposed in Xu et al. [2020] with regularisers $\lambda_{1}$, $\lambda_{2}$ both set to 0.1 and weight decay of 0.001. For DeepIV, we use the same network architectures as the action network and stage 2 network of DML-IV, with the dropout rate in Hartford et al. [2017] and weight decay of 0.001. For KIV, we use the Gaussian kernel, where the bandwidth is determined by the median trick as originally described by Singh et al. [2019], and we use the random Fourier feature trick with 100 dimensions.
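
The data-dependent dropout rate is easy to evaluate for the sample sizes used in the experiments. The helper name `dropout_rate` is introduced here for illustration only. At $N=5000$ the formula gives 0.1, which agrees with the default dropout listed for Table 6.

```python
def dropout_rate(n):
    """Dropout rate 1000 / (5000 + N) as a function of the data size N."""
    return 1000.0 / (5000.0 + n)


# Rates for the dataset sizes reported in the experiments.
rates = {n: round(dropout_rate(n), 4) for n in (2000, 5000, 10000)}
# rates == {2000: 0.1429, 5000: 0.1, 10000: 0.0667}
```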

Table 1: Network architecture for DML-IV and CE-DML-IV for the aeroplane ticket demand low-dimensional dataset. For the input layer, we provide the input variables. For mixture of Gaussians output, we report the number of components. The dropout rate is given in the main text.

| Layer Type | Configuration |
|---|---|
| Input | $C,Z$ |
| FC + ReLU | in:3 out:128 |
| Dropout | - |
| FC + ReLU | in:128 out:64 |
| Dropout | - |
| FC + ReLU | in:64 out:32 |
| Dropout | - |
| MixtureGaussian | 10 |

(a)

| Layer Type | Configuration |
|---|---|
| Input | $C,Z$ |
| FC + ReLU | in:3 out:128 |
| Dropout | - |
| FC + ReLU | in:128 out:64 |
| Dropout | - |
| FC + ReLU | in:64 out:32 |
| Dropout | - |
| FC | in:32 out:1 |

(b)

| Layer Type | Configuration |
|---|---|
| Input | $C,A$ |
| FC + ReLU | in:3 out:128 |
| Dropout | - |
| FC + ReLU | in:128 out:64 |
| Dropout | - |
| FC + ReLU | in:64 out:32 |
| Dropout | - |
| FC | in:32 out:1 |

(c)

F.2 Aeroplane Ticket Demand with MNIST

For DML-IV and CE-DML-IV, we use a convolutional neural network (CNN) feature extractor, which we denote as ImageFeature, described in Table 2, for all networks. The full network architecture is described in Table 3; we use weight decay of 0.05. For DeepGMM, we use the same structure as the outcome network of DML-IV, with a dropout rate of 0.1 and weight decay of 0.05. For DFIV, we follow the original structure proposed in Xu et al. [2020] with regularisers $\lambda_{1}$, $\lambda_{2}$ both set to 0.1 and weight decay of 0.05. For DeepIV, we use the same network architecture as the action network and stage 2 network of DML-IV, with the dropout rate in Hartford et al. [2017] and weight decay of 0.05. For KIV, we use the Gaussian kernel, where the bandwidth is determined by the median trick as originally described by Singh et al. [2019], and we use the random Fourier feature trick with 100 dimensions.

Table 2: Network architecture of the feature extractor used for the aeroplane ticket demand dataset with MNIST. For each convolution layer, we list the kernel size, input dimension and output dimension, where s stands for stride and p stands for padding. For max-pooling, we provide the size of the kernel. The dropout rate here is set to 0.3. We denote this feature extractor as ImageFeature.

| Layer Type | Configuration |
|---|---|
| Input | $28\times 28$ |
| Conv + ReLU | $3\times 3\times 32$, s:1, p:0 |
| Max Pooling | $2\times 2$, s:2 |
| Dropout | - |
| Conv + ReLU | $3\times 3\times 64$, s:1, p:0 |
| Max Pooling | $2\times 2$, s:2 |
| Dropout | - |
| Conv + ReLU | $3\times 3\times 64$, s:1, p:0 |
| Dropout | - |
| FC + ReLU | in:576, out:64 |

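
The input size of 576 for the final fully connected layer of ImageFeature can be checked from the layer configuration. The sketch below assumes the usual output-size formula for convolutions and floor-mode pooling without padding; the helper names are introduced here for illustration.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution layer."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, kernel, stride):
    """Spatial output size of max pooling (floor mode, no padding)."""
    return (size - kernel) // stride + 1


size = 28                      # 28 x 28 MNIST input
size = conv_out(size, 3)       # 26 after 3 x 3 conv, s:1, p:0
size = pool_out(size, 2, 2)    # 13 after 2 x 2 max pooling, s:2
size = conv_out(size, 3)       # 11
size = pool_out(size, 2, 2)    # 5
size = conv_out(size, 3)       # 3
flattened = size * size * 64   # 3 * 3 * 64 = 576 inputs to the FC layer
```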
Table 3: Network architecture for DML-IV and CE-DML-IV for the aeroplane ticket demand dataset with MNIST. For the input layer, we provide the input variables. For a mixture of Gaussians output, we report the number of components. The dropout rate is given in the main text.

| Layer Type | Configuration |
|---|---|
| Input | ImageFeature$(C),Z$ |
| FC + ReLU | in:66 out:32 |
| Dropout | - |
| MixtureGaussian | 10 |

(d)

| Layer Type | Configuration |
|---|---|
| Input | ImageFeature$(C),Z$ |
| FC + ReLU | in:66 out:32 |
| Dropout | - |
| FC | in:32 out:1 |

(e)

| Layer Type | Configuration |
|---|---|
| Input | ImageFeature$(C),A$ |
| FC + ReLU | in:66 out:32 |
| Dropout | - |
| FC | in:32 out:1 |

(f)

F.3 IHDP and PM-CMR

For the two real-world datasets, we use the same network architectures described in Table 1 as in the low-dimensional ticket demand setting, where the input dimension is increased to 7 for all networks. We use a dropout rate of 0.1 and weight decay of 0.001. For DeepGMM, we use the same structure as the outcome network of DML-IV with dropout $=0.1$. For DFIV, we also use the same network architectures as in the low-dimensional ticket demand setting with regularisers $\lambda_{1}$, $\lambda_{2}$ both set to 0.1 and weight decay of 0.001. For DeepIV, we use the same network architectures as the action network and stage 2 network of DML-IV, with a dropout rate of 0.1 and weight decay of 0.001. For KIV, we use the Gaussian kernel where the bandwidth is determined by the median trick as originally described by Singh et al. [2019], and we use the random Fourier feature trick with 100 dimensions.

F.4 Validation and Hyper-Parameter Tuning

Validation procedures are crucial for tuning DNN hyper-parameters and optimiser parameters. All the DML-IV and CE-DML-IV training stages can be validated by simply evaluating the respective losses on held-out data, as discussed in Hartford et al. [2017]. This allows independent validation and hyper-parameter tuning of the two first stage networks (the action and the outcome networks); the second stage is then validated using the best networks selected in the first stage. This validation procedure guards against the ‘weak instruments’ bias Bound et al. [1995] that can occur when the instruments are only weakly correlated with the action variable (see detailed discussion in Hartford et al. [2017]).

G Additional Experimental Results

In this section, we provide additional experimental results including the effects of weak IVs, performance with tree-based estimators, and a hyperparameter sensitivity analysis.

G.1 Effects of Weak Instruments

When the correlation between the instruments and the endogenous variable (the action in our case) is weak, IV regression methods generally become unreliable Andrews et al. [2019], because the weak correlation induces variance and bias in the first stage estimator and thus bias in the second stage estimator, especially for non-linear IV regressions. In theory, DML-IV should be more resistant to biases in the first stage thanks to the DML framework, as long as the causal effect is identifiable under the weak instrument. Under this identifiability condition, Lemma 3.3 and Theorems 3.4 and 3.5 all hold, and the convergence rate guarantees still apply. However, while causal identifiability with weak instruments has been studied theoretically in the linear setting Andrews et al. [2019], to the best of our knowledge no such theoretical study exists for non-linear IV models, owing to the difficulty of analysing non-linear models and estimators.

Experimentally, for the aeroplane ticket demand dataset, we alter the instrument strength by changing how much the instrument $z$ affects the price $p$. Recall from Section D.1 that $p=25+(z+3)\psi(t)+\omega$, where $\psi$ is a nonlinear function and $\omega$ is the noise. We add an IV strength parameter $\varrho$ such that $p=25+(\varrho\cdot z+3)\psi(t)+\omega$. In Table 4, we present the mean and standard deviation of the MSE of $\hat{h}$ for various IV strengths $\varrho$ from 0.01 to 1 and sample size $N=5000$. DML-IV attains a lower MSE than the other SOTA nonlinear IV regression methods at every IV strength considered.
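
To make the role of $\varrho$ concrete, the sketch below evaluates the modified price equation and the change in price per unit change in $z$, which equals $\varrho\,\psi(t)$. The function names `price` and `instrument_effect` are introduced here for illustration and are not taken from the experiment code.

```python
import math


def psi(t):
    """Non-linear function psi(t) from Section D.1."""
    return 2.0 * ((t - 5.0) ** 4 / 600.0
                  + math.exp(-4.0 * (t - 5.0) ** 2)
                  + t / 10.0 - 2.0)


def price(z, t, omega, varrho=1.0):
    """Ticket price with IV strength parameter varrho."""
    return 25.0 + (varrho * z + 3.0) * psi(t) + omega


def instrument_effect(t, varrho):
    """Change in price when z increases by one unit, with t and omega fixed."""
    return price(1.0, t, 0.0, varrho) - price(0.0, t, 0.0, varrho)


# The effect of z on p shrinks linearly with varrho, so a small varrho
# corresponds to a weak instrument.
effects = {v: instrument_effect(5.0, v) for v in (1.0, 0.8, 0.6, 0.4, 0.2, 0.01)}
```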

| IV Strength | 1.0 | 0.8 | 0.6 | 0.4 | 0.2 | 0.01 |
|---|---|---|---|---|---|---|
| DML-IV | 0.0676(0.0116) | 0.0984(0.0161) | 0.1295(0.0168) | 0.1859(0.0376) | 0.2899(0.0494) | 0.4872(0.1295) |
| CE-DML-IV | 0.0765(0.0119) | 0.1064(0.0120) | 0.1514(0.0203) | 0.2070(0.0329) | 0.3194(0.0572) | 0.5302(0.1625) |
| DeepIV | 0.1213(0.0209) | 0.2039(0.0269) | 0.3051(0.0415) | 0.4476(0.0656) | 0.6891(0.1210) | 0.9293(0.2382) |
| DFIV | 0.1124(0.0481) | 0.1586(0.0320) | 0.3080(0.1907) | 0.8117(0.2779) | 0.9622(0.3892) | 1.6503(0.6845) |
| DeepGMM | 0.2699(0.0522) | 0.3330(0.1171) | 0.4762(0.1056) | 0.8666(0.2248) | 1.0056(0.4334) | 2.0218(0.6555) |
| KIV | 0.2312(0.0272) | 0.3149(0.0218) | 0.4275(0.0368) | 0.6646(0.0538) | 0.8099(0.0657) | 1.226(0.1014) |

Table 4: Results for the low-dimensional ticket demand dataset when the IV is weakly correlated with the action.

G.2 Performance of DML-IV with Tree-Based Estimators

The DML-IV framework allows for general estimators following the Neyman orthogonal score function. While deep learning is flexible and widely used in SOTA non-linear IV regression methods, Gradient Boosting and Random Forests regressors are also good candidate estimators for DML-IV. In addition, as discussed in Lemma 3.3, the convergence rate and suboptimality guarantees in Theorems 3.4 and 3.5 both hold for these tree-based regressors.

Empirically, we replace the DNN estimators in DML-IV, CE-DML-IV and DeepIV with Random Forests and Gradient Boosting regressors (using the scikit-learn implementations). DeepIV is a good baseline for comparison, since it optimises directly using a non-Neyman-orthogonal score and allows for direct replacement of all DNN estimators with tree-based estimators. We use 500 trees for both regressors, with the minimum number of samples required at each leaf node set to 100 for the nuisance parameters and 10 for $\hat{h}$.
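
The stated tree settings can be collected in a small configuration fragment. The dictionary names are introduced here for illustration; the keys `n_estimators` and `min_samples_leaf` are constructor arguments of scikit-learn's `RandomForestRegressor` and `GradientBoostingRegressor`, and all other arguments are assumed to be left at their defaults, which the text does not state.

```python
# Keyword arguments for the tree-based regressors, as described above.
N_TREES = 500

# Settings for the nuisance parameter estimators.
nuisance_kwargs = {"n_estimators": N_TREES, "min_samples_leaf": 100}

# Settings for the estimator of h-hat.
h_hat_kwargs = {"n_estimators": N_TREES, "min_samples_leaf": 10}

# Intended usage (requires scikit-learn, not imported in this fragment):
#   RandomForestRegressor(**nuisance_kwargs)
#   GradientBoostingRegressor(**h_hat_kwargs)
```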

In Table 5, we present the mean and standard deviation of the MSE of $\hat{h}$ with Random Forests and Gradient Boosting estimators on the aeroplane ticket demand dataset with various dataset sample sizes. The results demonstrate the benefits of our Neyman orthogonal score function, and interestingly the performance of Gradient Boosting is comparable to that of the DNN estimators.

| Method | Dataset Size | DNN (results in the paper) | Random Forests | Gradient Boosting |
|---|---|---|---|---|
| DML-IV | 2000 | 0.1308(0.0206) | 0.1689(0.0172) | 0.1301(0.0112) |
| CE-DML-IV | 2000 | 0.1410(0.0246) | 0.1733(0.0198) | 0.1329(0.0125) |
| DeepIV | 2000 | 0.2388(0.0438) | 0.2642(0.0261) | 0.2052(0.0232) |
| DML-IV | 5000 | 0.0676(0.0129) | 0.1067(0.0131) | 0.0632(0.0107) |
| CE-DML-IV | 5000 | 0.0765(0.0119) | 0.1154(0.0138) | 0.0699(0.0069) |
| DeepIV | 5000 | 0.1213(0.0209) | 0.1626(0.0128) | 0.1020(0.0091) |
| DML-IV | 10000 | 0.0378(0.0094) | 0.0657(0.0062) | 0.0482(0.0079) |
| CE-DML-IV | 10000 | 0.0442(0.0070) | 0.0721(0.0039) | 0.0523(0.0059) |
| DeepIV | 10000 | 0.0714(0.0140) | 0.1106(0.0080) | 0.1017(0.0075) |

Table 5: Results for the low-dimensional ticket demand dataset using tree-based estimators compared to DNN estimators.

G.3 Sensitivity Analysis for Different Hyperparameters

The tunable hyperparameters in DML-IV are the learning rate, network width, weight decay and dropout rate (see Section F). As a sensitivity analysis, we provide results for the mean and standard deviation of the MSE of the DML-IV estimator $\hat{h}$ with different hyperparameter values for both the low-dimensional and high-dimensional datasets with sample size $N=5000$ in Table 6 and Table 7. Overall, we see that DML-IV is not very sensitive to small changes of the hyperparameters.

Learning Rate Weight Decay Dropout DNN Width DML-IV CE-DML-IV
0.0002 0.001 0.1 128 0.0676(0.0129) 0.0765(0.0119)
0.0005 0.0752(0.0122) 0.0897(0.0196)
0.0001 0.0703(0.0195) 0.0794(0.0201)
0.0005 0.0794(0.0185) 0.0823(0.0149)
0.005 0.0765(0.0135) 0.0809(0.0159)
0.01 0.0820(0.0162) 0.0865(0.0174)
0.05 0.0715(0.0074) 0.0813(0.0089)
0.2 0.0836(0.0100) 0.0919(0.0157)
64 0.0830(0.0162) 0.0924(0.0121)
256 0.0943(0.0179) 0.0981(0.0126)
0.0005 0.2 0.0805(0.0133) 0.0910(0.0106)
0.005 0.05 0.0672(0.0116) 0.0742(0.0102)
0.01 0.05 0.0825(0.0152) 0.0914(0.0125)
0.2 256 0.0810(0.0129) 0.0852(0.0121)
0.05 64 0.0907(0.0149) 0.0963(0.0161)
0.005 256 0.0939(0.0146) 0.0991(0.0093)
Table 6: Results for the low-dimensional ticket demand dataset for a range of hyperparameter values. The default hyperparameters in this case are: learning rate=0.0002, weight decay=0.001, dropout=0.1 and DNN width=128.
Learning Rate Weight Decay Dropout CNN Channels DML-IV CE-DML-IV
0.001 0.05 0.2 64 0.3513(0.0125) 0.3808(0.0150)
0.0005 0.4063(0.0129) 0.5008(0.0369)
0.002 0.3659(0.0219) 0.4133(0.0267)
0.005 0.3377(0.0218) 0.3555(0.0202)
0.01 0.3935(0.0176) 0.4461(0.0478)
0.02 0.3595(0.03013) 0.3851(0.0293)
0.1 0.4066(0.0172) 0.5160(0.0329)
0.1 0.4136(0.0211) 0.5386(0.0398)
0.3 0.3857(0.0171) 0.4002(0.0249)
128 0.4176(0.01941) 0.5129(0.0630)
256 0.4942(0.0226) 0.6180(0.0396)
0.1 0.1 0.4163(0.0214) 0.5952(0.0343)
0.01 0.3 0.3636(0.0186) 0.3995(0.0250)
0.3 128 0.4006(0.0187) 0.4764(0.0216)
0.3 256 0.3429(0.0215) 0.3971(0.0264)
0.1 256 0.4170(0.0283) 0.5335(0.0371)
Table 7: Results for the high-dimensional ticket demand dataset for a range of hyperparameter values. The default hyperparameters in this case are: learning rate=0.001, weight decay=0.05, dropout=0.2 and CNN channels=64.