Doubly-Robust Off-Policy Evaluation with Estimated Logging Policy

Kyungbok Lee
Graduate School of Data Science
Seoul National University
[email protected]
Myunghee Cho Paik

Shepherd23 Inc.
[email protected]
Abstract

We introduce a novel doubly-robust (DR) off-policy evaluation (OPE) estimator for Markov decision processes, DRUnknown, designed for situations where both the logging policy and the value function are unknown. The proposed estimator initially estimates the logging policy and then estimates the value function model by minimizing the asymptotic variance of the estimator while considering the estimating effect of the logging policy. When the logging policy model is correctly specified, DRUnknown achieves the smallest asymptotic variance within the class containing existing OPE estimators. When the value function model is also correctly specified, DRUnknown is optimal as its asymptotic variance reaches the semiparametric lower bound. We present experimental results conducted in contextual bandits and reinforcement learning to compare the performance of DRUnknown with that of existing methods.

1 Introduction

In various decision-making problems, estimating the value of a policy, that is, its expected reward, is a crucial task. Online evaluation, which requires a comprehensive evaluation of the policy value, can be expensive and may not be applicable to multiple target policies.

Alternatively, off-policy evaluation (OPE) refers to a technique that estimates the value of a target policy by utilizing log data generated from a different logging policy. This approach has attracted considerable interest in the domains of contextual bandits (CB) (Dudík et al., 2011; Swaminathan et al., 2017) and reinforcement learning (RL) (Precup, 2000; Mahmood et al., 2014; Jiang and Li, 2016).

Several off-policy evaluation algorithms (Dudík et al., 2011; Thomas and Brunskill, 2016; Wang et al., 2017; Farajtabar et al., 2018; Su et al., 2020) currently in use rely on having complete knowledge of the logging policy in order to utilize inverse probability weighting (IPW). However, in situations where information about the data logger is not available, such as when using human decision data, it becomes necessary to estimate the logging policy. Applying existing estimators directly to cases where the logging policy is unknown can compromise the desired properties of these methods, as they fail to take into account the impact of logging policy estimation.

In this paper, we present a novel efficient estimator called DRUnknown, which simultaneously estimates both the logging policy model and the value function model. Without the need for external data, the proposed approach captures the interdependency between the two models.

In conventional statistics, Cao et al. (2009) employ the influence function to estimate parameters in the doubly-robust (DR) estimator for population means when there are missing observations. Their study focuses on situations where data is incomplete and the missingness mechanism is unknown and must be estimated. We extend this concept by applying the influence function approach to the OPE problem. Here, we treat the rewards of unselected actions as missing values and estimate the unknown logging policy. This enables us to derive the asymptotic variance of the OPE estimator and to estimate the parameters that minimize it.

The proposed OPE estimator estimates both the logging policy model and the value function model, and is referred to as doubly-robust because it is consistent if either the logging policy model or the value function model is correctly specified. When the logging policy is correctly specified, the proposed estimator is the most efficient among the class of DR OPE estimators with an estimated logging policy, and is at least as efficient as existing methods. Moreover, if the value function model is also correctly specified, the estimator is locally efficient: it achieves the semiparametric lower bound for the asymptotic variance and is asymptotically optimal. To demonstrate the effectiveness of the proposed estimator, we conduct simulation experiments comparing its performance with previous methods in contextual bandits and reinforcement learning problems.

The main contributions of this paper are as follows:

  • We present a new DR OPE estimator for Markov decision processes, for the case where both the logging policy and the value function are unknown. The proposed estimator is consistent when either the logging policy model or the value function model is correctly specified.

  • We propose a method to estimate both the logging policy model parameter $\phi$ and the value function model parameter $\beta$ simultaneously. We use the maximum likelihood estimator (MLE) $\hat{\phi}$ and obtain $\hat{\beta}$ by minimizing the asymptotic variance, accounting for the impact of estimating $\phi$.

  • The proposed estimator has the smallest asymptotic variance among estimators using the MLE $\hat{\phi}$ when the logging policy model is correctly specified. When the value function model is also correctly specified, the proposed estimator is optimal, as its asymptotic variance reaches the semiparametric lower bound.

  • Simulations on contextual bandits and reinforcement learning problems show that the proposed estimator consistently attains smaller mean-squared errors than the benchmark methods.

2 Problem Setup

In this paper, we model the decision-making problem as a reinforcement learning framework in which the learner’s interaction with the system is represented as a Markov Decision Process (MDP). In this section, we provide definitions of MDP and the corresponding off-policy evaluation problem for our study.

2.1 Markov Decision Processes

An MDP is defined as a tuple $(\mathcal{X}, \mathcal{A}, R, P, P_0, \gamma)$, where $\mathcal{X}$ and $\mathcal{A}$ represent the state space and the finite action space with $|\mathcal{A}| = K$, $R(x,a)$ denotes the distribution of the random variable $r(x,a)$, the bounded immediate reward of taking action $a$ in state $x$, $P(\cdot|x,a)$ is the transition probability distribution, $P_0 : \mathcal{X} \rightarrow [0,1]$ is the initial state distribution, and $\gamma \in [0,1]$ is the discount factor. The learner utilizes a policy $\pi : \mathcal{X} \times \mathcal{A} \rightarrow [0,1]$, a stochastic mapping from states to actions. Here, $\pi(a|x)$ represents the probability of taking action $a$ in state $x$.

Let $\mathcal{H}_T = \{(x_t, a_t, r_t)\}_{t=0}^{T-1}$ represent a $T$-step trajectory of state-action-reward triples generated by policy $\pi$. Denote by $R_{t_1:t_2} = \sum_{t=t_1}^{t_2} \gamma^{t-t_1} r_t$ the discounted return from time step $t_1$ to $t_2$. Our goal is to estimate the policy value, the expected return of trajectories generated by policy $\pi$, i.e., $V^{\pi,T} = \mathbb{E}_{\pi}[R_{0:T-1}]$.
Here $\mathbb{E}_{\pi}$ denotes the expectation over trajectories sampled from $\pi$. We also define the value functions of a policy $\pi$ at state $x$ and at state-action pair $(x,a)$ from time step $t$ by $V^{\pi,t}(x) = \mathbb{E}_{\pi}[R_{t:T-1} \mid x_t = x]$ and $Q^{\pi,t}(x,a) = \mathbb{E}_{\pi}[R_{t:T-1} \mid x_t = x, a_t = a]$, respectively.
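As a small illustration of the definition of $R_{t_1:t_2}$, the following sketch computes the discounted return with inclusive bounds and averages $R_{0:T-1}$ over trajectories. The function names `discounted_return` and `monte_carlo_value` are hypothetical and not part of the paper. The average estimates $V^{\pi,T}$ only when the trajectories are sampled from $\pi$ itself, which is exactly the on-policy setting that OPE tries to avoid.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float, t1: int, t2: int) -> float:
    """R_{t1:t2} = sum over t = t1..t2 (inclusive) of gamma^(t - t1) * r_t."""
    return sum(gamma ** (t - t1) * rewards[t] for t in range(t1, t2 + 1))


def monte_carlo_value(reward_sequences: Sequence[Sequence[float]], gamma: float) -> float:
    """Average of R_{0:T-1} over trajectories sampled from pi itself (on-policy)."""
    returns = [discounted_return(rs, gamma, 0, len(rs) - 1) for rs in reward_sequences]
    return sum(returns) / len(returns)


rewards = [1.0, 2.0, 3.0]                     # r_0, r_1, r_2 with T = 3
print(discounted_return(rewards, 0.5, 0, 2))  # R_{0:2}
print(discounted_return(rewards, 0.5, 1, 2))  # R_{1:2}: discounting restarts at t1
```

Note that the exponent is $t - t_1$, so the discounting restarts at the first time step of the window.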

Throughout this paper, we fix the trajectory length as $T$. We also omit $\pi$ from $V^{\pi,t}(x)$ and $Q^{\pi,t}(x,a)$ and write $V^{t}(x)$ and $Q^{t}(x,a)$, respectively, since we are only interested in the value of $\pi$. Note that, unlike the case $T = \infty$, where the $Q$ and $V$ value functions do not depend on the time step, the value functions $Q^{t}$ and $V^{t}$ can in general vary with the time step $t$.

Contextual bandits (CB) are a special case of the MDP in which the trajectory length $T$ equals $1$ and the transition dynamics $P$ do not exist. In CB, the state is called the context, and the action is called the arm.

2.2 Off-Policy Evaluation Problem

In an OPE problem, we are given a dataset $\mathcal{D}_n = \{\mathcal{H}_T^{(i)}\}_{i=1}^{n}$ comprising $n$ independent $T$-step trajectories drawn from a logging policy $\mu$. We use the superscript $(i)$ to refer to the data from the $i^{\text{th}}$ trajectory, denoted as $\mathcal{H}_T^{(i)}$. The objective is to estimate the policy value $V^{\pi}$ of a separate target policy $\pi$.
We define the importance ratio $\rho_t = \pi(a_t|x_t)/\mu(a_t|x_t)$, and the cumulative importance ratio from time step $t_1$ to $t_2$ between $\mu$ and $\pi$ as $\rho_{t_1:t_2} = \prod_{t=t_1}^{t_2} \rho_t$. In cases where $t_1 > t_2$, we define $\rho_{t_1:t_2}$ to be 1.
Similarly, we denote the estimated importance ratio as $\hat{\rho}$, where the probability of the logging policy $\mu(a|x)$ is replaced with its estimated version $\hat{\mu}(a|x;\hat{\phi})$ for the logging policy model $\hat{\mu}(\cdot;\phi)$.
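The cumulative importance ratio can be sketched as follows. The helper name `cumulative_importance_ratio` and the list-based inputs are assumptions made for illustration: `pi_probs[t]` stands for $\pi(a_t|x_t)$ and `mu_probs[t]` for $\mu(a_t|x_t)$ along one logged trajectory. Passing the fitted probabilities $\hat{\mu}(a_t|x_t;\hat{\phi})$ in place of `mu_probs` would give the estimated ratio $\hat{\rho}$.

```python
from typing import Sequence


def cumulative_importance_ratio(
    pi_probs: Sequence[float], mu_probs: Sequence[float], t1: int, t2: int
) -> float:
    """rho_{t1:t2} = prod_{t=t1}^{t2} pi(a_t|x_t) / mu(a_t|x_t); equals 1 if t1 > t2."""
    ratio = 1.0
    for t in range(t1, t2 + 1):  # empty range when t1 > t2, so the product stays 1
        ratio *= pi_probs[t] / mu_probs[t]
    return ratio


pi_probs = [0.5, 0.2]   # target-policy probabilities of the logged actions
mu_probs = [0.25, 0.4]  # logging-policy probabilities of the same actions
print(cumulative_importance_ratio(pi_probs, mu_probs, 0, 1))   # rho_{0:1}
print(cumulative_importance_ratio(pi_probs, mu_probs, 0, -1))  # empty product
```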

The estimate $\widehat{V}$ of the policy value is a point estimator, and there are various metrics to assess its quality. One commonly used metric is the mean squared error (MSE), defined as $\text{MSE}(\widehat{V}) = \mathbb{E}_{\mu}[(\widehat{V} - V^{\pi})^{2}]$, with the expectation taken with respect to the data sampled from $\mu$. The MSE quantifies the performance of a given estimator in finite samples. Another measure is the asymptotic variance, which assesses the efficiency of an estimator with a large number of samples. The estimator $\widehat{V}$ is called asymptotically linear if there exists a function $\psi$ of the trajectory $\mathcal{H}_T$ such that

$$\widehat{V} - V^{\pi} = \frac{1}{n}\sum_{i=1}^{n} \psi(\mathcal{H}_T^{(i)}) + o_p\Bigl(\frac{1}{\sqrt{n}}\Bigr),$$

with $\mathbb{E}_{\mu}[\psi(\mathcal{H}_T)] = 0$ and $\mathbb{E}_{\mu}[\psi(\mathcal{H}_T)^{2}] < \infty$. The quantity $\mathrm{Var}_{\mu}[\psi(\mathcal{H}_T)] = \mathbb{E}_{\mu}[\psi(\mathcal{H}_T)^{2}]$ is called the asymptotic variance of $\widehat{V}$, and the function $\psi(\mathcal{H}_T)$ is referred to as the influence function. In our study, we derive the specific form of the influence function for OPE estimators and conduct an analysis to minimize the asymptotic variance.
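To make these definitions concrete, consider a toy case that is not taken from the paper: a context-free bandit with deterministic rewards and a known logging policy. The IPW estimate $\frac{1}{n}\sum_i \rho^{(i)} r^{(i)}$ is then an exact i.i.d. mean, so its influence function is $\psi(a) = \rho(a) r(a) - V^{\pi}$ and its MSE equals $\mathbb{E}_{\mu}[\psi^{2}]/n$. The sketch below computes the moments of $\psi$ by enumerating the actions. The function name and the numbers are illustrative assumptions.

```python
from typing import Sequence, Tuple


def ipw_influence_moments(
    mu: Sequence[float], pi: Sequence[float], reward: Sequence[float]
) -> Tuple[float, float, float]:
    """Exact mean and second moment of psi(a) = rho(a) * r(a) - V^pi under a ~ mu."""
    actions = range(len(mu))
    v_pi = sum(pi[a] * reward[a] for a in actions)              # true policy value
    psi = [pi[a] / mu[a] * reward[a] - v_pi for a in actions]   # influence function
    mean = sum(mu[a] * psi[a] for a in actions)                 # should be zero
    second_moment = sum(mu[a] * psi[a] ** 2 for a in actions)   # asymptotic variance
    return v_pi, mean, second_moment


v_pi, mean, asym_var = ipw_influence_moments(
    mu=[0.5, 0.5], pi=[0.8, 0.2], reward=[1.0, 0.0]
)
print(v_pi, mean, asym_var)
```

In this toy case the mean of $\psi$ is zero up to floating-point error, and the second moment is the asymptotic variance that the proposed method seeks to minimize in the general setting.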

For our analysis, we use the following standard regularity assumptions for the OPE problem.

Assumption 1 (Absolute Continuity).

For all state-action pairs $(x,a) \in \mathcal{X} \times \mathcal{A}$, if $\pi(a|x) > 0$ then $\mu(a|x) > 0$.

Assumption 2 (Square Integrable).

$\mathbb{E}_{\mu}[\rho_{0:t}^{2}] < \infty$ for all $t < T$.

Assumption 1 is crucial to avoid infinite importance weights, so that OPE estimators employing IPW are well-defined. Assumption 2 guarantees that estimators utilizing the IPW method have finite variance.

3 Previous Methods for Off-Policy Evaluation

The doubly-robust type estimator for the OPE problem was first proposed by Dudík et al. (2011). The estimator depends on a regression model for the value function that is estimated using a separate independent dataset, which may not always be available.

The works of Thomas and Brunskill (2016) and Wang et al. (2017) introduce estimators that improve efficiency by modifying the IPW component of the DR OPE estimator. These methods still require a value function estimated independently from external data. Farajtabar et al. (2018) employ the idea of Rubin and van der Laan (2008), which estimates the value function model by minimizing the estimation variance without an external dataset. All these methods assume that the true logging policy is known and therefore does not need to be estimated.

Several studies (Li et al., 2015; Raghu et al., 2018; Xie et al., 2019; Hanna et al., 2021) have proposed IPW estimators for OPE that utilize the estimated logging policy and have examined the theoretical and numerical properties of these estimators. However, these methods do not consider the value function, so they are not doubly-robust and are suboptimal in terms of asymptotic variance.

3.1 Standard Methods for Off-Policy Evaluation

There are three standard approaches to estimating $V^{\pi}$ in the OPE problem, and we provide a brief overview of these methods.

The first method is the direct method (DM) (Dudík et al., 2011; Rothe, 2016), which directly approximates the state-action value function $Q(x,a)$ of the target policy. The DM does not use the information from the logging policy $\mu$ and relies heavily on the accuracy of the value function prediction. While the DM estimator often exhibits low variance, it is not guaranteed to be unbiased.

The second method is the inverse probability weighting (IPW) estimator (Horvitz and Thompson, 1952), which utilizes the importance ratio to correct the discrepancy between the target and logging policies. The IPW estimator is unbiased when the logging policy is known, but it can exhibit a large variance when the logging and target policies differ significantly and the importance ratio takes large values. Assumption 1 is required for the IPW estimator to be well-defined: nothing can be learned about a pair $(x,a)$ that is never explored by the logging policy.

The third approach, the doubly-robust (DR) estimator (Cassel et al., 1976; Robins and Rotnitzky, 1995; Dudík et al., 2011; Jiang and Li, 2016), combines the DM and IPW, leveraging their respective strengths to obtain favorable characteristics. The IPW estimator is a special case of DR with the value function fixed to zero. Our work focuses on DR-type estimators, as our main objective is to introduce an efficient method for simultaneously estimating the logging policy and the value function model.
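The three approaches can be compared in a minimal contextual-bandit sketch with a known logging policy. The function name `dm_ipw_dr`, the callable signatures `pi(a, x)`, `mu(a, x)`, `q_hat(x, a)`, and the toy data are all assumptions made for illustration, not the authors' implementation.

```python
from typing import Callable, Hashable, Sequence, Tuple

Sample = Tuple[Hashable, int, float]  # one logged (context x, action a, reward r)


def dm_ipw_dr(
    data: Sequence[Sample],
    pi: Callable[[int, Hashable], float],
    mu: Callable[[int, Hashable], float],
    q_hat: Callable[[Hashable, int], float],
    actions: Sequence[int],
) -> Tuple[float, float, float]:
    """Return the (DM, IPW, DR) estimates of V^pi from logged bandit data."""
    dm = ipw = dr = 0.0
    for x, a, r in data:
        v_hat = sum(pi(b, x) * q_hat(x, b) for b in actions)  # V_hat(x) under pi
        rho = pi(a, x) / mu(a, x)                             # importance ratio
        dm += v_hat                                 # model prediction only
        ipw += rho * r                              # reweighted logged reward
        dr += rho * (r - q_hat(x, a)) + v_hat       # model plus weighted residual
    n = len(data)
    return dm / n, ipw / n, dr / n


data = [(0, 0, 1.0), (0, 1, 0.0)]
pi = lambda a, x: 0.8 if a == 0 else 0.2
mu = lambda a, x: 0.5
zero_model = lambda x, a: 0.0
print(dm_ipw_dr(data, pi, mu, zero_model, actions=[0, 1]))
```

With the value model fixed to zero, the DR term reduces to the IPW term, matching the special case noted above.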

3.2 Doubly-Robust Off-Policy Evaluation with Unknown Logging Policy

The standard DR OPE estimator with known logging policy is given by

$$\widehat{V}^{\text{DR}} = \frac{1}{n}\sum_{i=1}^{n}\sum_{t=0}^{T-1} \gamma^{t} \rho_{0:t-1}^{(i)} \bigl[\rho_{t}^{(i)} \bigl[r_{t}^{(i)} - \widehat{Q}(x_{t}^{(i)}, a_{t}^{(i)}; \beta)\bigr] + \widehat{V}(x_{t}^{(i)}; \beta)\bigr], \tag{1}$$

where $\widehat{Q}(\cdot;\beta)$ is a model of the state-action value function parameterized by $\beta$, and $\widehat{V}(x;\beta) = \sum_{a \in \mathcal{A}} \pi(a|x) \widehat{Q}(x,a;\beta)$.

Here, $\widehat{Q}$ depends solely on the state-action pair. However, as mentioned in Section 2.1, $Q^{t}(x,a)$ may vary across time steps when the horizon length $T$ is finite, even if the state-action pair remains the same. Hence, we cannot ensure that the model $\widehat{Q}$ can express $Q^{t}$ for all $t$ in the OPE problem with a finite $T$. To address this issue, in practice we utilize the function class $\widehat{Q}(t,x,a)$, which takes $t$ as an additional input. Also, to ensure that the class of DR OPE estimators contains the IPW estimator, we assume that the class of $\widehat{Q}$ contains the constant functions. This can be easily satisfied by introducing parameters $\nu_0, \nu_1 \in \mathbb{R}$ to extend the function class to $\nu_0 + \nu_1 \widehat{Q}$.
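Estimator (1) with a time-dependent value model can be sketched as follows. This is an illustrative implementation under assumed conventions: `q_hat(t, x, a)` plays the role of $\widehat{Q}(t,x,a)$, `pi(a, x)` and `mu(a, x)` return action probabilities, and each trajectory is a list of `(x, a, r)` triples. None of these names come from the paper.

```python
from typing import Callable, Hashable, Sequence, Tuple

Step = Tuple[Hashable, int, float]  # one logged (state x_t, action a_t, reward r_t)


def dr_trajectory_estimate(
    trajectories: Sequence[Sequence[Step]],
    pi: Callable[[int, Hashable], float],
    mu: Callable[[int, Hashable], float],
    q_hat: Callable[[int, Hashable, int], float],
    actions: Sequence[int],
    gamma: float,
) -> float:
    """DR estimate in the form of equation (1), with Q_hat taking t as an input."""
    total = 0.0
    for trajectory in trajectories:
        rho_prev = 1.0  # rho_{0:t-1}, defined to be 1 at t = 0
        for t, (x, a, r) in enumerate(trajectory):
            rho_t = pi(a, x) / mu(a, x)
            v_hat = sum(pi(b, x) * q_hat(t, x, b) for b in actions)
            total += gamma ** t * rho_prev * (rho_t * (r - q_hat(t, x, a)) + v_hat)
            rho_prev *= rho_t  # becomes rho_{0:t} for the next step
    return total / len(trajectories)


pi = lambda a, x: 0.8 if a == 0 else 0.2
mu = lambda a, x: 0.5
zero_model = lambda t, x, a: 0.0
trajectory = [(0, 0, 1.0), (0, 1, 2.0)]
print(dr_trajectory_estimate([trajectory], pi, mu, zero_model, [0, 1], gamma=0.5))
```

With the zero model, each term reduces to $\gamma^{t}\rho_{0:t} r_t$, the IPW member of the class that the constant-function assumption is meant to include.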

The DR OPE estimator (1) utilizes the importance ratio computed from the true probability of the logging policy. When the logging policy $\mu$ is unknown, we use the logging policy model $\hat{\mu}(\cdot;\hat{\phi})$ to approximate the unknown $\mu$. We say that the model $\hat{\mu}$ is correctly specified if there exists a value of $\phi$ such that $\hat{\mu}(a|x;\phi) = \mu(a|x)$ for all $x$ and $a$. Because correctly specifying the value function model $\widehat{Q}$ is difficult in many OPE problems, the correct specification of $\hat{\mu}$ is in practice necessary for the consistent estimation of the policy value $V^{\pi}$.

In the following sections, our focus will be on the DR OPE estimator, which estimates both the logging policy parameter $\phi$ and the value function parameter $\beta$. Our goal is to build an efficient OPE estimator that exhibits lower asymptotic variance compared to existing methods.

4 Proposed Method: DRUnknown

In this section, we present our class of DRUnknown estimators. The central concept of DRUnknown is to first obtain the estimate $\hat{\phi}$ of the unknown logging policy parameter for the IPW component of the DR estimator, and then learn the value function parameter estimate $\hat{\beta}$ that minimizes the asymptotic variance of the estimator. We utilize the influence function of the estimator to derive a feasible objective function for minimizing the asymptotic variance of DRUnknown.

4.1 DRUnknown for Contextual Bandits

We first present DRUnknown for the contextual bandits problem with $T = 1$. Consider a logging policy model $\hat{\mu}(a|x;\phi)$ parameterized by $\phi$. Let $\Delta_{a}^{(i)}$ represent the indicator that action $a$ is chosen in trajectory $i$, and denote the partial derivative of $\hat{\mu}$ with respect to $\phi$ as $\dot{\hat{\mu}}(a|x;\phi)$. The maximum-likelihood estimator (MLE) $\hat{\phi} = \hat{\phi}_n$ is the solution of the estimating equation

$$U_n(\phi) = \sum_{i=1}^{n} \sum_{a \in \mathcal{A}} \Delta_{a}^{(i)} \frac{\dot{\hat{\mu}}(a|x^{(i)};\phi)}{\hat{\mu}(a|x^{(i)};\phi)} = 0.$$

If the logging policy model is correctly specified, the estimating equation is unbiased, and $\hat{\phi}$ is consistent for $\phi$.
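For intuition, the estimating equation can be solved numerically in a simple special case. The sketch below assumes a two-action logistic logging policy model with a scalar feature, $\hat{\mu}(1|x;\phi) = \sigma(\phi x)$, for which the summand $\Delta_a \dot{\hat{\mu}}/\hat{\mu}$ simplifies to $(a - \sigma(\phi x))\,x$. The model choice, the function names, and the use of Newton's method are illustrative assumptions rather than the setup used in the paper.

```python
import math
from typing import Sequence


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def score(phi: float, xs: Sequence[float], actions: Sequence[int]) -> float:
    """U_n(phi) = sum_i (a_i - mu_hat(1 | x_i; phi)) * x_i for the logistic model."""
    return sum((a - sigmoid(phi * x)) * x for x, a in zip(xs, actions))


def fit_logging_policy(xs: Sequence[float], actions: Sequence[int], iters: int = 25) -> float:
    """Solve U_n(phi) = 0 by Newton's method, starting from phi = 0."""
    phi = 0.0
    for _ in range(iters):
        probs = [sigmoid(phi * x) for x in xs]
        # Observed information: negative derivative of the score with respect to phi.
        info = sum(p * (1.0 - p) * x * x for p, x in zip(probs, xs))
        if info == 0.0:
            break
        phi += score(phi, xs, actions) / info
    return phi


xs = [1.0, 1.0, 1.0, 1.0]   # constant feature, so phi acts as an intercept
actions = [1, 1, 1, 0]      # action 1 is logged three times out of four
phi_hat = fit_logging_policy(xs, actions)
print(phi_hat, sigmoid(phi_hat), score(phi_hat, xs, actions))
```

With a constant feature, the fitted probability $\sigma(\hat{\phi})$ matches the empirical frequency of action 1, and the score at $\hat{\phi}$ is numerically zero.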

Plugging in the estimated value $\hat{\phi}$, we obtain the following class of DR OPE estimators:

$$\widehat{V}^{\text{DR}}(\beta, \hat{\phi}) = \frac{1}{n}\sum_{i=1}^{n} \Bigl[\frac{\pi^{(i)}}{\hat{\mu}^{(i)}} \bigl(r^{(i)} - \widehat{Q}^{(i)}(\beta)\bigr) + \widehat{V}^{(i)}(\beta)\Bigr],$$

where the notations $\pi^{(i)}$, $\hat{\mu}^{(i)}$, $\widehat{Q}^{(i)}(\beta)$, and $\widehat{V}^{(i)}(\beta)$ denote $\pi(a^{(i)}|x^{(i)})$, $\hat{\mu}(a^{(i)}|x^{(i)};\hat{\phi})$, $\widehat{Q}(x^{(i)}, a^{(i)}; \beta)$, and $\widehat{V}(x^{(i)}; \beta)$, respectively. Henceforth, if there is no ambiguity, we simplify terms containing $x^{(i)}$ and $a^{(i)}$ by using only the superscript $(i)$.
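A minimal sketch of $\widehat{V}^{\text{DR}}(\beta, \hat{\phi})$ for a fixed value model is given below. For simplicity it assumes a tabular logging policy model over discrete contexts, whose MLE is the empirical action frequency within each context. This model choice and all function names are illustrative assumptions, and the value model `q_hat` is held fixed rather than fitted as in DRUnknown.

```python
from collections import Counter
from typing import Callable, Hashable, Sequence, Tuple

Sample = Tuple[Hashable, int, float]  # one logged (context x, action a, reward r)


def fit_tabular_logging_policy(data: Sequence[Sample]) -> Callable[[int, Hashable], float]:
    """Tabular MLE: mu_hat(a | x) is the empirical frequency of a among visits to x."""
    pair_counts = Counter((x, a) for x, a, _ in data)
    context_counts = Counter(x for x, _, _ in data)
    return lambda a, x: pair_counts[(x, a)] / context_counts[x]


def dr_estimated_logging(
    data: Sequence[Sample],
    pi: Callable[[int, Hashable], float],
    q_hat: Callable[[Hashable, int], float],
    actions: Sequence[int],
) -> float:
    """Plug the estimated logging policy into the bandit DR estimator."""
    mu_hat = fit_tabular_logging_policy(data)
    total = 0.0
    for x, a, r in data:
        v_hat = sum(pi(b, x) * q_hat(x, b) for b in actions)
        # mu_hat(a, x) > 0 for every logged pair, so the ratio is well-defined.
        total += pi(a, x) / mu_hat(a, x) * (r - q_hat(x, a)) + v_hat
    return total / len(data)


data = [(0, 0, 1.0), (0, 0, 1.0), (0, 1, 0.0), (0, 0, 1.0)]
pi = lambda a, x: 0.5
print(dr_estimated_logging(data, pi, lambda x, a: 0.0, actions=[0, 1]))
```

Because $\hat{\mu}$ is computed from the same samples that enter the average, the summands are no longer independent, which is the estimating effect that the analysis has to account for.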

The bias and variance of the estimator $\widehat{V}^{\text{DR}}(\beta,\hat{\phi})$ do not have simple expressions in general, even for a fixed $\beta$. The estimator is not a mean of independent and identically distributed (i.i.d.) variables, because the estimated $\hat{\phi}$ appears in the denominator. Consequently, choosing the regression parameter $\beta$ to minimize the MSE of the proposed estimator is difficult. We instead aim to find $\hat{\beta}$ that minimizes the asymptotic variance of the estimator.

To this end, we derive the influence function asymptotically equivalent to $\widehat{V}$ through a Taylor expansion with respect to $\phi$ and $\beta$. Define a new regression function $F$ as the product of the target policy $\pi$ and $\widehat{Q}$, plus an additional term that accounts for the effect of estimating $\phi$:

$$F(x,a;\beta,c,\phi)=\pi(a|x)\widehat{Q}(x,a;\beta)+c^{\top}\dot{\hat{\mu}}(a|x;\phi),$$

for $\beta$ and $c\in\mathbb{R}^{\text{dim}(\phi)}$. The influence function of $\widehat{V}^{\text{DR}}(\hat{\beta},\hat{\phi})$ for an arbitrary estimator $\hat{\beta}$ can be formulated using $F$, as described in the following proposition.

Proposition 1 (Asymptotic Equivalence of $\widehat{V}$ for CB).

Consider an estimator $\hat{\beta}$ converging to some $\beta^{*}$ in probability. If the model $\hat{\mu}(\cdot;\phi)$ is correctly specified, the DR OPE estimator is asymptotically linear with influence function $\eta$:

$$\widehat{V}^{\text{DR}}(\hat{\beta},\hat{\phi})=\widetilde{V}(\beta^{*},c(\beta^{*}))+o_{p}(n^{-1/2})=\frac{1}{n}\sum_{i=1}^{n}\eta^{(i)}(\beta^{*},c(\beta^{*}))+o_{p}(n^{-1/2}),$$

where $c(\beta)$ is a vector that depends solely on $\beta$, and

$$\eta^{(i)}(\beta,c)=\frac{1}{\mu^{(i)}}\bigl[\pi^{(i)}r^{(i)}-F^{(i)}(\beta,c,\phi)\bigr]+\sum_{a\in\mathcal{A}}F(x^{(i)},a;\beta,c,\phi).$$
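A short expansion may help in reading $\eta$. It assumes $\widehat{V}(x;\beta)=\sum_{a\in\mathcal{A}}\pi(a|x)\widehat{Q}(x,a;\beta)$, the usual DR convention, which is not restated in this section. Because $\sum_{a}\hat{\mu}(a|x;\phi)=1$ for every $\phi$, the gradients $\dot{\hat{\mu}}$ sum to zero over actions, so the $c$-term survives only through the logged action:

```latex
\sum_{a\in\mathcal{A}}\dot{\hat{\mu}}(a|x;\phi)
  = \nabla_{\phi}\sum_{a\in\mathcal{A}}\hat{\mu}(a|x;\phi)
  = \nabla_{\phi}\,1 = 0,
\qquad\text{hence}\qquad
\eta^{(i)}(\beta,c)
  = \frac{\pi^{(i)}}{\mu^{(i)}}\bigl(r^{(i)}-\widehat{Q}^{(i)}(\beta)\bigr)
    + \widehat{V}^{(i)}(\beta)
    - c^{\top}\frac{\dot{\hat{\mu}}(a^{(i)}|x^{(i)};\phi)}{\mu^{(i)}}.
```

Under this reading, $\eta^{(i)}$ is the DR summand with the true logging policy, minus $c^{\top}$ times the score of the logging policy model at the logged action; with $c=0$ it reduces to the known-$\mu$ DR summand.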

As the value of the vector $c(\beta^{*})$ is unknown, we estimate both $\beta$ and $c$ by minimizing the variance of $\widetilde{V}(\beta,c)$. Denote the vectors $\vec{F}(x;\beta,c,\phi)=(F(x,a;\beta,c,\phi))_{a\in\mathcal{A}}$ and $\pi\vec{Q}(x)=(\pi(a|x)Q(x,a))_{a\in\mathcal{A}}$, and the gradient matrix $f(x;\beta,c,\phi)=\frac{\partial\vec{F}}{\partial(\beta,c)}$. By the law of total variance, the variance of $\widetilde{V}^{\text{DR}}(\beta,c)$ can be expressed as a squared stochastic seminorm plus a constant independent of $\beta$ and $c$.

Proposition 2 (Variance of $\widetilde{V}^{\text{DR}}$).

The variance of $\widetilde{V}(\beta,c)$ is given by

$$n\mathrm{Var}(\widetilde{V}(\beta,c))=\mathrm{Var}(V^{\pi}(x))+\mathbb{E}_{\mu}\left\|\pi\vec{r}-\pi\vec{Q}(x)\right\|_{M_{\mu}}^{2}+\mathbb{E}_{\mu}\left\|\vec{F}(x;\beta,c,\phi)-\pi\vec{Q}(x)\right\|_{M_{\mu}}^{2}$$

where $M_{\mu}=\operatorname{diag}(\mu(a|x)^{-1})_{a\in\mathcal{A}}-J_{K}$ and $J_{K}\in\mathbb{R}^{K\times K}$ is the matrix of ones.
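To make the weight matrix $M_{\mu}$ concrete, here is a small NumPy sketch for a single context with $K$ actions. The helper names `m_matrix` and `sq_seminorm` are hypothetical. It illustrates why $\|\cdot\|_{M_{\mu}}$ is only a seminorm: $v^{\top}M_{\mu}v=\sum_{a}v_{a}^{2}/\mu_{a}-(\sum_{a}v_{a})^{2}$, which is nonnegative by the Cauchy-Schwarz inequality when $\mu$ sums to one, and vanishes for any $v$ proportional to $\mu$.

```python
import numpy as np

def m_matrix(mu):
    """M_mu = diag(1 / mu(a|x)) - J_K for one context x (mu sums to 1)."""
    mu = np.asarray(mu, dtype=float)
    K = mu.shape[0]
    return np.diag(1.0 / mu) - np.ones((K, K))

def sq_seminorm(v, mu):
    """Squared seminorm v^T M_mu v = sum_a v_a^2 / mu_a - (sum_a v_a)^2."""
    v = np.asarray(v, dtype=float)
    return float(v @ m_matrix(mu) @ v)

mu = np.array([0.2, 0.3, 0.5])
# A vector proportional to mu lies in the null space of M_mu.
null_value = sq_seminorm(mu, mu)
# A unit vector on the first action gives 1 / 0.2 - 1.
unit_value = sq_seminorm([1.0, 0.0, 0.0], mu)
```

The null direction is the reason a linear system built from $M_{\mu}$ can be rank deficient for a single context, even though it is typically well posed once summed over many contexts.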

Proposition 2 shows that minimizing the squared seminorm

$$\mathbb{E}_{\mu}\left\|\vec{F}(x;\beta,c,\phi)-\pi\vec{Q}(x)\right\|_{M_{\mu}}^{2}\tag{2}$$

with respect to $\beta$ and $c$ is equivalent to minimizing the variance of $\widetilde{V}(\beta,c)$. Denote the minimizer of (2) by $(\beta_{\text{opt}},c_{\text{opt}})$, which is the zero of its gradient:

$$\mathbb{E}_{\mu}\bigl[f^{\top}(x;\beta,c,\phi)M_{\mu}\bigl(\vec{F}(x;\beta,c,\phi)-\pi\vec{Q}(x)\bigr)\bigr]=0.$$

We can observe that the solution satisfies $c_{\text{opt}}=c(\beta_{\text{opt}})$, so that the variance of $\widetilde{V}(\beta_{\text{opt}},c(\beta_{\text{opt}}))=\widetilde{V}(\beta_{\text{opt}},c_{\text{opt}})$ is minimized at $(\beta_{\text{opt}},c_{\text{opt}})$. Therefore, by Proposition 1, the smallest asymptotic variance of the DR OPE estimator $\widehat{V}^{\text{DR}}(\hat{\beta},\hat{\phi})$ under a correctly specified $\hat{\mu}$ is achieved by any $\hat{\beta}$ converging in probability to $\beta_{\text{opt}}$. The following estimating equation $S_{n}(\beta,c)=0$ is tractable and is solved jointly for $(\beta,c)$, with its solution $\hat{\beta}$ consistent for $\beta_{\text{opt}}$:

$$S_{n}(\beta,c)=\sum_{i=1}^{n}f^{(i)\top}(\beta,c)M^{(i)}\bigl(\vec{F}^{(i)}(\beta,c)-\operatorname{diag}(\pi(a|x^{(i)}))_{a\in\mathcal{A}}\,\vec{r}^{(i)}\bigr)=0\tag{3}$$

for $M^{(i)}=\operatorname{diag}(\hat{\mu}(a|x^{(i)})^{-1})_{a\in\mathcal{A}}-J_{K}$ and the pseudo-reward $\vec{r}^{(i)}=\bigl(\frac{\Delta_{a}^{(i)}}{\hat{\mu}(a|x^{(i)})}r^{(i)}\bigr)_{a\in\mathcal{A}}$. Plugging the solution $\hat{\beta}$ of $S_{n}(\beta,c)=0$ into $\widehat{V}^{\text{DR}}(\hat{\beta},\hat{\phi})$, we obtain our proposed estimator, DRUnknown. The parameter $c$ is auxiliary: it adjusts for the effect of estimating $\hat{\phi}$ and does not appear directly in the final estimator $\widehat{V}^{\text{DR}}(\hat{\beta},\hat{\phi})$.
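For intuition about how (3) can be solved, consider the special case where $\widehat{Q}$ is linear in $\beta$. Then $\vec{F}^{(i)}(\beta,c)=f^{(i)}\theta$ with $\theta=(\beta,c)$, the gradient matrix $f^{(i)}$ does not depend on $\theta$, and (3) reduces to a linear system of generalized least-squares type. The NumPy sketch below covers that case only. The linearity assumption, the function names, and the array layout are illustrative assumptions, not taken from the paper. Since each $M^{(i)}$ is only positive semidefinite, the sketch uses `np.linalg.lstsq`, which returns a minimum-norm solution if the system happens to be rank deficient.

```python
import numpy as np

def weights_and_targets(mu_hat, pi, actions, rewards):
    """Build M^{(i)} and the target vector diag(pi) r_vec^{(i)} for each i.

    mu_hat, pi : (n, K) arrays of mu_hat(a|x_i) and pi(a|x_i)
    actions    : (n,) integer array of logged actions
    rewards    : (n,) array of observed rewards
    """
    n, K = mu_hat.shape
    idx = np.arange(K)
    M = np.zeros((n, K, K))
    M[:, idx, idx] = 1.0 / mu_hat      # diag(1 / mu_hat(a|x_i))
    M -= 1.0                           # subtract J_K, the matrix of ones
    # The pseudo-reward is nonzero only at the logged action.
    y = np.zeros((n, K))
    rows = np.arange(n)
    y[rows, actions] = pi[rows, actions] * rewards / mu_hat[rows, actions]
    return M, y

def estimating_equation(theta, f, M, y):
    """S_n(theta) = sum_i f_i^T M_i (f_i theta - y_i), with F_vec_i = f_i theta."""
    return np.einsum('nkp,nkl,nl->p', f, M, f @ theta - y)

def solve_estimating_equation(f, mu_hat, pi, actions, rewards):
    """Solve S_n(theta) = 0 for theta = (beta, c) in the linear case.

    f : (n, K, p) array; row a of f[i] is assumed to hold
        [pi(a|x_i) * features(x_i, a), grad_phi mu_hat(a|x_i; phi_hat)].
    """
    M, y = weights_and_targets(mu_hat, pi, actions, rewards)
    A = np.einsum('nkp,nkl,nlq->pq', f, M, f)   # sum_i f_i^T M_i f_i
    b = np.einsum('nkp,nkl,nl->p', f, M, y)     # sum_i f_i^T M_i y_i
    theta, *_ = np.linalg.lstsq(A, b, rcond=None)
    return theta
```

With this layout, the first $\dim(\beta)$ entries of the returned vector play the role of $\hat{\beta}$ and the remaining entries the role of the auxiliary $c$; only the former would be plugged into the final estimator. For a nonlinear $\widehat{Q}$, (3) would need an iterative root finder instead.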

The proposed DRUnknown is doubly robust in the statistical sense: it remains consistent if either the logging policy model $\hat{\mu}$ or the value function model $\widehat{Q}$ is correctly specified. We have addressed the scenario with a correct logging policy model above. In the second scenario, the minimizer of (2) is $(\beta_{\text{opt}},c_{\text{opt}})=(\beta_{0},0)$ since $\Gamma(\beta_{0})=0$, where $\beta_{0}$ is the true value of the parameter satisfying $Q(\cdot)=\widehat{Q}(\cdot;\beta_{0})$. Consequently, the solution of (3) converges to $(\beta_{0},0)$, establishing the consistency of DRUnknown. This observation is summarized in the following proposition.

Proposition 3 (Doubly-Robustness).

The DRUnknown estimator $\widehat{V}^{\text{DR}}(\hat{\beta},\hat{\phi})$ for contextual bandits is doubly robust: it converges to $V^{\pi}$ in probability if either the logging policy model $\hat{\mu}$ or the value function model $\widehat{Q}$ is correctly specified.

4.1.1 Intuitive Understanding of DRUnknown for CB: extended function class

For an intuitive understanding of DRUnknown for contextual bandits, suppose that the target policy $\pi$ is always nonzero. Then $F$ can be rewritten as $F(x,a;\beta,c,\phi)=\pi(a|x)\widetilde{Q}(x,a;\beta,c)$ for the extended function class

$$\widetilde{Q}(x,a;\beta,c)=\widehat{Q}(x,a;\beta)+c^{\top}\frac{\dot{\hat{\mu}}(a|x;\phi)}{\pi(a|x)}.$$

The objective function (2) we aim to minimize in DRUnknown can then be expressed as

$$\mathbb{E}_{\mu}\left\|\bigl(\pi(a|x)[\widetilde{Q}(x,a;\beta,c)-Q(x,a)]\bigr)_{a\in\mathcal{A}}\right\|_{M_{\mu}}^{2}.$$

The additional linear term $c^{\top}\frac{\dot{\hat{\mu}}(a|x;\phi)}{\pi(a|x)}$ in $\widetilde{Q}$ can be seen as the projection of $Q^{\pi}$ onto the inverse probability tangent space, the linear space spanned by the score function of $\hat{\phi}$.

Hence, the proposed estimator can be seen as employing a linear parameter $c$ within the expanded function class $\widetilde{Q}$ to remove the effect of estimating $\hat{\phi}$, while seeking the optimal $\hat{\beta}$ for $\widehat{Q}$ that minimizes the asymptotic variance.

4.2 DRUnknown for Reinforcement Learning

The proposed DRUnknown for RL is constructed similarly to the case of CB. However, its analysis is more complex, as the estimator is expressed as a weighted sum of terms observed from time step $0$ to $T-1$. Below, we introduce the construction of DRUnknown for RL, beginning with the estimation of $\hat{\phi}$.

The MLE $\hat{\phi}_{n}=\hat{\phi}$ of $\phi$ for RL is given by the solution of

$$U_{n}(\phi)=\sum_{i=1}^{n}\sum_{t=0}^{T-1}\sum_{a\in\mathcal{A}}\Delta_{a,t}^{(i)}\frac{\dot{\hat{\mu}}(a|x_{t}^{(i)};\phi)}{\hat{\mu}(a|x_{t}^{(i)};\phi)}=0,$$

where $\Delta_{a,t}^{(i)}$ is the indicator that action $a$ was selected at time step $t$ of trajectory $i$. Substituting the MLE $\hat{\phi}$ into the DR estimator (1), the estimator is given by $\widehat{V}^{\text{DR}}(\beta,\hat{\phi})=\sum_{t=0}^{T-1}\gamma^{t}\widehat{V}_{t}(\beta,\hat{\phi})$, with

$$\widehat{V}_{t}(\beta,\hat{\phi})=\frac{1}{n}\sum_{i=1}^{n}\hat{\rho}_{0:t-1}^{(i)}\bigl[\hat{\rho}_{t}^{(i)}[r_{t}^{(i)}-\widehat{Q}^{(i)}(\beta)]+\widehat{V}^{(i)}(\beta)\bigr].$$
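The per-step structure above can be sketched in NumPy as follows. This is an illustrative reading of the displayed formula, not the authors' code. The function name `dr_rl_estimate` is hypothetical, $\hat{\rho}_{t}^{(i)}$ is assumed to be the per-step ratio $\pi(a_{t}^{(i)}|x_{t}^{(i)})/\hat{\mu}(a_{t}^{(i)}|x_{t}^{(i)};\hat{\phi})$, $\hat{\rho}_{0:t-1}^{(i)}$ is assumed to be the product of those ratios with the empty product at $t=0$ equal to one, and `q_hat` and `v_hat` are assumed to hold the model values at each logged step.

```python
import numpy as np

def dr_rl_estimate(rho, rewards, q_hat, v_hat, gamma):
    """Sum over t of gamma^t * V_hat_t, with V_hat_t as in the display above.

    rho     : (n, T) per-step ratios rho_hat_t for each trajectory
    rewards : (n, T) observed rewards r_t
    q_hat   : (n, T) Q_hat at the logged (x_t, a_t)
    v_hat   : (n, T) V_hat at the logged x_t
    gamma   : discount factor
    """
    rho = np.asarray(rho, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    q_hat = np.asarray(q_hat, dtype=float)
    v_hat = np.asarray(v_hat, dtype=float)
    n, T = rho.shape

    # rho_{0:t}: cumulative product of per-step ratios along time.
    cum = np.cumprod(rho, axis=1)
    # rho_{0:t-1}: shift right by one step; the empty product at t = 0 is 1.
    prev = np.concatenate([np.ones((n, 1)), cum[:, :-1]], axis=1)

    # Inner bracket of V_hat_t, weighted by the cumulative ratio up to t - 1.
    per_step = prev * (rho * (rewards - q_hat) + v_hat)
    discounts = gamma ** np.arange(T)
    return float(np.sum(discounts * per_step.mean(axis=0)))
```

With $T=1$ the function coincides with the contextual bandit estimator, and with all ratios equal to one and zero model values it returns the average discounted return.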

The estimator of the value at each time step $t$, $\widehat{V}_{t}(\beta,\hat{\phi})$, takes the same form as DRUnknown for CB, weighted by $\hat{\rho}^{(i)}_{0:t-1}$. However, additional analysis is needed in the case of RL, as the $\widehat{V}_{t}$ for different $t$ are stochastically correlated. As in CB, we define the value regression function $F_{t}(x,a;\beta,c,\phi)$ for each $t$ by

$$F_{t}(x,a;\beta,c,\phi)=\rho_{0:t-1}\pi(a|x)\widehat{Q}(x,a;\beta)+\gamma^{-t}c^{\top}\dot{\hat{\mu}}(a|x;\phi),$$

and we derive the influence function of the proposed estimator as stated in the following proposition.

Proposition 4 (Asymptotic Equivalence of $\widehat{V}$ for RL).

Consider an estimator $\hat{\beta}$ converging to some $\beta^{*}$ in probability. If the logging policy model $\hat{\mu}(\cdot;\phi)$ is correctly specified, then the proposed DR OPE estimator is asymptotically linear, with influence function $\eta=\sum_{t=0}^{T-1}\gamma^{t}\eta_{t}$:

$$\widehat{V}^{\text{DR}}(\hat{\beta},\hat{\phi})=\widetilde{V}(\beta^{*},c(\beta^{*}))+o_{p}(n^{-1/2})=\frac{1}{n}\sum_{i=1}^{n}\eta^{(i)}(\beta^{*},c(\beta^{*}))+o_{p}(n^{-1/2}),\tag{4}$$

where $c(\beta)$ is a vector that depends solely on $\beta$, and

$$\eta_{t}^{(i)}(\beta,c)=\frac{1}{\mu^{(i)}_{t}}\bigl[\rho_{0:t-1}^{(i)}\pi^{(i)}_{t}r_{t}^{(i)}-F_{t}^{(i)}(\beta,c,\phi)\bigr]+\sum_{a\in\mathcal{A}}F_{t}(x_{t}^{(i)},a;\beta,c,\phi).$$

Now, to derive the asymptotic variance of the proposed estimator for RL, we denote the vectors $\vec{F}_{t}(x;\beta,c,\phi)=(F_{t}(x,a;\beta,c,\phi))_{a\in\mathcal{A}}$ and $\pi\vec{Q}^{t}(x)=(\pi(a|x)Q^{t}(x,a))_{a\in\mathcal{A}}$, and the gradient matrix $f_{t}(x;\beta,c,\phi)=\frac{\partial\vec{F}_{t}}{\partial(\beta,c)}$.

The asymptotic variance of the DR OPE estimator for RL can be expressed with these notations as in the following proposition.

Proposition 5 (Variance of $\widetilde{V}$ for RL).

The variance of $\widetilde{V}(\beta,c)$ is given by

$$n\mathrm{Var}(\widetilde{V}(\beta,c))=C_{T}+\sum_{t=0}^{T-1}\gamma^{2t}\mathbb{E}_{\mu}\left\|\vec{F}_{t}(x_{t};\beta,c,\phi)-\rho_{0:t-1}\pi\vec{Q}^{t}(x_{t})\right\|^{2}_{M_{t,\mu}}$$

where $M_{t,\mu}=\operatorname{diag}(\mu(a|x_{t})^{-1})_{a\in\mathcal{A}}-J_{K}$ and

$$C_{T}=\sum_{t=0}^{T-1}\gamma^{2t}\mathbb{E}_{\mu}\bigl[\mathrm{Var}_{t}\left[\rho_{0:t-1}V^{t}(x_{t})\right]+\mathrm{Var}_{t+1}[\rho_{0:t}r_{t}]\bigr],$$

with $\mathrm{Var}_{t}$ the variance conditioned on the history up to time step $t-1$, $\{x_{0},a_{0},\dots,x_{t-1},a_{t-1}\}$.
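To illustrate why the weighted term in Proposition 5 behaves as a seminorm rather than a norm, the following is a minimal sketch. It assumes that $J_{K}$ denotes the $K\times K$ all-ones matrix (the symbol is not defined in this section), and the function name and numerical values are hypothetical.

```python
def weighted_seminorm_sq(v, mu):
    """Return v^T (diag(1/mu) - J_K) v for a probability vector mu.

    Assumption: J_K is the K x K all-ones matrix, so v^T J_K v = (sum of v)^2.
    """
    inverse_weighted = sum(v_a * v_a / mu_a for v_a, mu_a in zip(v, mu))
    total = sum(v)
    return inverse_weighted - total * total


mu = [0.5, 0.3, 0.2]           # hypothetical logging probabilities mu(a | x_t)
residual = [1.0, -2.0, 0.5]    # hypothetical value of F_t - rho * pi * Q^t
generic_value = weighted_seminorm_sq(residual, mu)

# A vector proportional to mu has zero weighted length, hence "seminorm".
null_value = weighted_seminorm_sq([2.0 * m for m in mu], mu)
```

Under this assumption the quadratic form is nonnegative by the Cauchy-Schwarz inequality, and it vanishes only for vectors proportional to $\mu(\cdot|x_{t})$.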

Similar to the case of CB, the proposition implies that minimizing the weighted sum of seminorms with respect to $\beta$ and $c$ is equivalent to minimizing the asymptotic variance of $\widehat{V}^{\text{DR}}$. The solution $\hat{\beta}$ of the following estimating equation is consistent for $\beta_{\text{opt}}$, which minimizes the asymptotic variance of $\widehat{V}^{\text{DR}}$, given by $\sigma^{2}(\beta^{*})=\mathrm{Var}_{\mu}[\eta(\beta^{*},c(\beta^{*}))]$.

$$S_{n}(\beta,c)=\sum_{i=1}^{n}\sum_{t=0}^{T-1}\gamma^{2t}f_{t}^{(i)\top}(\beta,c,\hat{\phi})M_{t}^{(i)}\bigl(\vec{F}_{t}^{(i)}(\beta,c)-\hat{\rho}_{0:t-1}\operatorname{diag}(\pi_{t}^{(i)})_{a\in\mathcal{A}}\,\vec{r}_{t:T-1}^{(i)}\bigr)=0,\tag{5}$$

with the pseudo-reward $\vec{r}^{(i)}_{t:T-1}=\bigl(\frac{\Delta_{a,t}^{(i)}}{\hat{\mu}(a|x_{t}^{(i)})}\bar{R}_{t:T-1}^{(i)}\bigr)_{a\in\mathcal{A}}$ for $\bar{R}_{t:T-1}^{(i)}=r_{t}^{(i)}+\sum_{\tau=t+1}^{T-1}\gamma^{\tau-t}\hat{\rho}_{t+1:\tau}^{(i)}r_{\tau}^{(i)}$, and $M_{t}^{(i)}=\operatorname{diag}(\hat{\mu}(a|x_{t}^{(i)})^{-1})_{a\in\mathcal{A}}-J_{K}$.
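The pseudo-reward can be computed with a single backward pass over a trajectory, since $\bar{R}_{t:T-1}=r_{t}+\gamma\hat{\rho}_{t+1:t+1}\bar{R}_{t+1:T-1}$. The sketch below is one possible way to do this for a single trajectory. It assumes that $\hat{\rho}_{s:\tau}$ is the product of per-step ratios $\pi/\hat{\mu}$ from step $s$ to $\tau$ and that $\Delta_{a,t}$ is the indicator that arm $a$ was taken at step $t$; the function names are hypothetical.

```python
def discounted_weighted_returns(rewards, ratios, gamma):
    """Compute R_bar[t] = r_t + sum_{tau>t} gamma^(tau-t) * rho_{t+1:tau} * r_tau.

    Assumption: ratios[k] = pi(a_k|x_k) / mu_hat(a_k|x_k), and rho_{s:tau}
    is the product of ratios[s..tau].
    """
    horizon = len(rewards)
    r_bar = [0.0] * horizon
    for t in reversed(range(horizon)):
        if t == horizon - 1:
            r_bar[t] = rewards[t]
        else:
            # Backward recursion: one extra discount and one extra ratio per step.
            r_bar[t] = rewards[t] + gamma * ratios[t + 1] * r_bar[t + 1]
    return r_bar


def pseudo_reward_vector(action, mu_hat_probs, r_bar_t):
    """Assumption: Delta_{a,t} is 1 for the logged arm and 0 otherwise."""
    return [r_bar_t / p if a == action else 0.0
            for a, p in enumerate(mu_hat_probs)]


r_bar = discounted_weighted_returns([1.0, 2.0, 3.0], [0.5, 2.0, 0.5], gamma=0.9)
vec = pseudo_reward_vector(action=1, mu_hat_probs=[0.25, 0.5, 0.25],
                           r_bar_t=r_bar[0])
```

The recursion avoids recomputing the products $\hat{\rho}_{t+1:\tau}$ for every pair $(t,\tau)$, so the cost is linear in the horizon $T$.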

5 Theoretical Properties

We now present the theoretical properties of the proposed estimator, with a focus on its asymptotic distribution. Combining Propositions 1, 2, 4, and 5, DRUnknown is asymptotically normal in both contextual bandit and reinforcement learning problems.

Theorem 6 (Asymptotic distribution).

When the logging policy model $\hat{\mu}$ is correctly specified, the proposed DRUnknown estimator is asymptotically normal, given by

$$\sqrt{n}(\widehat{V}^{\text{DR}}(\hat{\beta},\hat{\phi})-V^{\pi})\xrightarrow{d}\mathcal{N}(0,\sigma^{2})$$

for $\sigma^{2}=\sigma^{2}(\beta^{*})$.

To calculate the confidence interval of $V^{\pi}$, one needs to estimate the unknown variance $\sigma^{2}$. As we know that

$$\begin{split}\sigma^{2}&=\sigma^{2}(\beta^{*},c(\beta^{*}),\phi)\\&=n\mathrm{Var}_{\mu}[\widetilde{V}(\beta^{*},c(\beta^{*}))]=\mathrm{Var}_{\mu}[\eta(\beta^{*},c(\beta^{*}))],\end{split}$$

the estimator $\hat{\sigma}^{2}(\beta,c(\beta^{*}),\phi)=\frac{1}{n}\sum_{i=1}^{n}(\eta^{(i)}(\hat{\beta},\hat{c},\hat{\phi})-\bar{\eta})^{2}$ is consistent for $\sigma^{2}$, where $\bar{\eta}=\frac{1}{n}\sum_{i=1}^{n}\eta^{(i)}(\hat{\beta},\hat{c},\hat{\phi})$ is the average of the $\eta^{(i)}$, $\hat{\beta}$ and $\hat{c}$ are the solutions of (5), and $\hat{\phi}$ is the MLE. By Slutsky's theorem, we have

$$\hat{\sigma}^{2}=\hat{\sigma}^{2}(\hat{\beta},\hat{c},\hat{\phi})\xrightarrow{p}\sigma^{2},$$

and combining this result with Theorem 6 gives the $(1-\alpha)$ confidence interval

$$\bigl[\widehat{V}^{\text{DR}}(\hat{\beta},\hat{\phi})-z_{\alpha/2}\frac{\hat{\sigma}}{\sqrt{n}},\ \widehat{V}^{\text{DR}}(\hat{\beta},\hat{\phi})+z_{\alpha/2}\frac{\hat{\sigma}}{\sqrt{n}}\bigr],$$

with $z_{\alpha/2}$ the upper $\alpha/2$ quantile of the standard Gaussian distribution.
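As a rough illustration of the interval construction, the sketch below computes the plug-in variance and a Wald-type interval from a list of per-sample terms $\eta^{(i)}$. It assumes that the point estimate equals the sample mean of the $\eta^{(i)}$ and uses the standard error $\sqrt{\hat{\sigma}^{2}/n}$ implied by the $\sqrt{n}$ scaling in Theorem 6; the function name and the input values are hypothetical.

```python
import math
from statistics import NormalDist, fmean


def dr_confidence_interval(eta, alpha=0.05):
    """Wald-type (1 - alpha) interval from per-sample terms eta^(i).

    Assumption: the point estimate is the sample mean of eta, and
    sigma_hat^2 is the plug-in variance with divisor n.
    """
    n = len(eta)
    v_hat = fmean(eta)
    sigma2_hat = sum((e - v_hat) ** 2 for e in eta) / n
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # upper alpha/2 quantile
    half_width = z * math.sqrt(sigma2_hat / n)
    return v_hat - half_width, v_hat + half_width


lower, upper = dr_confidence_interval([1.0, 2.0, 3.0, 4.0], alpha=0.05)
```

With only four values the normal approximation is of course not reliable; the example is meant to show the arithmetic, not a realistic sample size.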

When the value function model $\widehat{Q}$ is also correctly specified, the proposed DRUnknown is asymptotically equivalent to the DR OPE estimator using the true logging policy and the value function, and is locally efficient.

Proposition 7 (Local Efficiency).

When both the logging policy model $\hat{\mu}$ and the value function model $\widehat{Q}$ are correctly specified, the asymptotic variance of the proposed estimator achieves the semiparametric lower bound and is asymptotically optimal.

Proposition 7 states that DRUnknown can be asymptotically optimal when the model $\widehat{Q}$ is properly chosen. Also, for any choice of value function model, whether it is correct or not, the proposed estimator is asymptotically at least as efficient as existing OPE algorithms, including IPW, DR, and MRDR, in the OPE problem with an unknown logging policy.

Proposition 8 (Intrinsic Efficiency).

The proposed DR OPE estimator has the smallest asymptotic variance among the class of DR OPE estimators using the MLE for $\phi$, when the logging policy model $\hat{\mu}$ is correctly specified. The same holds for an arbitrary consistent estimator of $\phi$.

Corollary 9 (Comparison to Existing Algorithms).

The proposed estimator has an asymptotic variance no larger than that of IPW, DR, and MRDR utilizing the same estimator for $\phi$, when the logging policy model $\hat{\mu}$ is correctly specified.

6 Experimental Results

In this section, we compare the performance of four estimators on CB and RL problems: (i) IPW (Horvitz and Thompson, 1952), (ii) MLIPW (Xie et al., 2019), (iii) MRDR (Farajtabar et al., 2018), and (iv) the proposed DRUnknown. IPW requires knowledge of the true logging policy $\mu$ and cannot be applied in our experimental scenario. It therefore serves as a baseline for the other three estimators, and we report the MSE of each estimator relative to that of IPW.

6.1 Contextual Bandits

6.1.1 Simulation Data

We use the following simulation environment. We generate $N=100$ test datasets of size $n$, given by $D_{j}=\{(x_{i}^{j},a_{i}^{j},r_{i}^{j})\}_{i=1}^{n},~j\in[N]$. The context $x_{i}^{j}$ for the $i^{\text{th}}$ sample from the $j^{\text{th}}$ dataset contains $K=10$ vectors of dimension $d=5$, randomly sampled from the uniform distribution $U(-1/\sqrt{d},1/\sqrt{d})$.
The rewards are generated from the Gaussian distribution with mean $\exp(x_{i}^{j\top}\beta)$ and variance 1, where $\beta$ is a fixed coefficient also sampled from the uniform distribution $U(-1/\sqrt{d},1/\sqrt{d})$.

The logging policy $\mu$ and target policy $\pi$ follow the linear logistic model with random coefficients $\phi_{\mu}$ and $\phi_{\pi}$, as $\mu(a|x)=\frac{\exp(x_{a}^{\top}\phi_{\mu})}{\sum_{i=1}^{K}\exp(x_{i}^{\top}\phi_{\mu})}$ and $\pi(a|x)=\frac{\exp(x_{a}^{\top}\phi_{\pi})}{\sum_{i=1}^{K}\exp(x_{i}^{\top}\phi_{\pi})}$.
The logging policy model $\hat{\mu}$ also follows the linear logistic model, while the value function model $\hat{Q}$ is defined by the linear regression model. Consequently, only the logging policy model is correctly specified, since the true mean reward is exponential rather than linear in the context.
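The data-generating process described above could be simulated along the following lines. This is a sketch under stated assumptions rather than the code used for the experiments: the reward mean is taken to depend on the context vector of the chosen arm, the coefficient $\phi_{\mu}$ is drawn from the same uniform distribution as $\beta$ (its distribution is not specified here), and all function names are hypothetical.

```python
import math
import random


def softmax_policy(contexts, phi):
    """Linear logistic policy: probability of arm a is softmax of x_a^T phi."""
    scores = [sum(x_j * p_j for x_j, p_j in zip(x, phi)) for x in contexts]
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def sample_round(rng, beta, phi_mu, d=5, K=10):
    """Draw one (context, action, reward) triple under the logging policy."""
    bound = 1.0 / math.sqrt(d)
    contexts = [[rng.uniform(-bound, bound) for _ in range(d)] for _ in range(K)]
    probs = softmax_policy(contexts, phi_mu)
    action = rng.choices(range(K), weights=probs)[0]
    # Assumption: the reward mean uses the chosen arm's context vector.
    mean = math.exp(sum(x * b for x, b in zip(contexts[action], beta)))
    reward = rng.gauss(mean, 1.0)
    return contexts, action, reward, probs


rng = random.Random(0)
d, K = 5, 10
bound = 1.0 / math.sqrt(d)
beta = [rng.uniform(-bound, bound) for _ in range(d)]
phi_mu = [rng.uniform(-bound, bound) for _ in range(d)]  # assumed distribution
contexts, action, reward, probs = sample_round(rng, beta, phi_mu, d, K)
```

Repeating `sample_round` $n$ times yields one dataset $D_{j}$, and the target policy can be obtained by calling `softmax_policy` with a different coefficient $\phi_{\pi}$.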

Table 1 and Figure 1 report the relative MSE values of MLIPW, MRDR, and the proposed DRUnknown, calculated by dividing their MSE by the MSE of IPW, for different dataset sizes $n$. The proposed estimator achieves the lowest relative MSE at every sample size. Figure 2 displays boxplots of the values estimated by the four methods, each computed with $N=100$ repeats and a dataset size of $n=10000$.
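For concreteness, the relative MSE can be computed from repeated estimates as in the short sketch below. The function names and the toy numbers are hypothetical and are not taken from the experiments.

```python
def mean_squared_error(estimates, true_value):
    """Average squared deviation of repeated estimates from the true value."""
    return sum((v - true_value) ** 2 for v in estimates) / len(estimates)


def relative_mse(estimates, baseline_estimates, true_value):
    """MSE of an estimator divided by the MSE of the baseline (here, IPW)."""
    return (mean_squared_error(estimates, true_value)
            / mean_squared_error(baseline_estimates, true_value))


# Toy values: two repeats of a candidate estimator and of the baseline.
ratio = relative_mse([1.1, 0.9], [1.2, 0.8], true_value=1.0)
```

A value below 1 means the estimator has a smaller MSE than the IPW baseline that uses the true logging policy.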

sample size   MLIPW    MRDR     DRUnknown
5000          0.8911   0.8126   0.8038
6000          0.8416   0.7957   0.7731
7000          0.8366   0.8043   0.7620
8000          0.8234   0.8073   0.7560
9000          0.8222   0.8200   0.7662
10000         0.7575   0.7369   0.6792
Table 1: The relative MSE of the estimators on the synthetic dataset as a function of the number of samples $n$.
Figure 1: The relative MSE of the estimators on the synthetic dataset as a function of the number of samples $n$.

6.1.2 UCI Machine Learning Repository Dataset

For the real data experiment, we transform six classification datasets from the UCI Machine Learning Repository into contextual bandit problems: glass, letter, zoo, image, iris, and handwritten (German, 1987; Slate, 1991; Forsyth, 1990; mis, 1990; Fisher, 1988; Alpaydin and Kaynak, 1998). Assigning a data point to a class is treated as pulling an arm of the bandit. The reward is 1 when the class is correct and 0 otherwise.
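The classification-to-bandit transformation can be sketched as follows, assuming the policy is represented as a list of class probabilities for the given context. The function name is hypothetical.

```python
import random


def bandit_round(label, policy_probs, rng):
    """Sample an arm (a predicted class) and return the 0/1 bandit reward."""
    arm = rng.choices(range(len(policy_probs)), weights=policy_probs)[0]
    reward = 1 if arm == label else 0  # only the chosen arm's reward is seen
    return arm, reward


rng = random.Random(0)
arm_hit, reward_hit = bandit_round(label=1, policy_probs=[0.0, 1.0, 0.0], rng=rng)
arm_miss, reward_miss = bandit_round(label=0, policy_probs=[0.0, 1.0, 0.0], rng=rng)
```

Only the reward of the selected arm is recorded, which is what turns the fully labeled classification data into partial, bandit-style feedback.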

We construct the logging policy $\mu$ as follows: we train a policy $\mu_{0}$ using a linear logistic model on a separate dataset, and mix it with a random policy $\mu_{\text{random}}$ as $\mu=\alpha\mu_{0}+(1-\alpha)\mu_{\text{random}}$, for $\alpha\in(0,1)$. The target policy $\pi$ is constructed as in the simulation data experiment. The logging policy class is defined as $\{\hat{\mu}=\alpha\mu_{0}+(1-\alpha)\mu_{\text{random}}:\alpha\in(0,1)\}$, where $\alpha$ serves as the parameter $\phi$, and we use the constant value function model for $\widehat{Q}$.
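Because the logging policy class has the single parameter $\alpha$, its maximum likelihood estimate can be found by a one-dimensional search. The sketch below uses a plain grid search, which is an illustrative choice and not necessarily the optimizer used in the experiments. It assumes that $\mu_{\text{random}}$ is uniform over the $K$ arms and that the probabilities $\mu_{0}(a_{i}|x_{i})$ of the logged actions are available; all names are hypothetical.

```python
import math


def mixture_prob(alpha, p0, num_arms):
    """mu(a|x) = alpha * mu_0(a|x) + (1 - alpha) / K, assuming a uniform mix-in."""
    return alpha * p0 + (1.0 - alpha) / num_arms


def fit_alpha(p0_logged, num_arms, grid_size=999):
    """Grid-search MLE of alpha over (0, 1) from mu_0 values of logged actions."""
    best_alpha, best_loglik = None, -math.inf
    for k in range(1, grid_size + 1):
        alpha = k / (grid_size + 1)
        loglik = sum(math.log(mixture_prob(alpha, p, num_arms)) for p in p0_logged)
        if loglik > best_loglik:
            best_alpha, best_loglik = alpha, loglik
    return best_alpha


# Toy log with K = 2: mu_0 puts 0.9 on arm 0, and arm 0 was logged 70 of 100 times.
p0_logged = [0.9] * 70 + [0.1] * 30
alpha_hat = fit_alpha(p0_logged, num_arms=2)
```

In the toy log the empirical frequency of arm 0 is 0.7, which matches $0.9\alpha+0.5(1-\alpha)$ at $\alpha=0.5$, so the grid search should return a value close to 0.5. The log-likelihood is concave in $\alpha$, so a bisection or Newton step would also work.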

We generate $N=100$ datasets, each of size $n=10000$, by randomly sampling contexts from the original dataset and selecting arms using the logging policy $\mu$. Table 2 displays the relative MSE values of MLIPW, MRDR, and our proposed DRUnknown. Additionally, Figure 3 depicts the log-relative MSE of each estimator on the glass dataset as a function of the logging policy parameter $\alpha$. DRUnknown attains the smallest relative MSE on all six datasets.

Figure 2: Boxplot of estimated values from four estimators on simulation data with $N=100$ and $n=10000$. The true target policy value $V^{\pi}$ is indicated by the blue dashed line.
dataset       MLIPW    MRDR     DRUnknown
glass         0.9548   0.8308   0.8125
letter        0.9677   0.9323   0.9265
zoo           0.9987   0.9842   0.9583
image         0.8250   0.8245   0.8062
iris          0.6116   0.5715   0.5454
handwritten   0.9521   0.9432   0.9407
Table 2: The relative MSE of the estimators on UCI datasets, with $\alpha=0.4$ and $n=10000$.
Figure 3: The logarithm of the relative MSE from three estimators on the glass dataset as a function of $\alpha$.

6.2 Reinforcement Learning

In this section, we provide the experimental results of OPE in the context of RL. We conducted the experiments on the ModelWin and ModelFail domains introduced in Thomas and Brunskill (2016). Detailed descriptions of these environments are available in the Appendix. As in Section 6.1.2, we use a mixture of the separately trained optimal policy and the uniform random policy with mixing rate $\alpha\in(0,1)$ as the logging policy $\mu$. We employ a linear logistic policy for the target policy $\pi$ and a linear model as the value function model $\widehat{Q}$.

Table 3 displays the relative MSE of each estimator in the ModelWin environment with $T=20$ and in the ModelFail environment. The proposed DRUnknown attains the lowest relative MSE in every setting except ModelFail with $n=10$, that is, whenever the number of samples $n$ is large enough. Figure 4 illustrates the cumulative distribution function (CDF) of squared error values (larger is better) on the ModelWin domain with $T=20$ and $n=20$. DRUnknown exhibits the highest CDF values, indicating that its estimates are more concentrated around the true value.

6.3 Conclusion

In conclusion, our study has introduced DRUnknown, a novel doubly-robust off-policy evaluation estimator for contextual bandits and reinforcement learning problems where both the logging policy and the value function are unknown. In a two-step estimation process, DRUnknown first estimates the logging policy by maximum likelihood and then estimates the value function model by minimizing the asymptotic variance of the estimator while accounting for the impact of the logging policy estimation.

When the logging policy model is correctly specified, DRUnknown achieves the smallest asymptotic variance within the class that encompasses existing OPE estimators. Furthermore, if the value function model is also correctly specified, DRUnknown attains optimality by reaching the semiparametric lower bound for asymptotic variance.

To show the effectiveness of DRUnknown, we conducted experiments in both contextual bandits and reinforcement learning settings. The empirical results demonstrate the superior performance of DRUnknown compared to existing methods.

7 Broader Impact

This paper addresses the off-policy evaluation problem, aiming to contribute to the field of Machine Learning. It includes theoretical contributions and numerical experimental results, with no apparent societal or ethical issues that require further discussion.

ModelWin      MLIPW    MRDR     DRUnknown
n = 10        1.1604   1.9963   0.8916
n = 20        1.0187   2.5577   0.8273
n = 30        1.1064   1.7349   0.8529
n = 40        1.0676   1.5270   0.9117
ModelFail     MLIPW    MRDR     DRUnknown
n = 10        0.7794   0.7963   0.8476
n = 20        0.5608   0.5591   0.2951
n = 30        0.5532   0.5605   0.3154
n = 40        0.5273   0.5285   0.1451
Table 3: The relative MSE of the estimators on ModelWin and ModelFail environments.
Figure 4: The CDF of squared errors for 4 estimators in ModelWin.

References

  • Dudík et al. (2011) Miroslav Dudík, John Langford, and Lihong Li. Doubly robust policy evaluation and learning. In Proceedings of the 28th International Conference on Machine Learning, pages 1097–1104, 2011.
  • Swaminathan et al. (2017) Adith Swaminathan, Akshay Krishnamurthy, Alekh Agarwal, Miro Dudik, John Langford, Damien Jose, and Imed Zitouni. Off-policy evaluation for slate recommendation. Advances in Neural Information Processing Systems, 30, 2017.
  • Precup (2000) Doina Precup. Eligibility traces for off-policy policy evaluation. Computer Science Department Faculty Publication Series, page 80, 2000.
  • Mahmood et al. (2014) A Rupam Mahmood, Hado P Van Hasselt, and Richard S Sutton. Weighted importance sampling for off-policy learning with linear function approximation. Advances in neural information processing systems, 27, 2014.
  • Jiang and Li (2016) Nan Jiang and Lihong Li. Doubly robust off-policy value evaluation for reinforcement learning. In International Conference on Machine Learning, pages 652–661. PMLR, 2016.
  • Thomas and Brunskill (2016) Philip Thomas and Emma Brunskill. Data-efficient off-policy policy evaluation for reinforcement learning. In International Conference on Machine Learning, pages 2139–2148. PMLR, 2016.
  • Wang et al. (2017) Yu-Xiang Wang, Alekh Agarwal, and Miroslav Dudík. Optimal and adaptive off-policy evaluation in contextual bandits. In International Conference on Machine Learning, pages 3589–3597. PMLR, 2017.
  • Farajtabar et al. (2018) Mehrdad Farajtabar, Yinlam Chow, and Mohammad Ghavamzadeh. More robust doubly robust off-policy evaluation. In International Conference on Machine Learning, pages 1447–1456. PMLR, 2018.
  • Su et al. (2020) Yi Su, Maria Dimakopoulou, Akshay Krishnamurthy, and Miroslav Dudík. Doubly robust off-policy evaluation with shrinkage. In International Conference on Machine Learning, pages 9167–9176. PMLR, 2020.
  • Cao et al. (2009) Weihua Cao, Anastasios A Tsiatis, and Marie Davidian. Improving efficiency and robustness of the doubly robust estimator for a population mean with incomplete data. Biometrika, 96(3):723–734, 2009.
  • Rubin and van der Laan (2008) Daniel B Rubin and Mark J van der Laan. Empirical efficiency maximization: improved locally efficient covariate adjustment in randomized experiments and survival analysis. The International Journal of Biostatistics, 4(1), 2008.
  • Li et al. (2015) Lihong Li, Rémi Munos, and Csaba Szepesvári. Toward minimax off-policy value estimation. In Artificial Intelligence and Statistics, pages 608–616. PMLR, 2015.
  • Raghu et al. (2018) Aniruddh Raghu, Omer Gottesman, Yao Liu, Matthieu Komorowski, Aldo Faisal, Finale Doshi-Velez, and Emma Brunskill. Behaviour policy estimation in off-policy policy evaluation: Calibration matters. arXiv preprint arXiv:1807.01066, 2018.
  • Xie et al. (2019) Yuan Xie, Boyi Liu, Qiang Liu, Zhaoran Wang, Yuan Zhou, and Jian Peng. Off-policy evaluation and learning from logged bandit feedback: Error reduction via surrogate policy. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019. URL https://openreview.net/forum?id=HklKui0ct7.
  • Hanna et al. (2021) Josiah P Hanna, Scott Niekum, and Peter Stone. Importance sampling in reinforcement learning with an estimated behavior policy. Machine Learning, 110(6):1267–1317, 2021.
  • Rothe (2016) Christoph Rothe. The value of knowing the propensity score for estimating average treatment effects. Available at SSRN 2797560, 2016.
  • Horvitz and Thompson (1952) Daniel G Horvitz and Donovan J Thompson. A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 47(260):663–685, 1952.
  • Cassel et al. (1976) Claes M Cassel, Carl E Särndal, and Jan H Wretman. Some results on generalized difference estimation and generalized regression estimation for finite populations. Biometrika, 63(3):615–620, 1976.
  • Robins and Rotnitzky (1995) James M Robins and Andrea Rotnitzky. Semiparametric efficiency in multivariate regression models with missing data. Journal of the American Statistical Association, 90(429):122–129, 1995.
  • German (1987) B. German. Glass Identification. UCI Machine Learning Repository, 1987. DOI: https://doi.org/10.24432/C5WW2P.
  • Slate (1991) David Slate. Letter Recognition. UCI Machine Learning Repository, 1991. DOI: https://doi.org/10.24432/C5ZP40.
  • Forsyth (1990) Richard Forsyth. Zoo. UCI Machine Learning Repository, 1990. DOI: https://doi.org/10.24432/C5R59V.
  • Image Segmentation (1990) Image Segmentation. UCI Machine Learning Repository, 1990. DOI: https://doi.org/10.24432/C5GP4N.
  • Fisher (1988) R. A. Fisher. Iris. UCI Machine Learning Repository, 1988. DOI: https://doi.org/10.24432/C56C76.
  • Alpaydin and Kaynak (1998) E. Alpaydin and C. Kaynak. Optical Recognition of Handwritten Digits. UCI Machine Learning Repository, 1998. DOI: https://doi.org/10.24432/C50P49.
  • Kallus and Uehara (2020) Nathan Kallus and Masatoshi Uehara. Double reinforcement learning for efficient and robust off-policy evaluation. In International Conference on Machine Learning, pages 5078–5088. PMLR, 2020.

Appendix A Missing Proofs

A.1 Proof of Proposition 1

Proof.

By the Taylor expansion,

\[
\widehat{V}^{\text{DR}}(\hat{\beta},\hat{\phi})=V(\beta^{*})+\mathbb{E}_{\mu}\Bigl[\frac{\partial\widehat{V}}{\partial\beta}\Bigr]^{\top}(\hat{\beta}-\beta^{*})+\mathbb{E}_{\mu}\Bigl[\frac{\partial\widehat{V}}{\partial\phi}\Bigr]^{\top}(\hat{\phi}-\phi)+o_{p}(n^{-1/2}),
\]

where

\[
V(\beta)=\frac{1}{n}\sum_{i=1}^{n}\Bigl[\frac{\pi^{(i)}}{\mu^{(i)}}\bigl(r^{(i)}-\widehat{Q}^{(i)}(\beta)\bigr)+\widehat{V}^{(i)}(\beta)\Bigr]
\]

and $\hat{\beta}\xrightarrow{p}\beta^{*}$ for some $\beta^{*}$. As $\hat{\mu}$ is correctly specified, we have

\[
\mathbb{E}_{\mu}\Bigl[\frac{\partial\widehat{V}}{\partial\beta}\Bigr]=\frac{\partial}{\partial\beta}\mathbb{E}_{\mu}\Bigl[-\frac{\pi(a|x)}{\mu(a|x)}\widehat{Q}(x,a;\beta^{*})+\widehat{V}(x;\beta^{*})\Bigr]=0.
\]
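The identity above can be checked numerically for a single context. The sketch below assumes, as is common for doubly-robust estimators but is not restated in this appendix, that $\widehat{V}(x;\beta)=\sum_{a}\pi(a|x)\widehat{Q}(x,a;\beta)$. The function name `model_term_mean` and the random inputs are purely illustrative. It averages $-\frac{\pi}{\mu}\widehat{Q}+\widehat{V}$ over $a\sim\mu$ and suggests that the mean is zero for an arbitrary $\widehat{Q}$, which is why the derivative with respect to $\beta$ vanishes when the true $\mu$ is used.

```python
import numpy as np


def model_term_mean(mu, pi, q_hat):
    """Mean over a ~ mu of -(pi/mu) * q_hat + v_hat for one fixed context.

    Assumes v_hat = sum_a pi(a) * q_hat(a); this is an assumption of the
    sketch, not something stated in this appendix.
    """
    v_hat = np.sum(pi * q_hat)               # plug-in value under the target policy
    per_action = -(pi / mu) * q_hat + v_hat  # term inside the expectation
    return np.sum(mu * per_action)           # expectation over the logging policy


rng = np.random.default_rng(0)
num_actions = 4
mu = rng.dirichlet(np.full(num_actions, 5.0))  # logging policy at one context
pi = rng.dirichlet(np.full(num_actions, 5.0))  # target policy at the same context

# Two unrelated reward models: the mean is zero (up to rounding) for both.
means = [model_term_mean(mu, pi, rng.normal(size=num_actions)) for _ in range(2)]
print(means)
```

Because the mean does not depend on $\widehat{Q}$, it is constant in $\beta$, so no regularity of the reward model beyond differentiability is needed for this step.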

For the following term,

\[
-\mathbb{E}_{\mu}\Bigl[\frac{\partial\widehat{V}}{\partial\phi}\Bigr]=-\mathbb{E}_{\mu}\Bigl[-\frac{\pi\dot{\hat{\mu}}}{\mu^{2}}\bigl(r-\widehat{Q}(x,a;\beta^{*})\bigr)\Bigr]=\mathbb{E}_{\pi}\Bigl[-\frac{\dot{\hat{\mu}}}{\mu}\bigl(r-\widehat{Q}(x,a;\beta^{*})\bigr)\Bigr]:=\Gamma(\beta).
\]

For $\hat{\phi}-\phi$, from $U_{n}(\hat{\phi})=0$ we have $0=U(\phi)+\mathbb{E}[\dot{U}(\phi)](\hat{\phi}-\phi)+o_{p}(n^{-1/2})$, where

\[
\mathbb{E}[\dot{U}(\phi)]=\sum_{a\in\mathcal{A}}\mathbb{E}_{\mu}\Bigl[\frac{1}{\mu(a|x)^{2}}\dot{\hat{\mu}}(a|x;\phi)\dot{\hat{\mu}}(a|x;\phi)^{\top}\Bigr]:=\Sigma_{\phi\phi}
\]

and

\[
\hat{\phi}-\phi=-\Sigma_{\phi\phi}^{-1}U(\phi)+o_{p}(n^{-1/2}).
\]

Plugging in the results above and denoting $c(\beta)=\Sigma_{\phi\phi}^{-1}\Gamma(\beta)$, we have

\[
\begin{split}
\widehat{V}^{\text{DR}}(\hat{\beta},\hat{\phi})&=V(\beta^{*})+\Gamma(\beta^{*})^{\top}\Sigma_{\phi\phi}^{-1}U(\phi)+o_{p}(n^{-1/2})\\
&=V(\beta^{*})+c(\beta^{*})^{\top}U(\phi)+o_{p}(n^{-1/2})\\
&=\frac{1}{n}\sum_{i=1}^{n}\eta^{(i)}(\beta^{*},c(\beta^{*}))+o_{p}(n^{-1/2}),
\end{split}
\]

for $\eta^{(i)}(\beta,c)=\frac{1}{\mu(a^{(i)}|x^{(i)})}\bigl[\pi(a^{(i)}|x^{(i)})r^{(i)}-F(x,a;\beta,c,\phi)\bigr]+\sum_{a\in\mathcal{A}}F(x^{(i)},a;\beta,c,\phi)$, since $\sum_{a\in\mathcal{A}}\dot{\hat{\mu}}(a|x;\phi)=0$.

A.2 Proof of Proposition 2

Proof.

By the law of total variance, we decompose the variance of $\widetilde{V}(\beta,c)$ with $n=1$, given contexts $x$ and rewards $r$:

\[
\mathrm{Var}\bigl(\widetilde{V}(\beta,c)\bigr)=\mathrm{Var}\,\mathbb{E}\bigl(\widetilde{V}(\beta,c)\,\big|\,x,r\bigr)+\mathbb{E}\,\mathrm{Var}\bigl(\widetilde{V}(\beta,c)\,\big|\,x,r\bigr).
\]

For the first term, we have $\mathbb{E}_{\mu}\bigl(\widetilde{V}(\beta,c)\,\big|\,x,r\bigr)=V^{\pi}(x)$, and for the second term,

\[
\begin{split}
&\mathrm{Var}\bigl(\widetilde{V}(\beta,c)\,\big|\,x,r\bigr)=\mathrm{Var}_{\mu}\Bigl[\frac{1}{\mu(a|x)}\bigl(\pi(a|x)r-F(x,a;\beta,c,\phi)\bigr)\Bigm|x,r\Bigr]\\
&=\mathbb{E}_{\mu}\Bigl[\frac{1}{\mu(a|x)^{2}}\bigl(\pi(a|x)r-F(x,a;\beta,c,\phi)\bigr)^{2}\Bigm|x,r\Bigr]-\mathbb{E}_{\mu}\Bigl[\frac{1}{\mu(a|x)}\bigl(\pi(a|x)r-F(x,a;\beta,c,\phi)\bigr)\Bigm|x,r\Bigr]^{2}\\
&=\sum_{a\in\mathcal{A}}\frac{1}{\mu(a|x)}\bigl(\pi(a|x)r-F(x,a;\beta,c,\phi)\bigr)^{2}-\Bigl[\sum_{a\in\mathcal{A}}\bigl(\pi(a|x)r-F(x,a;\beta,c,\phi)\bigr)\Bigr]^{2}\\
&=\left\|\vec{F}(x;\beta,c,\phi)-\pi\vec{r}\right\|_{M_{\mu}}^{2}
\end{split}
\]

and since $\mathbb{E}[\pi\vec{Q}(x)-\pi\vec{r}]=0$, we have

\[
\begin{split}
n\mathrm{Var}\bigl(\widetilde{V}(\beta,c)\bigr)&=\mathrm{Var}(V^{\pi}(x))+\mathbb{E}_{\mu}\left\|\vec{F}(x;\beta,c,\phi)-\pi\vec{r}\right\|_{M_{\mu}}^{2}\\
&=\mathrm{Var}(V^{\pi}(x))+\mathbb{E}_{\mu}\left\|\vec{F}(x;\beta,c,\phi)-\pi\vec{Q}(x)\right\|_{M_{\mu}}^{2}+\mathbb{E}_{\mu}\left\|\pi\vec{Q}(x)-\pi\vec{r}\right\|_{M_{\mu}}^{2}.
\end{split}
\]
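Both steps of this proof can be checked numerically at a single context. The sketch below is illustrative only. It assumes that $M_{\mu}=\mathrm{diag}(1/\mu)-\mathbf{1}\mathbf{1}^{\top}$, which is the matrix implied by the last two lines of the conditional-variance derivation. It also uses a hypothetical noise model in which each action's reward equals its mean plus an independent symmetric $\pm$ perturbation, so that $\mathbb{E}[\vec{r}\,|\,x]=\vec{Q}(x)$. The names `m_norm_sq` and `variance_from_definition` are not from the paper.

```python
import itertools

import numpy as np


def m_norm_sq(mu, v):
    """Squared norm v^T M_mu v, assuming M_mu = diag(1/mu) - 1 1^T."""
    return np.sum(v ** 2 / mu) - np.sum(v) ** 2


def variance_from_definition(mu, g):
    """Variance of g[a] / mu[a] when the action a is drawn from mu."""
    z = g / mu
    mean = np.sum(mu * z)
    return np.sum(mu * (z - mean) ** 2)


rng = np.random.default_rng(0)
k = 3
mu = rng.dirichlet(np.full(k, 5.0))  # logging policy at one context
pi = rng.dirichlet(np.full(k, 5.0))  # target policy at the same context
q = rng.normal(size=k)               # mean reward per action, Q(x, .)
f = rng.normal(size=k)               # arbitrary F(x, .; beta, c, phi)
noise_scale = 0.7                    # hypothetical noise magnitude

# Rewards r = q + noise with independent symmetric +/- noise per action,
# so all reward vectors below are equally likely and E[r | x] = q.
reward_vectors = [q + noise_scale * np.array(signs)
                  for signs in itertools.product([-1.0, 1.0], repeat=k)]

# Step 1: conditional variance identity for one fixed reward vector.
r0 = reward_vectors[0]
direct = variance_from_definition(mu, pi * r0 - f)
closed_form = m_norm_sq(mu, f - pi * r0)

# Step 2: the expected squared norm splits into a model part and a noise part.
total = np.mean([m_norm_sq(mu, f - pi * r) for r in reward_vectors])
model_part = m_norm_sq(mu, f - pi * q)
noise_part = np.mean([m_norm_sq(mu, pi * q - pi * r) for r in reward_vectors])

print(direct, closed_form)
print(total, model_part + noise_part)
```

Only `model_part` depends on $F$, which is consistent with the decomposition: the choice of $(\beta,c)$ affects the variance only through the distance between $\vec{F}$ and $\pi\vec{Q}$ in the $M_{\mu}$ norm.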

A.3 Proof of Proposition 4

Proof.

The proof is similar to that of Proposition 1 for contextual bandits, with $T=1$.

For each $t\in[T]$, applying the Taylor expansion, we have

\[
\widehat{V}_{t}(\hat{\beta},\hat{\phi})=V_{t}(\beta^{*})+\mathbb{E}_{\mu}\Bigl[\frac{\partial\widehat{V}_{t}}{\partial\beta}\Bigr]^{\top}(\hat{\beta}-\beta^{*})+\mathbb{E}_{\mu}\Bigl[\frac{\partial\widehat{V}_{t}}{\partial\phi}\Bigr]^{\top}(\hat{\phi}-\phi)+o_{p}(n^{-1/2}),
\]

where

\[
V_{t}(\beta)=\frac{1}{n}\sum_{i=1}^{n}\rho^{(i)}_{0:t-1}\Bigl[\rho^{(i)}_{t}\bigl[r_{t}^{(i)}-\widehat{Q}^{(i)}_{t}(\beta)\bigr]+\widehat{V}^{(i)}_{t}(\beta)\Bigr]
\]

and $\hat{\beta}\xrightarrow{p}\beta^{*}$ for some $\beta^{*}$. As $\hat{\mu}$ is correctly specified, we have

\[
\mathbb{E}_{\mu}\Bigl[\frac{\partial\widehat{V}_{t}}{\partial\beta}\Bigr]=\frac{\partial}{\partial\beta}\mathbb{E}_{\mu}\Bigl[\rho_{0:t-1}\Bigl[-\frac{\pi(a_{t}|x_{t})}{\mu(a_{t}|x_{t})}\widehat{Q}(x_{t},a_{t};\beta^{*})+\widehat{V}(x_{t};\beta^{*})\Bigr]\Bigr]=0.
\]
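To make the per-step quantity $V_{t}(\beta)$ concrete, the following sketch evaluates it from logged trajectories. It assumes that $\rho_{t}=\pi(a_{t}|x_{t})/\mu(a_{t}|x_{t})$ and that $\rho_{0:t-1}=\prod_{s<t}\rho_{s}$ with the empty product equal to one; these definitions are not restated in this appendix. The function name `per_step_dr_terms` and the `(n, T)` array layout are hypothetical choices made for illustration.

```python
import numpy as np


def per_step_dr_terms(pi_probs, mu_probs, rewards, q_hat, v_hat):
    """Return the length-T vector of V_t values, averaged over n trajectories.

    All inputs are (n, T) arrays evaluated at the logged states and actions:
    pi_probs and mu_probs hold the target and logging probabilities of the
    action taken, q_hat holds Q-hat_t(x_t, a_t), v_hat holds V-hat_t(x_t).
    """
    rho = pi_probs / mu_probs                # per-step ratios rho_t
    cumulative = np.cumprod(rho, axis=1)     # rho_{0:t}
    ones = np.ones((rho.shape[0], 1))
    # Shift right by one step so that column t holds rho_{0:t-1} (1 at t = 0).
    rho_prev = np.concatenate([ones, cumulative[:, :-1]], axis=1)
    terms = rho_prev * (rho * (rewards - q_hat) + v_hat)
    return terms.mean(axis=0)


# Two toy trajectories of length 2; the second is on-policy (pi equals mu).
pi_probs = np.array([[0.5, 0.2], [0.3, 0.6]])
mu_probs = np.array([[0.25, 0.4], [0.3, 0.6]])
rewards = np.array([[1.0, 2.0], [0.0, 1.0]])
q_hat = np.array([[0.5, 1.0], [0.2, 0.4]])
v_hat = np.array([[0.3, 0.7], [0.1, 0.5]])

v_t = per_step_dr_terms(pi_probs, mu_probs, rewards, q_hat, v_hat)
print(v_t)
```

In this sketch the logging probabilities are passed in directly. With an estimated logging policy, `mu_probs` would be replaced by $\hat{\mu}(a_{t}|x_{t};\hat{\phi})$, which is the dependence on $\phi$ that the remainder of the proof linearizes.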

For the following term,

\[
\begin{split}
\mathbb{E}_{\mu}\Bigl[\frac{\partial\widehat{V}_{t}}{\partial\phi}\Bigr]&=\mathbb{E}_{\mu}\Bigl[\frac{\partial\rho_{0:t-1}}{\partial\phi}\bigl[\rho_{t}(r_{t}-\widehat{Q}_{t}(x_{t},a_{t};\beta^{*}))+\widehat{V}_{t}(x_{t},\beta^{*})\bigr]+\rho_{0:t-1}\frac{\partial\rho_{t}}{\partial\phi}(r_{t}-\widehat{Q}_{t}(x_{t},a_{t};\beta^{*}))\Bigr]\\
&=\mathbb{E}_{\mu}\Bigl[\frac{\partial\rho_{0:t-1}}{\partial\phi}\mathbb{E}_{\mu}\bigl[\rho_{t}(r_{t}-\widehat{Q}_{t}(x_{t},a_{t};\beta^{*}))+\widehat{V}_{t}(x_{t},\beta^{*})\bigm|\mathcal{H}_{t-1}\bigr]\Bigr]\\
&\quad+\mathbb{E}_{\mu}\Bigl[\rho_{0:t-1}\frac{\partial\rho_{t}}{\partial\phi}(r_{t}-\widehat{Q}_{t}(x_{t},a_{t};\beta^{*}))\Bigr]\\
&=\mathbb{E}_{\mu}\rho_{0:t-1}\mathbb{E}_{\mu}\Bigl[\frac{\partial\rho_{t}}{\partial\phi}(r_{t}-\widehat{Q}_{t}(x_{t},a_{t};\beta^{*}))\Bigm|\mathcal{H}_{t-1}\Bigr]\quad\text{(the first term equals zero)}\\
&=-\mathbb{E}_{\mu}\rho_{0:t-1}\mathbb{E}_{\pi}\Bigl[\frac{\dot{\hat{\mu}}(a_{t}|x_{t};\phi)}{\mu(a_{t}|x_{t})}\bigl(Q^{t}(x_{t},a_{t})-\widehat{Q}^{t}(x_{t},a_{t};\beta^{*})\bigr)\Bigr]:=\Gamma_{t}(\beta^{*}).
\end{split}
\]

For $\hat{\phi}-\phi$, from $U_n(\hat{\phi})=0$ we have $0=U(\phi)+\mathbb{E}[\dot{U}(\phi)](\hat{\phi}-\phi)+o_p(n^{-1/2})$, where

$$\sum_{t=0}^{T-1}\sum_{a\in\mathcal{A}}\mathbb{E}_{\mu}\Bigl[\frac{1}{\mu(a|x_t)^{2}}\,\dot{\hat{\mu}}(a|x_t;\phi)\,\dot{\hat{\mu}}(a|x_t;\phi)^{\top}\Bigr] := \Sigma_{\phi\phi}$$

and

$$\hat{\phi}-\phi=-\Sigma_{\phi\phi}^{-1}U(\phi)+o_p(n^{-1/2}).$$

$$\widehat{V}_t(\hat{\beta},\hat{\phi})=V_t(\beta^*)+\Gamma_t(\beta^*)^{\top}\Sigma_{\phi\phi}^{-1}U_n(\phi)+o_p(n^{-1/2}).$$

Plugging in the results above and denoting $c(\beta)=\sum_{t=0}^{T-1}\gamma^{t}\Sigma_{\phi\phi}^{-1}\Gamma_t(\beta)$, we have

$$
\begin{aligned}
\widehat{V}^{\text{DR}}(\hat{\beta},\hat{\phi})=\sum_{t=0}^{T-1}\gamma^{t}\widehat{V}_t(\hat{\beta},\hat{\phi})&=V(\beta^*)+c(\beta^*)^{\top}U_n(\phi)+o_p(n^{-1/2})\\
&=\frac{1}{n}\sum_{i=1}^{n}\eta^{(i)}(\beta^*,c(\beta^*))+o_p(n^{-1/2}),
\end{aligned}
$$

for $V(\beta)=\sum_{t=0}^{T-1}\gamma^{t}V_t(\beta)$ and $\eta^{(i)}(\beta,c)=\sum_{t=0}^{T-1}\gamma^{t}\eta^{(i)}_t(\beta,c)$ with

$$\eta_t^{(i)}(\beta,c)=\frac{1}{\mu(a_t^{(i)}|x_t^{(i)})}\bigl[\rho_{0:t-1}^{(i)}\pi(a_t^{(i)}|x_t^{(i)})r_t^{(i)}-F_t(x_t^{(i)},a_t^{(i)};\beta,c,\phi)\bigr]+\sum_{a\in\mathcal{A}}F_t(x_t^{(i)},a;\beta,c,\phi).$$
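As a hedged illustration of the per-step term $\eta_t^{(i)}$, the sketch below evaluates it for a single logged transition. The correction term $F_t$ is defined in the main text, so here it is simply passed in as an array `F` over actions; the function name `eta_t` and its argument names are assumptions made for this example and are not the authors' code.

```python
import numpy as np

def eta_t(mu, pi, a, r, rho_prev, F):
    """Per-step term eta_t for one logged transition.

    mu, pi   : action probabilities of the logging / target policy at x_t
    a        : index of the logged action a_t
    r        : observed reward r_t
    rho_prev : cumulative importance ratio rho_{0:t-1}
    F        : values F_t(x_t, a'; beta, c, phi) for every action a' (taken as given)
    """
    mu, pi, F = (np.asarray(v, dtype=float) for v in (mu, pi, F))
    # Inverse-probability-weighted part, evaluated at the logged action only
    weighted = (rho_prev * pi[a] * r - F[a]) / mu[a]
    # Model-based part, summed over all actions
    return weighted + F.sum()
```

With `F` set to zero the term reduces to the importance-weighted reward $\rho_{0:t} r_t$, and for any `F` the two `F` contributions cancel in expectation over $a_t \sim \mu$.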

A.4 Proof of Proposition 5

Proof.

Section 4.1 of Jiang and Li [2016] introduces an inductive definition for a DR OPE estimator with a known logging policy and a fixed value function model. Theorem 1 of the same paper gives the variance of this DR OPE estimator:

$$
n\mathrm{Var}[\widehat{V}]=\sum_{t=0}^{T-1}\gamma^{2t}\Bigl[\mathbb{E}_{\mu}\bigl[\mathrm{Var}_t\left[\rho_{0:t-1}V^{t}(x_t)\right]+\mathrm{Var}_{t+1}[\rho_{0:t}r_t]\bigr]+\mathbb{E}_{\mu}\bigl[\mathrm{Var}_t[\rho_{0:t}(Q^{t}(x_t,a_t)-\widehat{Q}(x_t,a_t))\,|\,x_t]\bigr]\Bigr],
$$

where $\mathrm{Var}_t$ refers to the conditional variance $\mathrm{Var}(\cdot\,|\,x_0,a_0,\dots,x_{t-1},a_{t-1})$. The proof of the theorem still holds for a more general class of $\widehat{Q}$, which takes the state-action trajectory $(x_0,a_0,\dots,x_t,a_t)$ and the time step $t$ as arguments. Therefore, following the same approach as the proof of Theorem 1, we obtain the following representation of the variance of our proposed DRUnknown for RL:

$$
\begin{aligned}
n\mathrm{Var}[\widehat{V}^{\text{DR}}]&=\sum_{t=0}^{T-1}\gamma^{2t}\Bigl[\mathbb{E}_{\mu}\bigl[\mathrm{Var}_t\left[\rho_{0:t-1}V^{t}(x_t)\right]+\mathrm{Var}_{t+1}[\rho_{0:t}r_t]\bigr]\\
&\quad+\mathbb{E}_{\mu}\bigl[\mathrm{Var}_t[\rho_{0:t}Q^{t}(x_t,a_t)-\mu(a_t|x_t)^{-1}F_t(x_t,a_t,\beta,c,\phi)\,|\,x_t]\bigr]\Bigr].
\end{aligned}
$$

The first term does not depend on the parameters $\beta$ and $c$, so we denote it as a constant

$$C_T=\sum_{t=0}^{T-1}\gamma^{2t}\,\mathbb{E}\bigl[\mathrm{Var}_t\left[\rho_{0:t-1}V^{t}(x_t)\right]+\mathrm{Var}_{t+1}[\rho_{0:t}r_t]\bigr].$$

The variance in the second term is conditioned on $(x_0,a_0,\dots,x_{t-1},a_{t-1},x_t)$, so the only remaining randomness comes from the action $a_t$ at time step $t$. Therefore, it can be calculated as in the proof of Proposition 2 and represented as a stochastic semi-norm, which gives the desired result:

$$
\begin{aligned}
&\mathrm{Var}_t\bigl[\rho_{0:t}Q^{t}(x_t,a_t)-\mu(a_t|x_t)^{-1}F_t(x_t,a_t,\beta,c,\phi)\,\big|\,x_t\bigr]\\
&=\mathrm{Var}_t\Bigl[\sum_{a\in\mathcal{A}}\Delta_a^{t}\Bigl[\rho_{0:t-1}\frac{\pi(a|x_t)}{\mu(a|x_t)}Q^{t}(x_t,a)-\mu(a|x_t)^{-1}F_t(x_t,a,\beta,c,\phi)\Bigr]\,\Big|\,x_t\Bigr]\\
&=\bigl\|\vec{F}_t(x_t;\beta,c,\phi)-\rho_{0:t-1}\pi\vec{Q}^{t}(x_t)\bigr\|^{2}_{M_{t,\mu}}.
\end{aligned}
$$
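The last equality can be checked numerically in a small case. The sketch below assumes the weighting matrix has the form $M = \mathrm{diag}(1/\mu) - \mathbf{1}\mathbf{1}^{\top}$, which is what the variance of an inverse-probability-weighted categorical draw produces; the paper's $M_{t,\mu}$ is defined in the main text and may differ in notation or scaling. The vector `g` stands in for $\vec{F}_t - \rho_{0:t-1}\pi\vec{Q}^{t}$, and all names are illustrative.

```python
import numpy as np

def ipw_variance(g, mu):
    """Exact variance of g[a] / mu[a] when the action a is drawn from mu."""
    g, mu = np.asarray(g, dtype=float), np.asarray(mu, dtype=float)
    z = g / mu
    mean = np.sum(mu * z)
    return np.sum(mu * (z - mean) ** 2)

def seminorm_sq(g, mu):
    """Quadratic form g^T M g with the assumed M = diag(1/mu) - 1 1^T."""
    g, mu = np.asarray(g, dtype=float), np.asarray(mu, dtype=float)
    M = np.diag(1.0 / mu) - np.ones((mu.size, mu.size))
    return g @ M @ g

mu = np.array([0.2, 0.5, 0.3])
g = np.array([0.4, -1.0, 0.7])  # stands in for F_t - rho_{0:t-1} * pi * Q^t
```

Under this assumed form the two functions agree for any `g`, and the quadratic form vanishes when `g` is proportional to `mu`, which is why the quantity is a semi-norm rather than a norm.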

A.5 Proof of Proposition 7

Proof.

Theorem 3 of Kallus and Uehara [2020] states that in the OPE problem, the DR estimator with the true logging policy $\mu$ and value function $Q$, given by

$$\widehat{V}^{\text{opt}}=\frac{1}{n}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\gamma^{t}\rho_{0:t-1}^{(i)}\bigl[\rho_t^{(i)}[r_t^{(i)}-Q^{t}(x_t^{(i)},a_t^{(i)})]+V^{t}(x_t^{(i)})\bigr],$$

achieves the smallest asymptotic variance, reaching the semiparametric lower bound, for the case with a discount factor $\gamma=1$. When the value function model is also correctly specified, the proposed DRUnknown is asymptotically equivalent to $\widehat{V}^{\text{opt}}$, since $\Gamma_t(\beta^*)=0$ for all $t$ and hence $c(\beta^*)=0$. For the general problem with $\gamma<1$, we can modify the MDP by fixing the discount factor to 1, incorporating the time step $t$ into the state variable $x$, and changing the reward function $R(x,a)$ to $\gamma^{t}R(x,a)$. ∎
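The reduction from $\gamma<1$ to $\gamma=1$ can be illustrated on a single reward sequence. This is a minimal sketch with hypothetical helper names: the time index plays the role of the extra state component, and the rescaled rewards $\gamma^{t}r_t$ are summed without discounting.

```python
import numpy as np

def fold_discount(rewards, gamma):
    """Rewards gamma^t * r_t of the time-augmented MDP, whose discount factor is 1."""
    rewards = np.asarray(rewards, dtype=float)
    t = np.arange(rewards.size)  # the time step carried in the augmented state
    return gamma ** t * rewards

def discounted_return(rewards, gamma):
    """sum_t gamma^t r_t, accumulated backwards."""
    total = 0.0
    for r in reversed(list(rewards)):
        total = r + gamma * total
    return total
```

The undiscounted return of `fold_discount(rewards, gamma)` equals the discounted return of `rewards`, so the value being estimated is unchanged by the modification.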

A.6 Proof of Proposition 8

Proof.

The IPW [Horvitz and Thompson, 1952], DR [Dudík et al., 2011, Jiang and Li, 2016], and MRDR [Farajtabar et al., 2018] estimators were originally designed for the OPE problem with a known logging policy $\mu$. Therefore, for comparison, we assume that the logging policy model is estimated by MLE, as in our proposed estimator. All three estimators are asymptotically equivalent to $\widetilde{V}(\beta,c)$ for some value of $\beta$ and fixed $c=0$: because none of them accounts for the estimation effect of $\hat{\phi}$, they all have $c=0$.

IPW sets $\beta$ such that the value function model $\widehat{Q}$ becomes zero, and the DR estimator finds $\beta$ by minimizing the least-squares error for the value function. MRDR minimizes the variance of the estimator only with respect to $\beta$, with $c$ fixed to zero. The class of estimators contains all of these estimators, and the proposed DRUnknown achieves the smallest asymptotic variance within this class, so it is at least as efficient as each of them. ∎

Appendix B Estimation of $\hat{\phi}$ with General Estimating Equation

The maximum likelihood estimator used to estimate $\hat{\phi}$ in this work is most efficient in many situations. However, the theoretical results in this paper are applicable to other estimating equations as given below,

$$U_n(\phi)=\sum_{i=1}^{n}\sum_{t=0}^{T-1}\sum_{a\in\mathcal{A}}\bigl(\Delta_{a,t}^{(i)}-\mu(a|x_t^{(i)})\bigr)h(x_t^{(i)},a;\phi)=0,$$

for any smooth function $h$. The equation $U_n(\phi)=0$ is an unbiased estimating equation as long as the logging policy model is correctly specified. A possible choice is to assign more weight to the state-action pair with high probability in $\pi$. This estimator remains consistent, satisfies the theoretical properties, and may be particularly useful for scenarios with a small-sized finite sample.
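To make the estimating function concrete, the sketch below evaluates $U_n(\phi)$ in the simplest possible setting. It assumes a context-free softmax logging policy with one logit per action and a single time step ($T=1$), which is a simplification chosen for this example rather than the paper's model; `estimating_function` and `unit_h` are hypothetical names.

```python
import numpy as np

def softmax(phi):
    z = np.exp(phi - np.max(phi))
    return z / z.sum()

def estimating_function(phi, actions, h):
    """U_n(phi) = sum_i sum_a (Delta_a^(i) - mu(a; phi)) * h(a; phi).

    actions : logged action indices (context-free case, T = 1)
    h       : callable (a, phi) -> vector with the same length as phi
    """
    mu = softmax(phi)
    counts = np.bincount(actions, minlength=phi.size)
    n = len(actions)
    # Summing (Delta_a - mu_a) over episodes gives counts[a] - n * mu[a]
    return sum((counts[a] - n * mu[a]) * h(a, phi) for a in range(phi.size))

def unit_h(a, phi):
    """h(a; phi) = e_a, which reproduces the softmax log-likelihood score."""
    e = np.zeros(phi.size)
    e[a] = 1.0
    return e
```

With `unit_h`, the root of the estimating function is the vector of log empirical action frequencies (up to an additive constant), which is the MLE in this toy model; other choices of `h` reweight the same residuals.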

Appendix C Description of Experimental Settings

C.1 Simulation Data

To build the synthetic dataset for the simulation experiment, we generate elements for the context vectors $x$ and the coefficient vector $\beta$ from the uniform distribution $U(-1/\sqrt{d},1/\sqrt{d})$. The reward mean is determined by the nonlinear function $\exp(x^{\top}\beta)$, making the linear value function model incorrectly specified.

For the logging policy $\mu$ and the target policy $\pi$, we generate the coefficients $\phi_{\mu}$ and $\phi_{\pi}$ from the uniform distributions $U(-1/\sqrt{d},1/\sqrt{d})$ and $U(-2/\sqrt{d},2/\sqrt{d})$, respectively.
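A data generator along these lines might look as follows. Several details are not stated in this appendix and are assumptions of the sketch: one context vector per action, a softmax form for both policies, and the names `make_synthetic` and `softmax_policy`.

```python
import numpy as np

def make_synthetic(n, d, n_actions, seed=0):
    """Draw contexts and coefficients from the uniform ranges described above."""
    rng = np.random.default_rng(seed)
    b = 1.0 / np.sqrt(d)
    x = rng.uniform(-b, b, size=(n, n_actions, d))  # one context per action (assumed)
    beta = rng.uniform(-b, b, size=d)
    phi_mu = rng.uniform(-b, b, size=d)             # logging policy coefficients
    phi_pi = rng.uniform(-2 * b, 2 * b, size=d)     # target policy coefficients
    reward_mean = np.exp(x @ beta)                  # nonlinear in x, shape (n, n_actions)
    return x, beta, phi_mu, phi_pi, reward_mean

def softmax_policy(x, phi):
    """Action probabilities proportional to exp(x_a^T phi) (assumed policy form)."""
    logits = x @ phi
    logits = logits - logits.max(axis=-1, keepdims=True)
    p = np.exp(logits)
    return p / p.sum(axis=-1, keepdims=True)
```

Because every coordinate lies in $[-1/\sqrt{d},1/\sqrt{d}]$, the inner product $x^{\top}\beta$ is bounded by 1 in absolute value, so the reward mean stays within $[e^{-1}, e]$.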

C.2 UCI Dataset

The six datasets used in these experiments were initially designed for the classification problem. To transform the problem into a bandit setting, we interpret the assignment of the class label as the selection of an arm in a bandit, with a reward of 1 if the class is correct and 0 if incorrect. We train the classifier $\mu_0$, which returns the probability of each class label given a context vector $x$. We treat $\mu_0$ as a policy and combine it with a uniform random policy with rates $\alpha$ and $1-\alpha$. The value function model is constant and can be regarded as an intercept value.
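The conversion from classification to bandit feedback can be sketched as below. The classifier output is taken as a given probability matrix, and the helper names `mix_with_uniform` and `log_bandit_feedback` are assumptions for illustration only.

```python
import numpy as np

def mix_with_uniform(probs, alpha):
    """Policy alpha * mu_0 + (1 - alpha) * uniform, applied row by row."""
    probs = np.asarray(probs, dtype=float)
    k = probs.shape[1]
    return alpha * probs + (1.0 - alpha) / k

def log_bandit_feedback(policy_probs, labels, seed=0):
    """Sample one arm per context; the reward is 1 when the arm equals the true label."""
    rng = np.random.default_rng(seed)
    n, k = policy_probs.shape
    actions = np.array([rng.choice(k, p=policy_probs[i]) for i in range(n)])
    rewards = (actions == np.asarray(labels)).astype(float)
    return actions, rewards
```

Mixing with the uniform policy keeps every arm's probability at least $(1-\alpha)/K$ for $K$ classes, which keeps the importance ratios bounded.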

C.3 ModelWin and ModelFail

C.3.1 ModelWin

ModelWin consists of three states, starting from state $s_1$. When choosing action $a_1$, the agent moves to $s_2$ with a probability of 0.6 and to $s_3$ with a probability of 0.4. Conversely, action $a_2$ leads to $s_2$ with a probability of 0.4 and to $s_3$ with a probability of 0.6. In states $s_2$ and $s_3$, both actions return the agent to $s_1$ with a probability of 1. If the agent visits $s_2$ or $s_3$, it receives rewards of 1 and $-1$, respectively. The horizon is fixed at $T=20$.
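A rollout of ModelWin could be simulated as in the sketch below. Two points are assumptions of the example: the reward is credited on the step that arrives at $s_2$ or $s_3$, and the same action probability `p_a1` is used in every state (the action has no effect in $s_2$ and $s_3$). The function name is hypothetical.

```python
import numpy as np

# P(next state = s2 | action taken at s1); index 0 is a1, index 1 is a2
P_S2 = {0: 0.6, 1: 0.4}

def rollout_modelwin(p_a1, horizon=20, seed=0):
    """Simulate one ModelWin episode and return states, actions, and rewards."""
    rng = np.random.default_rng(seed)
    state = 1
    states, actions, rewards = [], [], []
    for _ in range(horizon):
        action = 0 if rng.random() < p_a1 else 1
        if state == 1:
            next_state = 2 if rng.random() < P_S2[action] else 3
            reward = 1.0 if next_state == 2 else -1.0  # reward on arrival (assumed timing)
        else:
            next_state, reward = 1, 0.0                # s2 and s3 always return to s1
        states.append(state)
        actions.append(action)
        rewards.append(reward)
        state = next_state
    return states, actions, rewards
```

Under these assumptions, a policy that picks $a_1$ with probability 0.7 reaches $s_2$ with probability $0.7\times0.6+0.3\times0.4=0.54$, so each of the 10 visits to $s_1$ within $T=20$ has expected reward $0.08$ and the undiscounted value is $0.8$.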

C.3.2 ModelFail

ModelFail is an MDP with 4 states, but the learner cannot observe the current state of the agent. Starting from $s_1$, action $a_1$ leads to the upper middle state $s_2$, and $a_2$ to the lower middle state $s_3$. From both, any action moves to the terminal state $s_4$. If the transition is from the upper state, a reward of 1 is received; otherwise, a reward of $-1$. The horizon is always $T=2$.

For both environments, the target policy at $s_1$ selects action $a_1$ with a probability of 0.7 and $a_2$ with a probability of 0.3. The logging policy chooses action $a_1$ with a probability of 0.75 and $a_2$ with a probability of 0.25. The parameter of the logging policy model $\hat{\mu}$ for both environments is the probability of choosing $a_1$, and the value function model $\widehat{Q}$ is given by the linear model with intercepts.
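The role of the estimated logging policy can be seen in a simplified version of ModelFail. The sketch below is not the DRUnknown estimator: it is plain IPW, it treats the episode as a single decision at $s_1$, and it ignores the importance ratio of the second action, which does not affect the reward. All function names are assumptions.

```python
import numpy as np

def simulate_modelfail(n, p_a1_logging=0.75, seed=0):
    """First-step actions (0 = a1, 1 = a2) and episode returns under the logging policy."""
    rng = np.random.default_rng(seed)
    actions = (rng.random(n) >= p_a1_logging).astype(int)
    returns = np.where(actions == 0, 1.0, -1.0)  # +1 via the upper state, -1 via the lower
    return actions, returns

def ipw_estimate(actions, returns, p_a1_target, p_a1_logging):
    """Importance-weighted mean return for the one-decision simplification."""
    pi = np.where(actions == 0, p_a1_target, 1.0 - p_a1_target)
    mu = np.where(actions == 0, p_a1_logging, 1.0 - p_a1_logging)
    return np.mean(pi / mu * returns)

actions, returns = simulate_modelfail(n=1000)
p_hat = np.mean(actions == 0)  # MLE of the logging probability of a1
v_known = ipw_estimate(actions, returns, 0.7, 0.75)
v_estimated = ipw_estimate(actions, returns, 0.7, p_hat)
```

In this toy case the target value is $0.7-0.3=0.4$. Plugging in `p_hat` makes the weighted sum collapse algebraically to $0.4$ whenever both actions occur in the log, whereas `v_known` fluctuates around $0.4$; this is a property of the simplified example and only hints at why accounting for the estimated logging policy matters.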

Appendix D Limitations

This paper primarily focuses on the asymptotic properties of the proposed estimator for the OPE problem with an unknown logging policy. We do not extensively explore the estimator’s behavior with a finite sample. This aligns with similar statistical works, such as Cao et al. [2009], which addresses estimated missing mechanisms without providing finite-sample theory. Additionally, current DR OPE methods do not address scenarios with an unknown logging policy.