Doubly-Robust Off-Policy Evaluation with Estimated Logging Policy

Kyungbok Lee
Graduate School of Data Science
Seoul National University
[email protected]
&Myunghee Cho Paik

Shepherd23 Inc.
[email protected]

Abstract

We introduce a novel doubly-robust (DR) off-policy evaluation (OPE) estimator for Markov decision processes, DRUnknown, designed for situations where both the logging policy and the value function are unknown. The proposed estimator initially estimates the logging policy and then estimates the value function model by minimizing the asymptotic variance of the estimator while considering the estimating effect of the logging policy. When the logging policy model is correctly specified, DRUnknown achieves the smallest asymptotic variance within the class containing existing OPE estimators. When the value function model is also correctly specified, DRUnknown is optimal as its asymptotic variance reaches the semiparametric lower bound. We present experimental results conducted in contextual bandits and reinforcement learning to compare the performance of DRUnknown with that of existing methods.

1 Introduction

In various decision-making problems, estimating the value, the expected reward of a policy is a crucial question that needs to be addressed. Online evaluation requiring a comprehensive evaluation of policy value can be expensive and may not be applicable to multiple target policies.

Alternatively, off-policy evaluation (OPE) refers to a technique that estimates the value of a target policy by utilizing log data generated from a different logging policy. This approach has attracted considerable interest in the domains of contextual bandits (CB) (Dudík et al., 2011; Swaminathan et al., 2017) and reinforcement learning (RL) (Precup, 2000; Mahmood et al., 2014; Jiang and Li, 2016).

Several off-policy evaluation algorithms (Dudík et al., 2011; Thomas and Brunskill, 2016; Wang et al., 2017; Farajtabar et al., 2018; Su et al., 2020) currently in use rely on having complete knowledge of the logging policy in order to utilize inverse probability weighting (IPW). However, in situations where information about the data logger is not available, such as when using human decision data, it becomes necessary to estimate the logging policy. Applying existing estimators directly to cases where the logging policy is unknown can compromise the desired properties of these methods, as they fail to take into account the impact of logging policy estimation.

In this paper, we present a novel efficient estimator called DRUnknown, which simultaneously estimate both the logging policy model and the value function model. Our proposed approach, without the need for external data, effectively captures the interdependency between the two models.

In the field of conventional statistics, Cao et al. (2009) employs the influence function to estimate parameters in the doubly-robust (DR) estimator for population means when there are missing observations. The study focused on situations where data is incomplete and the missing mechanisms are unknown and estimated. To advance this concept, we utilized the influence function approach to address the OPE problem. Here, we treat the rewards of unselected actions as missing values and estimate the unknown logging policy. This enables us to derive the asymptotic variance of the OPE estimator and estimate the parameters that minimize it.

The proposed OPE estimator estimates both the logging policy model and the value function model, and is referred to as doubly-robust due to its consistency if either the model for the logging policy or the value function is correctly specified. When the logging policy is correctly specified, the proposed estimator is the most efficient among the class of DR OPE estimators with an estimated logging policy, and is at least as efficient as existing methods. Moreover, if the value function model is correctly specified, the estimator locally efficient, achieving the semiparametric lower bound for asymptotic variance and is asymptotically optimal. In order to demonstrate the effectiveness of the proposed estimator, we conducted simulation experiments to compare its performance with previous methods in contextual bandits and reinforcement learning problems.

The main contributions of this paper are as follows:

•

We present a new DR OPE estimator for Markov decision process, for the case where both the logging policy and the value function are unknown. The proposed estimator is consistent when either the logging policy model or the value function model is correctly specified.
•

We propose a method to estimate both the logging policy model parameter $\hat{\phi}$ and the value function model parameter $\hat{\beta}$ simultaneously. We use the maximum likelihood estimator (MLE) for $\hat{\phi}$ and minimize the asymptotic variance to estimate $\hat{\beta}$ , accounting for the impact of estimating $\hat{\phi}$ .
•

The proposed estimator has the smallest asymptotic variance among estimators using MLE for $\hat{\phi}$ when the logging policy model is correctly specified. When the value function model is also correctly specified the proposed estimator is optimal, as its asymptotic variance reaches the semiparametric lower bound.
•

Simulations on contextual bandits and reinforcement learning problems show that the proposed estimator consistently shows smaller mean-squared errors compared to the benchmarks methods.

2 Problem Setup

In this paper, we model the decision-making problem as a reinforcement learning framework in which the learner’s interaction with the system is represented as a Markov Decision Process (MDP). In this section, we provide definitions of MDP and the corresponding off-policy evaluation problem for our study.

2.1 Markov Decision Processes

An MDP is defined as a tuple ( $\mathcal{X}$ , $\mathcal{A}$ , $R,P,P_{0},\gamma$ ), where $\mathcal{X}$ and $\mathcal{A}$ represent the state and the finite action spaces with $|\mathcal{A}|=K$ , $R(x,a)$ denotes the distribution of the random variable $r(x,a)$ of the bounded immediate reward of taking action $a$ in state $x$ , $P(\cdot|x,a)$ is the transition probability distribution. $P_{0}:X\rightarrow[0,1]$ is the initial state distribution, and $\gamma\in[0,1]$ is the discounting factor. The learner utilizes a policy $\pi:\mathcal{X}\times\mathcal{A}\rightarrow[0,1]$ , a stochastic map** from states to actions. Here, $\pi(a|x)$ represents the probability of taking action $a$ in state $x$ .

Let $\mathcal{H}_{T}$ = $\{(x_{t},a_{t},r_{t})\}_{t=0}^{T-1}$ represent a $T$ -step trajectory containing state-action-rewards generated by policy $\pi$ . Denote $R_{t_{1}:t_{2}}=\sum\limits_{t=t_{1}}^{t_{2}}\gamma^{t-t_{1}}r_{t}$ as the sum of discounted return of from time step $t_{1}$ to $t_{2}$ . Our goal is to estimate the policy value, the expected return of trajectories generated by policy $\pi$ , i.e., $V^{\pi,T}=\text{$\mathbb{E}$}_{\pi}[R_{0:T-1}]$ . Here $\text{$\mathbb{E}$}_{\pi}$ denotes the expectation over the trajectory sampled from $\pi$ . We also define the value function of a policy $\pi$ at state $x$ and at state-action pair $(x,a)$ from time step $t$ by $V^{\pi,t}(x)=\text{$\mathbb{E}$}_{\pi}[R_{t:T-1}|x_{t}=x]$ and $Q^{\pi,t}(x,a)=\text{$\mathbb{E}$}_{\pi}[R_{t:T-1}|x_{t}=x,a_{t}=a]$ , respectively.

Throughout this paper, we fix the trajectory length as $T$ . We also omit the $\pi$ from $V^{\pi,t}(s)$ and $Q^{\pi,t}(s,a)$ to simplify them as $V^{t}(s)$ and $Q^{t}(s,a)$ , respectively, since we are only interested in the value of $\pi$ . Note that, unlike the case with $T=\infty$ , where the $Q$ and $V$ value functions remain unchanged, the value functions $Q^{t}$ and $V^{t}$ can vary for each time step $t$ in general.

The contextual bandits (CB) is a special case of the MDP, in which the trajectory length $T$ equals to $1$ , and the transition dynamics $P$ do not exist. In CB, the state is represented as the context, and the action is called as the arm.

2.2 Off-Policy Evaluation Problem

In an OPE problem, we are given a dataset $\mathcal{D}_{n}=\{\mathcal{H}_{T}^{(i)}\}_{i=1}^{n}$ comprising $n$ independent $T$ -step trajectories drawn from a logging policy $\mu$ . We use the superscript $(i)$ to refer to the data from the $i^{\text{th}}$ trajectory, denoted as $\mathcal{H}_{T}^{(i)}$ . The objective is to estimate the policy value $V^{\pi}$ of a separate target policy $\pi$ . We define the importance ratio $\rho_{t}=\pi(a_{t}|x_{t})/\mu(a_{t}|x_{t})$ , and the cumulative importance ratio from time step $t_{1}$ to $t_{2}$ between $\mu$ and $\pi$ as $\rho_{t_{1}:t_{2}}=\Pi_{t=t_{1}}^{t_{2}}\rho_{t}$ . In cases where $t_{1}>t_{2}$ , we define $\rho_{t_{1}:t_{2}}$ to be 1. Similarly, we denote the estimated importance ratio as $\hat{\rho}$ , where the probability of the logging policy $\mu(a|x)$ is replaced with its estimated version $\hat{\mu}(a|x;\hat{\phi})$ for the logging policy model $\hat{\mu}(\cdot;\phi)$ .

The policy value $\widehat{V}$ is a point estimator, and there are various metrics to assess the quality of the provided estimator. One commonly used is the mean squared error (MSE), defined as $\text{MSE}(\widehat{V})=\text{$\mathbb{E}$}_{\mu}[{(\widehat{V}-V^{\pi})^{2}}]$ with the expectation taken with respect to the data sampled from $\mu$ . MSE quantifies the performance of a given estimator in finite samples. Another measure is the asymptotic variance, which assesses the efficiency of an estimator with a large number of samples. The estimator $\widehat{V}$ is called asymptotically linear if there exists a function $\psi$ of trajectory $\mathcal{H}_{T}$ such that

\widehat{V}-V^{\pi}=\frac{1}{n}\sum_{i=1}^{n}\psi(\mathcal{H}_{T}^{(i)})+o_{p}% (\frac{1}{\sqrt{n}}).

with $\text{$\mathbb{E}$}_{\mu}{[\psi(\mathcal{H}_{T})]}=0$ and $\mathrm{Var}_{\mu}{[\psi(\mathcal{H}_{T})^{2}]}<\infty$ . The quantity $\mathrm{Var}_{\mu}{[\psi(\mathcal{H}_{T})^{2}]}$ is denoted as the asymptotic variance of $\widehat{V}$ , and the function $\psi(\mathcal{H}_{T})$ is referred to as the influence function. In our study, we derive the specific form of the influence function for OPE estimators and conduct an analysis to minimize the asymptotic variance.

For our analysis, we use the following standard regularity assumptions for OPE problem.

Assumption 1 (Absolute Continuity).

For all state-action pair $(x,a)\in\mathcal{X}\times\mathcal{A}$ , if $\pi(a|x)>0$ then $\mu(a|x)>0$ .

Assumption 2 (Square Integrable).

$\text{$\mathbb{E}$}_{\mu}{[\rho_{0:t}^{2}]}<\infty$ for all $t<T$ .

Assumption 1 is crucial in order to avoid an infinite IPW and the OPE estimators which employ IPW are well-defined. Assumption 2 is necessary to guarantee that estimators utilizing the IPW method have finite variance.

3 Previous Methods for Off-Policy Evaluation

The doubly-robust type estimator for the OPE problem was first proposed by Dudík et al. (2011). The estimator depends on a regression model for value function that is estimated using a separate independent dataset, which may not always be available.

The works of Thomas and Brunskill (2016) and Wang et al. (2017) introduce estimators that improve efficiency by modifying the IPW component of the DR OPE estimator. These methods still require the independently estimated value function using external data. Farajtabar et al. (2018) employed the idea of Rubin and van der Laan (2008), which estimates the value function model by minimizing the estimation variance without external dataset. All these methods assume that the true logging policy is known and do not require estimation.

Several studies (Li et al., 2015; Raghu et al., 2018; Xie et al., 2019; Hanna et al., 2021) have proposed the IPW estimators for OPE utilizing the estimated logging policy and examined their theoretical and numerical properties of these estimators. However, these methods do not consider the value function, making them not doubly-robust and suboptimal in terms of asymptotic variance.

3.1 Standard Methods for Off-Policy Evaluation

There are three standard approaches to estimate $V^{\pi}$ in OPE problem, and we provide a brief overview of these methods.

The first method is direct method (DM) (Dudík et al., 2011; Rothe, 2016), which directly approximates the state-action value function $Q(x,a)$ of the target policy. The DM does not use the information from the logging policy $\mu$ and relies heavily on the accuracy of the prediction of the value function. While the DM estimators often exhibits low variance, it is not guaranteed to be unbiased.

The second method is called inverse probability weighting (IPW) estimator (Horvitz and Thompson, 1952), which utilize the importance ratio term to correct the discrepancy between the target and logging policies. The IPW estimator is unbiased in case the logging policy is known, but it can exhibit a large variance when there is a significant difference between the logging and target policies and the importance ratio have large value. The Assumption 1 is required in order for the IPW estimator to be well-defined: one cannot know about the pair $(x,a)$ never explored by the logging policy.

The third approach, named doubly-robust (DR) estimator (Cassel et al., 1976; Robins and Rotnitzky, 1995; Dudík et al., 2011; Jiang and Li, 2016), combines the DM and IPW, leveraging their respective strengths to obtain favorable characteristics. The IPW estimator is a special case of DR with the value function fixed to zero. Our work focuses on the DR type estimators, as our main objective is to introduce an efficient method for simultaneously estimating the logging policy and the value function model.

3.2 Doubly-Robust Off-Policy Evaluation with Unknown Logging Policy

The standard DR OPE estimator with known logging policy is given by

\widehat{V}^{\text{DR}}=\frac{1}{n}\sum\limits_{i=1}^{n}\sum\limits_{t=0}^{T-1% }\gamma^{t}\rho_{0:{t-1}}^{(i)}\bigl{[}\rho_{t}^{(i)}[r_{t}^{(i)}-\widehat{Q}(% x_{t}^{(i)},a_{t}^{(i)};\beta)]+\widehat{V}(x_{t}^{(i)};\beta)\bigr{]},

(1)

where $\widehat{Q}(\cdot;\beta)$ is a value function model of the state-action value function parameterized by $\beta$ , and $\widehat{V}(x;\beta)=\sum_{a\in\mathcal{A}}\pi(a|x)\widehat{Q}(x,a;\beta).$

Here, $\widehat{Q}$ depends solely on the state-action pair. However, as mentioned in Section 2.1, the $Q^{t}(x,a)$ may vary for different time steps, even if the state-action pair remains the same when the horizon length $T$ is finite. Hence, we cannot ensure that the model $\widehat{Q}$ can express the $Q^{t}$ for all $t$ in the OPE problem with a finite $T$ . To address this issue, we utilize the function class $\widehat{Q}(t,x,a)$ , which incorporates $t$ as an input parameter in practice. Also, to ensure that the class of DR OPE estimator contains the IPW estimator, we assume that the class of $\widehat{Q}$ contains the constant functions. This can be easily satisfied by bringing the parameters $\nu_{0},\nu_{1}\in\mathbb{R}$ to extend the function class by $\nu_{0}+\nu_{1}\widehat{Q}$ .

The DR OPE estimator (1) utilizes the importance ratio using the true probability of the logging policy. When the logging policy $\mu$ is unknown we use the logging policy model $\hat{\mu}(\cdot;\hat{\phi})$ to approximate the unknown $\mu$ . We say that the model $\hat{\mu}$ is correctly specified if there exists a value of $\phi$ such that $\hat{\mu}(x,a;\phi)=\mu(x,a)$ for all $x$ and $a$ . The correct specification of $\hat{\mu}$ is necessary for the consistent estimation of the policy value $V^{\pi}$ , as correctly specifying the value function model $\widehat{Q}$ is difficult in many OPE problems.

In the following sections, our focus will be on the DR OPE estimator, which estimates both the logging policy parameter $\phi$ and the value function parameter $\beta$ . Our goal is to build an efficient OPE estimator that exhibits lower asymptotic variance compared to existing methods.

4 Proposed Method: DRUnknown

In this section, we present our class of DRUnknown estimators. The central concept of DRUnknown is to first estimate the unknown logging policy parameter $\hat{\phi}$ for the IPW component of the DR estimator, and then learn the value function parameter $\hat{\beta}$ to minimize the asymptotic variance of the estimator. We utilize the influence function of the estimator to derive a feasible objective function to minimize the asymptotic variance of DRUnknown.

4.1 DRUnknown for Contextual Bandits

We first present the DRUnknown for contextual bandits problem with $T=1$ . Consider a logging policy model $\hat{\mu}(x,a;\phi)$ parameterized by $\phi$ . Let $\Delta_{a}^{(i)}$ represent the indicator for an action $a$ chosen in trajectory $i$ , and denote the partial derivative of $\hat{\mu}$ with respect to $\phi$ as $\dot{\hat{\mu}}(a|x;\phi)$ . The maximum-likelihood estimator (MLE) $\hat{\phi}=\hat{\phi}_{n}$ is given by the solution of the estimating equation $U_{n}(\phi)$ given by

U_{n}(\phi)=\sum\limits_{i=1}^{n}\sum\limits_{a\in\mathcal{A}}\Delta_{a}^{(i)}% \frac{\dot{\hat{\mu}}(a|x^{(i)};\phi)}{\hat{\mu}(a|x^{(i)};\phi)}=0.

If the logging policy model is correctly specified, the estimating equation is unbiased, and $\hat{\phi}$ is consistent for $\phi$ .

Plugging in the estimated value of $\hat{\phi}$ , we obtain the following class of DR OPE estimator

\widehat{V}^{\text{DR}}(\beta,\hat{\phi})=\frac{1}{n}\sum\limits_{i=1}^{n}% \Bigl{[}\frac{\pi^{(i)}}{\hat{\mu}^{(i)}}(r^{(i)}-\widehat{Q}^{(i)}(\beta))+% \widehat{V}^{(i)}(\beta)\Bigr{]}

where the notations $\pi^{(i)},\hat{\mu}^{(i)},\widehat{Q}^{(i)}(\beta)$ and $\widehat{V}^{(i)}(\beta)$ denote the $\pi(a^{(i)}|x^{(i)})$ , $\hat{\mu}(a^{(i)}|x^{(i)};\hat{\phi})$ , $\widehat{Q}(x^{(i)},a^{(i)};\beta)$ and $\widehat{V}(x^{(i)};\beta)$ respectively. Henceforth in the paper, if there is no ambiguity, we simplify terms containing $x^{(i)}$ and $a^{(i)}$ by using only the superscript $(i)$ .

The expression of the bias and variance of the estimator $\widehat{V}^{\text{DR}}(\beta,\hat{\phi})$ is not simple in general, even for a fixed $\beta$ . The estimator is not a mean of the independent and identically distributed (i.i.d) variables, due to the estimated $\hat{\phi}$ in the denominator. Consequently, estimating the regression parameter $\beta$ to minimize the MSE of the proposed estimator poses challenges. Hence, we instead aim to find $\hat{\beta}$ minimizing the asymptotic variance of the estimator.

To attain this, we derive the influence function asymptotically equivalent to $\widehat{V}$ , through a Taylor expansion with respect to $\phi$ and $\beta$ . Denote the new regression function $F$ , the product of the target policy $\pi$ and the $\widehat{Q}$ with an additional estimating effect term of $\phi$ , given by

F(x,a;\beta,c,\phi)=\pi(a|x)\widehat{Q}(x,a;\beta)+c^{\top}\dot{\hat{\mu}}(a|x% ;\phi),

for $\beta$ and $c\in\mathbb{R}^{\text{dim}(\phi)}.$ The influence function of $\widehat{V}^{\text{DR}}(\hat{\beta},\hat{\phi})$ with arbitrary estimator $\hat{\beta}$ can be formulated employing $F$ , as described in the following proposition.

Proposition 1 (Asymptotic Equivalence of $\widehat{V}$ for CB).

Consider an estimator $\hat{\beta}$ converging to some $\beta^{*}$ in probability. If the model $\hat{\mu}(\cdot;\phi)$ is correctly specified, the DR OPE estimator is asymptotically linear with influence function $\eta$ :

\widehat{V}^{\text{DR}}(\hat{\beta},\hat{\phi})=\widetilde{V}(\beta^{*},c(% \beta^{*}))+o_{p}(n^{-1/2})=\frac{1}{n}\sum\limits_{i=1}^{n}\eta^{(i)}(\beta,c% (\beta^{*}))+o_{p}(n^{-1/2}),

where $c(\beta)$ is a vector that solely depends on $\beta$ , and

\eta^{(i)}(\beta,c)=\frac{1}{\mu^{(i)}}\bigl{[}\pi^{(i)}r^{(i)}-F^{(i)}(\beta,% c,\phi)\bigr{]}\\ +\sum\limits_{a\in\mathcal{A}}F(x^{(i)},a;\beta,c,\phi).

As the value of the vector $c(\beta^{*})$ is unknown, we estimate both $\beta$ and $c$ that minimize the variance of $\widetilde{V}(\beta,c)$ . Denote the vector $\vec{F}(x;\beta,c,\phi)=(F(x,a;\beta,c,\phi))_{a\in\mathcal{A}}$ and $\pi\vec{Q}(x)=(\pi(a|x)Q(x,a))_{a\in\mathcal{A}}$ , and the gradient matrix $f(x;\beta,c,\phi)=\displaystyle\frac{\partial\vec{F}}{\partial(\beta,c)}.$ By the law of total variance, the variance of $\widetilde{V}^{\text{DR}}(\beta,c)$ can be expressed as the square of stochastic seminorm added by a constant independent of $\beta$ and $c$ .

Proposition 2 (Variance of $\widetilde{V}^{\text{DR}}$ ).

The variance of $\widetilde{V}(\beta,c)$ is given by

n\mathrm{Var}{(\widetilde{V}(\beta,c))}=\mathrm{Var}(V^{\pi}(x))+\text{$% \mathbb{E}$}_{\mu}\displaystyle\left\|\pi\vec{r}-\pi\vec{Q}(x)\right\|_{M_{\mu% }}^{2}+\text{$\mathbb{E}$}_{\mu}\displaystyle\left\|\vec{F}(x;\beta,c,\phi)-% \pi\vec{Q}(x)\right\|_{M_{\mu}}^{2}

where $M_{\mu}=\operatorname{diag}(\mu(a|x)^{-1})_{a\in\mathcal{A}}-J_{K}$ and $J_{K}\in\mathbb{R}^{K\times K}$ the matrix of ones.

The Proposition 2 tells minimizing the square of seminorm

\text{$\mathbb{E}$}_{\mu}{\left\|\vec{F}(x;\beta,c,\phi)-\pi\vec{Q}(x)\right\|% _{M_{\mu}}^{2}}

(2)

with respect to $\beta$ and $c$ is equivalent to minimizing the variance of the $\widetilde{V}(\beta,c)$ . Denote the minimizer of (2) by $(\beta_{\text{opt}},c_{\text{opt}})$ , the zero of its gradient

\begin{split}\text{$\mathbb{E}$}_{\mu}\bigl{[}f^{\top}(x;\beta,c,\phi)M_{\mu}(% \vec{F}(x;\beta,c,\phi)-\pi\vec{Q}(x))\bigr{]}=0.\end{split}

We can observe that the solution satisfies $c_{\text{opt}}=c(\beta_{\text{opt}})$ , so that the variance of $\widetilde{V}(\beta_{\text{opt}},c(\beta_{\text{opt}}))=\widetilde{V}(\beta_{% \text{opt}},c_{\text{opt}})$ is minimized at $(\beta_{\text{opt}},c_{\text{opt}})$ . Therefore by Proposition 1, the smallest asymptotic variance of DR OPE estimator $\widehat{V}^{\text{DR}}(\hat{\beta},\hat{\phi})$ under the correctly specified $\hat{\mu}$ , is achieved by $\hat{\beta}$ converging in probability to this $\beta_{\text{opt}}$ .The following estimating equation $S_{n}(\beta,c)$ is a tractable equation that jointly solves for $(\beta,c)$ , with its solution $\hat{\beta}$ consistent for $\beta_{\text{opt}}$ :

S_{n}(\beta,c)=\sum\limits_{i=1}^{n}f^{(i)\top}(\beta,c)M^{(i)}\bigl{(}F^{(i)}% (\beta,c)-\operatorname{diag}(\pi(a|x^{(i)}))_{a\in\mathcal{A}}~{}\vec{r}^{(i)% }\bigr{)}=0

(3)

for $M^{(i)}=\displaystyle\operatorname{diag}(\hat{\mu}(a|x^{(i)})^{-1})_{a\in% \mathcal{A}}-J_{K}$ and the pseudo reward $\vec{r}^{(i)}=\displaystyle\bigl{(}\frac{\Delta_{a}^{(i)}}{\hat{\mu}(a|x^{(i)}% )}r^{(i)}\bigr{)}_{a\in\mathcal{A}}$ . Plugging in the solution $\hat{\beta}$ of $S_{n}(\beta,c)=0$ as $\widehat{V}^{\text{DR}}(\hat{\beta},\hat{\phi})$ , we obtain our proposed estimator, DRUnknown. The $c$ serves as an auxiliary parameter that adjusts for the effect of estimating $\hat{\phi}$ , and does not have a direct role in the final estimator $\widehat{V}^{\text{DR}}(\hat{\beta},\hat{\phi})$ .

The proposed DRUnknown is doubly-robust within the statistical context, as it remains consistent if either the logging policy model $\hat{\mu}$ or the value function model $\widehat{Q}$ is correctly specified. We have addressed the scenario with the correct logging policy model above. Regarding the second scenario, the minimizer of (2) is $(\beta_{\text{opt}},c_{\text{opt}})=(\beta_{0},0)$ since $\Gamma(\beta_{0})=0$ , where $\beta_{0}$ is the true value of the parameter satisfying $Q(\cdot)=\widehat{Q}(\cdot;\beta_{0}).$ Consequently, the solution of (3) converges to $(\beta_{0},0)$ , establishing the consistency of DRUnknown. This observation can be summarized as the following proposition.

Proposition 3 (Doubly-Robustness).

The DRUnknown $\widehat{V}^{\text{DR}}(\hat{\beta},\hat{\phi})$ for contextual bandits is a doubly robust: it converges to $V^{\pi}$ in probability if either the logging policy model $\hat{\mu}$ or the value function model $\widehat{Q}$ is correctly specified.

4.1.1 Intuitive Understanding of DRUnknown for CB: extended function class

For an intuitive understanding of DRUnknown for contextual bandits, suppose that the target policy $\pi$ always have a nonzero value. Then, $F$ can be rewritten as $F(x,a;\beta,\phi)=\pi(a|x)\widetilde{Q}(x,a;\beta,c)$ for the extended function class

\widetilde{Q}(x,a;\beta,c)=\widehat{Q}(x,a;\beta,c)+c^{\top}\displaystyle\frac% {\dot{\hat{\mu}}(a|x;\phi)}{\pi(a|x)}.

The objective function (2) we aim to minimize in DRUnknown can then be expressed as

\text{$\mathbb{E}$}_{\mu}{\left\|\bigl{(}\pi(a|x)[\widetilde{Q}(x,a;\beta,c)-Q% (x,a)]\bigr{)}_{a\in\mathcal{A}}\right\|_{M_{\mu}}^{2}}.

The additional linear term $\displaystyle c^{\top}\frac{\dot{\hat{\mu}}(a|x;\phi)}{\pi(a|x)}$ from $\widetilde{Q}$ can be seen as the projection of $Q^{\pi}$ onto the inverse probability tangent space, the linear space spanned by the score function of $\hat{\phi}$ .

Hence, the proposed estimator can be seen as employing a linear parameter $c$ within the expanded function class $\widetilde{Q}$ to remove the effect of estimating $\hat{\phi}$ , while seeking the optimal $\hat{\beta}$ for $\widehat{Q}$ that minimizes the asymptotic variance.

4.2 DRUnknown for Reinforcement Learning

The proposed DRUnknown for RL is constructed similarly to the case of CB. However, its analysis is more complex, as the estimator is expressed as a weighted sum of terms observed from time step $0$ to $T-1$ . Below, we introduce the construction of DRUnknown for RL, commencing with the estimation of $\hat{\phi}$ .

The MLE $\hat{\phi}_{n}=\hat{\phi}$ of $\phi$ for RL is given by the solution of

U_{n}(\phi)=\sum\limits_{i=1}^{n}\sum\limits_{t=0}^{T-1}\sum\limits_{a\in% \mathcal{A}}\Delta_{a,t}^{(i)}\frac{\dot{\hat{\mu}}(a|x_{t}^{(i)};\phi)}{\hat{% \mu}(a|x_{t}^{(i)};\phi)}=0,

where $\Delta_{a,t}^{(i)}$ is the indicator variable for selected action at time step $t$ from trajectory $i$ . Substituting the MLE $\hat{\phi}$ into the DR estimator (1), the estimator is given by $\widehat{V}^{\text{DR}}(\beta,\hat{\phi})=\displaystyle\sum\limits_{t=0}^{T-1}% \gamma^{t}\widehat{V}_{t}(\beta,\hat{\phi}),$ with

\widehat{V}_{t}(\beta,\hat{\phi})=\frac{1}{n}\sum\limits_{i=1}^{n}\hat{\rho}_{% 0:{t-1}}^{(i)}\bigl{[}\hat{\rho}_{t}^{(i)}[r_{t}^{(i)}-\widehat{Q}^{(i)}(\beta% )]+\widehat{V}^{(i)}(\beta)\bigr{]}.

The estimator of the value at each time step $t$ , $\widehat{V}_{t}(\beta,\hat{\phi})$ , takes the same form of DRUnknown for CB, weighted by $\hat{\rho}^{(i)}_{0:{t-1}}$ . However, additional analysis is needed in the case of RL, as the $\widehat{V}_{t}$ for each $t$ are stochastically correlated. Now, as done in CB, we define the value regression function $F_{t}(x,a;\beta,c,\phi)$ for each $t$ , given by

F_{t}(x,a;\beta,c,\phi)=\rho_{0:t-1}\pi(a|x)\widehat{Q}(x,a;\beta)+\gamma^{-t}% c^{\top}\dot{\hat{\mu}}(a|x;\phi).

and we derive the influence function of the proposed estimator as stated in the following proposition.

Proposition 4 (Asymptotic Equivalence of $\widehat{V}$ for RL).

Consider an estimator $\hat{\beta}$ converging to some $\beta^{*}$ in probability. If the logging policy model $\hat{\mu}(\cdot;\phi)$ is correctly specified, then the proposed DR OPE estimator is asymptotically linear, with influence function $\eta=\displaystyle\sum\limits_{t=0}^{T-1}\gamma^{t}\eta_{t}$ :

\widehat{V}^{\text{DR}}(\hat{\beta},\hat{\phi})=\widetilde{V}(\beta^{*},c(% \beta^{*}))+o_{p}(n^{-1/2})=\frac{1}{n}\sum\limits_{i=1}^{n}\eta^{(i)}(\beta,c% (\beta^{*}))+o_{p}(n^{-1/2}),

(4)

where $c(\beta)$ is a vector that solely depends on $\beta$ , and

\eta_{t}^{(i)}(\beta,c)=\frac{1}{\mu^{(i)}_{t}}\bigl{[}\rho_{0:t-1}^{(i)}\pi^{% (i)}_{t}r_{t}^{(i)}-F_{t}^{(i)}(\beta,c,\phi)\bigr{]}+\sum\limits_{a\in% \mathcal{A}}F_{t}(x_{t}^{(i)},a;\beta,c,\phi).

Now, to derive the asymptotic variance of the proposed estimator for RL, we denote the vector $\vec{F}_{t}(x;\beta,c,\phi)=(F_{t}(x,a;\beta,c,\phi))_{a\in\mathcal{A}},$ $\pi\vec{Q}^{t}(x)=(\pi(a|x)Q^{t}(x,a))_{a\in\mathcal{A}}$ and the gradient matrix $f_{t}(x;\beta,c,\phi)=\displaystyle\frac{\partial\vec{F}_{t}}{\partial(\beta,c% )}.$

The asymptotic variance of the DR OPE estimator for RL can be expressed with these notations as in the following proposition.

Proposition 5 (Variance of $\widetilde{V}$ for RL).

The variance of $\widetilde{V}(\beta,c)$ is given by

n\mathrm{Var}{(\widetilde{V}(\beta,c))}=C_{T}+\sum\limits_{t=0}^{T-1}\gamma^{2% t}\text{$\mathbb{E}$}_{\mu}\left\|\vec{F}_{t}(x_{t};\beta,c,\phi)-\rho_{0:t-1}% \pi\vec{Q}^{t}(x_{t})\right\|^{2}_{M_{t,\mu}}

where $M_{t,\mu}=\operatorname{diag}(\mu(a|x_{t})^{-1})_{a\in\mathcal{A}}-J_{K}$ and

C_{T}=\sum\limits_{t=0}^{T-1}\gamma^{2t}\text{$\mathbb{E}$}_{\mu}\bigl{[}% \mathrm{Var}_{t}\left[\rho_{0:t-1}V^{t}(x_{t})\right]+\mathrm{Var}_{t+1}[\rho_% {0:t}r_{t}]\bigr{]},

with $\mathrm{Var}_{t}$ the variance conditioned on the history up to time step $t-1$ , $\{x_{0},a_{0},\dots,x_{t-1},a_{t-1}\}$ .

Similar to the case of CB, the proposition implies that minimizing the weighted sum of seminorms with respect to $\beta$ and $c$ is equivalent to minimizing the asymptotic variance of $\widehat{V}^{\text{DR}}$ . The solution $\hat{\beta}$ of the following estimating equation is consistent for $\beta_{\text{opt}}$ , minimizing the asymptotic variance of $\widehat{V}^{\text{DR}}$ by $\sigma^{2}(\beta^{*})=\mathrm{Var}_{\mu}{[\eta(\beta^{*},c(\beta^{*}))]}$ .

S_{n}(\beta,c)=\sum\limits_{i=1}^{n}\sum\limits_{t=0}^{T-1}\gamma^{2t}f_{t}^{(% i)\top}(\beta,c,\hat{\phi})M_{t}^{(i)}\bigl{(}\vec{F}_{t}^{(i)}(\beta,c)-\hat{% \rho}_{0:t-1}\operatorname{diag}(\pi_{t}^{(i)})_{a\in\mathcal{A}}~{}\vec{r}_{t% :T-1}^{(i)}\bigr{)}=0,

(5)

with the pseudo-reward $\displaystyle{\vec{r}^{(i)}_{t:T-1}}=\bigl{(}\frac{\Delta_{a,t}^{(i)}}{\hat{% \mu}(a|x_{t}^{(i)})}\bar{R}_{t:T-1}^{(i)}\bigr{)}_{a\in\mathcal{A}}$ for $\bar{R}_{t:T-1}^{(i)}=r_{t}^{(i)}+\displaystyle\sum\limits_{\tau=t+1}^{T-1}% \gamma^{\tau-t}\hat{\rho}_{t+1:\tau}^{(i)}r_{\tau}^{(i)},$ and $M_{t}^{(i)}=\displaystyle\operatorname{diag}(\hat{\mu}(a|x_{t}^{(i)})^{-1})_{a% \in\mathcal{A}}-J_{K}$ .

5 Theoretical Properties

We now present the theoretical properties of the proposed estimator, with a focus on its asymptotic distribution. Combining Proposition 1, 2, 4 and 5, the DRUnknown has the following asymptotic normality for both contextual bandits and reinforcement learning problems.

Theorem 6 (Asymptotic distribution).

When the logging policy model $\hat{\mu}$ is correctly specified, the proposed DRUnknown estimator is asymptotically normal, given by

\sqrt{n}(\widehat{V}^{\text{DR}}(\hat{\beta},\hat{\phi})-V^{\pi})\xrightarrow{% d}\mathcal{N}(0,\sigma^{2})

for $\sigma^{2}=\sigma^{2}(\beta^{*}).$

To calculate the confidence interval of $V^{\pi}$ , one needs to estimate the unknown variance $\sigma^{2}$ . As we know that

\begin{split}\sigma^{2}&=\sigma^{2}(\beta^{*},c(\beta^{*}),\phi)\\ &=\mathrm{Var}_{\mu}{[\widetilde{V}(\beta^{*},c(\beta^{*}))]}=n\mathrm{Var}_{% \mu}[\eta(\beta^{*},c(\beta^{*}))],\end{split}

we have $\hat{\sigma}^{2}(\beta,c(\beta^{*}),\phi)=\displaystyle\frac{1}{n}\sum\limits_% {i=1}^{n}(\eta^{(i)}(\hat{\beta},\hat{c},\hat{\phi})-\bar{\eta})^{2}$ is a consistent estimator for $\sigma^{2}$ , with $\bar{\eta}=\displaystyle\frac{1}{n}\sum\limits_{i=1}^{n}\eta^{(i)}(\hat{\beta}% ,\hat{c},\hat{\phi})$ the average of $\eta^{(i)}$ and $\hat{\beta},\hat{c}$ are the solutions of (5) and the MLE $\hat{\phi}$ . By the Slutzky’s theorem, we have

\hat{\sigma}^{2}=\hat{\sigma}^{2}(\hat{\beta},\hat{c},\hat{\phi})\xrightarrow{% p}\sigma^{2},

and combining the result with the theorem we have the $(1-\alpha)$ confidence interval as

\bigl{[}\widehat{V}^{\text{DR}}(\hat{\beta},\hat{\phi})-z_{\alpha/2}\frac{\hat% {\sigma}^{2}}{\sqrt{n}},\widehat{V}^{\text{DR}}(\hat{\beta},\hat{\phi})+z_{% \alpha/2}\frac{\hat{\sigma}^{2}}{\sqrt{n}}\bigr{]},

with $z_{\alpha/2}$ the $\alpha/2$ standard Gaussian quantile.

When the value function model $\widehat{Q}$ is also correctly specified, the proposed DRUnknown is asymptotically equivalent to the DR OPE estimator using the true logging policy and the value function, and is locally efficient.

Proposition 7 (Local Efficiency).

When both the logging policy model $\hat{\mu}$ and the value function model $\widehat{Q}$ are correctly specified, The asymptotic variance of the proposed estimator achieves the semiparametric lower bound and is asymptotically optimal.

Proposition 7 states that DRUnknown can be asymptotically optimal when the model $\widehat{Q}$ is properly chosen. Also, for any choice of value function model, whether it is correct or not, the proposed estimator is asymptotically more efficient compared to existing OPE algorithms, including IPW, DR, and MRDR, in the OPE problem with an unknown logging policy.

Proposition 8 (Intrinsic Efficiency).

The proposed DR OPE estimator has the smallest asymptotic variance among the class of DR OPE using MLE for $\phi$ , when the logging policy model $\hat{\mu}$ is correctly specified. The same holds for arbitrary consistent estimator of $\phi$ .

Corollary 9 (Comparison to Existing Algorithms).

The proposed estimator has at least smaller asymptotic variance than IPW, DR and MRDR utilizing the same estimator for $\phi$ , when the logging policy model $\hat{\mu}$ is correctly specified.

6 Experimental Results

In this section, we compare the performance of the four estimators: (i) IPW (Horvitz and Thompson, 1952) (ii) MLIPW (Xie et al., 2019) (iii) MRDR (Farajtabar et al., 2018) and (iv) the proposed DRUnknown on CB and RL problems. IPW requires knowledge of the true logging policy $\mu$ and cannot be applied in our experimental scenario. Therefore it serves as a baseline for the other three estimators, and we compare the relative MSE of each estimator.

6.1 Contextual Bandits

6.1.1 Simulation Data

We use the simulation environments described as follows. We generate $N=100$ test datasets with size $n$ , given by $D_{j}=\{(x_{i}^{j},a_{i}^{j},r_{i}^{j})\}_{i=1}^{n},~{}j\in[N]$ . The context $x_{i}^{j}$ for $i^{\text{th}}$ sample from $j^{\text{th}}$ dataset contains $d=5$ dimensional $K=10$ vectors randomly sampled from uniform distribution $U(-1/\sqrt{d},1/\sqrt{d})$ . The rewards are generated from the Gaussian distribution with mean $\exp(x_{i}^{j\top}\beta)$ and variance 1, where $\beta$ is a fixed coefficient also sampled from the uniform distribution $U(-1/\sqrt{d},1/\sqrt{d})$ .

The logging policy $\mu$ and target policy $\pi$ follows the linear logistic model with random coefficients $\phi_{\mu}$ and $\phi_{\pi}$ as $\mu(a|x)=\displaystyle\frac{\exp(x_{a}^{\top}\phi_{\mu})}{\sum\limits_{i=1}^{K% }\exp(x_{i}^{\top}\phi_{\mu})}$ and $\pi(a|x)=\displaystyle\frac{\exp(x_{a}^{\top}\phi_{\pi})}{\sum\limits_{i=1}^{K% }\exp(x_{i}^{\top}\phi_{\pi})}.$ The logging policy model $\hat{\mu}$ also follows the linear logistic model, while the value function model $\hat{Q}$ is defined by the linear regression model. Consequently, only the logging policy model is correctly specified.

The Table 1 and Figure 1 reports the relative MSE values of MLIPW, MRDR and the proposed DRUnknown, calculated by dividing their MSE by the MSE of IPW, for the different sizes $n$ of each dataset. The proposed estimator achieves the lowest relative MSE. Figure 2 displays boxplots representing the estimated values of four methods, each computed with $N=100$ repeats and a dataset size of $n=10000$ .

sample size	MLIPW	MRDR	DRUnknown
$5000$	0.8911	0.8126	0.8038
$6000$	0.8416	0.7957	0.7731
$7000$	0.8366	0.8043	0.7620
$8000$	0.8234	0.8073	0.7560
$9000$	0.8222	0.8200	0.7662
$10000$	0.7575	0.7369	0.6792

Table 1: The relative MSE of the estimators on synthetic dataset with respect to the

n

, the number of samples.

Refer to caption — Figure 1: The relative MSE of the estimators on synthetic dataset with respect to the $n$ , the number of samples.

6.1.2 UCI Machine Learning Repository Dataset

For the real data experiment, we transform six classification datasets of the UCI Machine Learning Repository into contextual bandit problems : glass, letter, zooimage, iris and handwritten (German, 1987; Slate, 1991; Forsyth, 1990; mis, 1990; Fisher, 1988; Alpaydin and Kaynak, 1998). Assigning the data to each class is considered as pulling an arm in the bandit. When the class is correct the reward is 1 and otherwise 0.

We construct the logging policy $\mu$ as follows: we train the policy $\mu_{0}$ using a linear logistic model on a separate dataset, and mix the policy with a random policy $\mu_{\text{random}}$ as $\mu=\alpha\mu_{0}+(1-\alpha)\mu_{\text{random}}$ , for $\alpha\in(0,1)$ . The target policy $\pi$ is constructed as in the simulation data experiment. The logging policy class is defined as $\{\hat{\mu}=\alpha\mu_{0}+(1-\alpha)\mu_{\text{random}}:\alpha\in(0,1)\}$ , where $\alpha$ serves as the parameter $\phi$ , and we use the constant value function model for $\widehat{Q}$ .

We generate $N=100$ datasets, each with a size of $n=10000$ , by randomly sampling contexts from the original dataset and selecting arms using the logging policy $\mu$ . Table 2 displays the relative MSE values of MLIPW, MRDR, and our proposed DRUnknown. Additionally, Figure 3 depicts the log-relative MSE of each estimator on the glass dataset as a function of the logging policy parameter $\alpha$ . The DRUnknown consistently demonstrates the smallest relative MSE across all six datasets.

dataset	MLIPW	MRDR	DRUnknown
glass	0.9548	0.8308	0.8125
letter	0.9677	0.9323	0.9265
zoo	0.9987	0.9842	0.9583
image	0.8250	0.8245	0.8062
iris	0.6116	0.5715	0.5454
handwritten	0.9521	0.9432	0.9407

Table 2: The relative MSE of the estimators on UCI datasets, with

\alpha=0.4,n=10000

6.2 Reinforcement Learning

In this section, we provide the experimental results of OPE in the context of RL. We conducted the experiments on the ModelWin and ModelFail domains introduced in Thomas and Brunskill (2016). Detailed descriptions of these environments are available in the Appendix. As done in Section 6.1.2, we utilize a mixture of the separately trained optimal policy and the uniform random policy with a rate $\alpha\in(0,1)$ as the logging policy $\mu$ . We employ a linear logistic policy for the target policy $\pi$ and a linear model as the value function model $\widehat{Q}$ .

Table 3 displays the relative MSE of each estimator in ModelWin with $T=20$ and ModelFail environments. The proposed DRUnknown attains the lowest relative MSE in most scenarios, if the number of sample $n$ is large enough. Figure 4 illustrates the cumulative distribution function (CDF) of squared error values (larger is better) on the ModelWin domain with $T=20$ and $n=20$ . The DRUnknown exhibits the highest CDF values, indicating that the estimated values are more concentrated around the true value with high probability.

6.3 Conclusion

In conclusion, our study has introduced DRUnknown, a novel doubly-robust off-policy evaluation estimator for contextual bandits and reinforcement learning problems where both the logging policy and the value function remain unknown. Through a two-step estimation process, DRUnknown first estimates the logging policy using maximum likelihood estimator and subsequently estimates the value function model, aiming to minimize the asymptotic variance of the estimator while considering the impact of the logging policy estimation.

When the logging policy model is correctly specified, DRUnknown achieves the smallest asymptotic variance within the class that encompasses existing OPE estimators. Furthermore, if the value function model is also correctly specified, DRUnknown attains optimality by reaching the semiparametric lower bound for asymptotic variance.

To show the effectiveness of DRUnknown, we conducted experiments in both contextual bandits and reinforcement learning settings. The empirical results demonstrate the superior performance of DRUnknown compared to existing methods.

7 Broader Impact

This paper addresses the off-policy evaluation problem, aiming to contribute to the field of Machine Learning. It includes theoretical contributions and numerical experimental results, with no apparent societal or ethical issues that require further discussion.

ModelWin	MLIPW	MRDR	DRUnknown
$n=10$	1.1604	1.9963	0.8916
$n=20$	1.0187	2.5577	0.8273
$n=30$	1.1064	1.7349	0.8529
$n=40$	1.0676	1.5270	0.9117

ModelFail	MLIPW	MRDR	DRUnknown
$n=10$	0.7794	0.7963	0.8476
$n=20$	0.5608	0.5591	0.2951
$n=30$	0.5532	0.5605	0.3154
$n=40$	0.5273	0.5285	0.1451

Table 3: The realtive MSE of the estimators on ModelWin and ModelFail environments.

References

Dudík et al. (2011) Miroslav Dudík, John Langford, and Lihong Li. Doubly robust policy evaluation and learning. In Proceedings of the 28th International Conference on International Conference on Machine Learning, pages 1097–1104, 2011.
Swaminathan et al. (2017) Adith Swaminathan, Akshay Krishnamurthy, Alekh Agarwal, Miro Dudik, John Langford, Damien Jose, and Imed Zitouni. Off-policy evaluation for slate recommendation. Advances in Neural Information Processing Systems, 30, 2017.
Precup (2000) Doina Precup. Eligibility traces for off-policy policy evaluation. Computer Science Department Faculty Publication Series, page 80, 2000.
Mahmood et al. (2014) A Rupam Mahmood, Hado P Van Hasselt, and Richard S Sutton. Weighted importance sampling for off-policy learning with linear function approximation. Advances in neural information processing systems, 27, 2014.
Jiang and Li (2016) Nan Jiang and Lihong Li. Doubly robust off-policy value evaluation for reinforcement learning. In International Conference on Machine Learning, pages 652–661. PMLR, 2016.
Thomas and Brunskill (2016) Philip Thomas and Emma Brunskill. Data-efficient off-policy policy evaluation for reinforcement learning. In International Conference on Machine Learning, pages 2139–2148. PMLR, 2016.
Wang et al. (2017) Yu-Xiang Wang, Alekh Agarwal, and Miroslav Dudık. Optimal and adaptive off-policy evaluation in contextual bandits. In International Conference on Machine Learning, pages 3589–3597. PMLR, 2017.
Farajtabar et al. (2018) Mehrdad Farajtabar, Yinlam Chow, and Mohammad Ghavamzadeh. More robust doubly robust off-policy evaluation. In International Conference on Machine Learning, pages 1447–1456. PMLR, 2018.
Su et al. (2020) Yi Su, Maria Dimakopoulou, Akshay Krishnamurthy, and Miroslav Dudík. Doubly robust off-policy evaluation with shrinkage. In International Conference on Machine Learning, pages 9167–9176. PMLR, 2020.
Cao et al. (2009) Weihua Cao, Anastasios A Tsiatis, and Marie Davidian. Improving efficiency and robustness of the doubly robust estimator for a population mean with incomplete data. Biometrika, 96(3):723–734, 2009.
Rubin and van der Laan (2008) Daniel B Rubin and Mark J van der Laan. Empirical efficiency maximization: improved locally efficient covariate adjustment in randomized experiments and survival analysis. The International Journal of Biostatistics, 4(1), 2008.
Li et al. (2015) Lihong Li, Rémi Munos, and Csaba Szepesvári. Toward minimax off-policy value estimation. In Artificial Intelligence and Statistics, pages 608–616. PMLR, 2015.
Raghu et al. (2018) Aniruddh Raghu, Omer Gottesman, Yao Liu, Matthieu Komorowski, Aldo Faisal, Finale Doshi-Velez, and Emma Brunskill. Behaviour policy estimation in off-policy policy evaluation: Calibration matters. arXiv preprint arXiv:1807.01066, 2018.
Xie et al. (2019) Yuan Xie, Boyi Liu, Qiang Liu, Zhaoran Wang, Yuan Zhou, and Jian Peng. Off-policy evaluation and learning from logged bandit feedback: Error reduction via surrogate policy. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019. URL https://openreview.net/forum?id=HklKui0ct7.
Hanna et al. (2021) Josiah P Hanna, Scott Niekum, and Peter Stone. Importance sampling in reinforcement learning with an estimated behavior policy. Machine Learning, 110(6):1267–1317, 2021.
Rothe (2016) Christoph Rothe. The value of knowing the propensity score for estimating average treatment effects. Available at SSRN 2797560, 2016.
Horvitz and Thompson (1952) Daniel G Horvitz and Donovan J Thompson. A generalization of sampling without replacement from a finite universe. Journal of the American statistical Association, 47(260):663–685, 1952.
Cassel et al. (1976) Claes M Cassel, Carl E Särndal, and Jan H Wretman. Some results on generalized difference estimation and generalized regression estimation for finite populations. Biometrika, 63(3):615–620, 1976.
Robins and Rotnitzky (1995) James M Robins and Andrea Rotnitzky. Semiparametric efficiency in multivariate regression models with missing data. Journal of the American Statistical Association, 90(429):122–129, 1995.
German (1987) B. German. Glass Identification. UCI Machine Learning Repository, 1987. DOI: https://doi.org/10.24432/C5WW2P.
Slate (1991) David Slate. Letter Recognition. UCI Machine Learning Repository, 1991. DOI: https://doi.org/10.24432/C5ZP40.
Forsyth (1990) Richard Forsyth. Zoo. UCI Machine Learning Repository, 1990. DOI: https://doi.org/10.24432/C5R59V.
mis (1990) Image Segmentation. UCI Machine Learning Repository, 1990. DOI: https://doi.org/10.24432/C5GP4N.
Fisher (1988) R. A. Fisher. Iris. UCI Machine Learning Repository, 1988. DOI: https://doi.org/10.24432/C56C76.
Alpaydin and Kaynak (1998) E. Alpaydin and C. Kaynak. Optical Recognition of Handwritten Digits. UCI Machine Learning Repository, 1998. DOI: https://doi.org/10.24432/C50P49.
Kallus and Uehara (2020) Nathan Kallus and Masatoshi Uehara. Double reinforcement learning for efficient and robust off-policy evaluation. In International Conference on Machine Learning, pages 5078–5088. PMLR, 2020.

Appendix A Missing Proofs

A.1 Proof of Proposition 1

Proof.

By the Taylor expansion,

\widehat{V}^{\text{DR}}(\hat{\beta},\hat{\phi})=V(\beta^{*})+\text{$\mathbb{E}% $}_{\mu}{[\frac{\partial\widehat{V}}{\partial\beta}]^{\top}}(\hat{\beta}-\beta% ^{*})+\text{$\mathbb{E}$}_{\mu}{[\frac{\partial\widehat{V}}{\partial\phi}]^{% \top}}(\hat{\phi}-\phi)+o_{p}(n^{-1/2}),

where

V(\beta)=\frac{1}{n}\sum\limits_{i=1}^{n}\Bigl{[}\frac{\pi^{(i)}}{\mu^{(i)}}(r% ^{(i)}-\widehat{Q}^{(i)}(\beta))+\widehat{V}^{(i)}(\beta)\Bigr{]}

and $\hat{\beta}\xrightarrow{p}\beta^{*}$ for some $\beta^{*}$ . As $\hat{\mu}$ is correctly specified, we have

\text{$\mathbb{E}$}_{\mu}{[\frac{\partial\widehat{V}}{\partial\beta}]}=\frac{% \partial}{\partial\beta}\text{$\mathbb{E}$}_{\mu}{[-\frac{\pi(a|x)}{\mu(a|x)}% \widehat{Q}(x,a;\beta^{*})+\widehat{V}(x;\beta^{*})]}=0.

For the following term,

-\text{$\mathbb{E}$}_{\mu}{[\frac{\partial\widehat{V}}{\partial\phi}]}=-\text{% $\mathbb{E}$}_{\mu}{[-\frac{\pi\dot{\hat{\mu}}}{\mu^{2}}(r-\widehat{Q}(x,a;% \beta^{*})]}=\text{$\mathbb{E}$}_{\pi}{[-\frac{\dot{\hat{\mu}}}{\mu}(r-% \widehat{Q}(x,a;\beta^{*})]}:=\Gamma(\beta).

For $\hat{\phi}-\phi$ , from $U_{n}(\hat{\phi})=0$ we have $0=U(\phi)+\text{$\mathbb{E}$}{[\dot{U}(\phi)]}(\hat{\phi}-\phi)+o_{p}(n^{-1/2}),$ where

\text{$\mathbb{E}$}{[\dot{U}(\phi)]}=\displaystyle\sum\limits_{a\in\mathcal{A}% }\text{$\mathbb{E}$}_{\mu}{[\frac{1}{\mu(a|x)^{2}}\dot{\hat{\mu}}(a|x;\phi)% \dot{\hat{\mu}}(a|x;\phi)^{\top}]}:=\Sigma_{\phi\phi}

and

\hat{\phi}-\phi=-\Sigma_{\phi\phi}^{-1}U(\phi)+o_{p}(n^{-1/2}).

Plugging in the results above and denoting $c(\beta)=\Sigma_{\phi\phi}^{-1}\Gamma(\beta)$ , we have

\begin{split}\widehat{V}^{\text{DR}}(\hat{\beta},\hat{\phi})&=V(\beta^{*})+% \Gamma(\beta^{*})^{\top}\Sigma_{\phi\phi}^{-1}U(\phi)+o_{p}(n^{-1/2})\\ &=V(\beta^{*})+c(\beta^{*})^{\top}U(\phi)+o_{p}(n^{-1/2})\\ &=\frac{1}{n}\sum\limits_{i=1}^{n}\eta^{(i)}(\beta^{*},c(\beta^{*}))+o_{p}(n^{% -1/2}),\end{split}

for $\eta^{(i)}(\beta,c)=\displaystyle\frac{1}{\mu(a^{(i)}|x^{(i)})}\bigl{[}\pi(a^{% (i)}|x^{(i)})r^{(i)}-F(x,a;\beta,c,\phi)\bigr{]}+\sum\limits_{a\in\mathcal{A}}% F(x^{(i)},a;\beta,c,\phi)$ , since $\sum\limits_{a\in\mathcal{A}}\dot{\hat{\mu}}(a|x;\phi)=0.$ ∎

A.2 Proof of Proposition 2

Proof.

By the law of total variance, we decompose the variance of $\widetilde{V}(\beta,c)$ with $n=1$ , given contexts $x$ and rewards $r$ :

\mathrm{Var}{(\widetilde{V}(\beta,c))}=\mathrm{Var}\text{$\mathbb{E}$}{(% \widetilde{V}(\beta,c))|x,r)}+\text{$\mathbb{E}$}\mathrm{Var}{(\widetilde{V}(% \beta,c))|x,r)}.

For the first term, we have $\text{$\mathbb{E}$}_{\mu}{(\widetilde{V}(\beta,c))|x,r)}=V^{\pi}(x),$ and for the second term,

\begin{split}&\mathrm{Var}{(\widetilde{V}(\beta,c))|x,r)}=\mathrm{Var}_{\mu}% \bigl{[}\frac{1}{\mu(a|x)}(\pi(a|x)r-F(x,a;\beta,c,\phi))\bigr{|}x,r]\\ &=\text{$\mathbb{E}$}_{\mu}\bigl{[}\frac{1}{\mu(a|x)^{2}}(\pi(a|x)r-F(x,a;% \beta,c,\phi))^{2}\bigr{|}x,r]-\text{$\mathbb{E}$}_{\mu}\bigl{[}\frac{1}{\mu(a% |x)}(\pi(a|x)r-F(x,a;\beta,c,\phi))\bigr{|}x,r]^{2}\\ &=\sum\limits_{a\in\mathcal{A}}\frac{1}{\mu(a|x)}(\pi(a|x)r-F(x,a;\beta,c,\phi% ))^{2}-[\sum\limits_{a\in\mathcal{A}}(\pi(a|x)r-F(x,a;\beta,c,\phi))]^{2}\\ &=\displaystyle\left\|\vec{F}(x;\beta,c,\phi)-\pi\vec{r}\right\|_{M_{\mu}}^{2}% \end{split}

and since $\text{$\mathbb{E}$}{[\pi\vec{Q}(x)-\pi\vec{r}]}=0$ , we have

\begin{split}n\mathrm{Var}{(\widetilde{V}(\beta,c))}&=\mathrm{Var}(V^{\pi}(x))% +\text{$\mathbb{E}$}_{\mu}\displaystyle\left\|\vec{F}(x;\beta,c,\phi)-\pi\vec{% r}\right\|_{M_{\mu}}^{2}\\ &=\mathrm{Var}(V^{\pi}(x))+\text{$\mathbb{E}$}_{\mu}\displaystyle\left\|\vec{F% }(x;\beta,c,\phi)-\pi\vec{Q}(x)\right\|_{M_{\mu}}^{2}+\text{$\mathbb{E}$}_{\mu% }\displaystyle\left\|\pi\vec{Q}(x)-\pi\vec{r}\right\|_{M_{\mu}}^{2}.\end{split}

∎

A.3 Proof of Proposition 4

Proof.

The proof is similar to that of Proposition 1 for contextual bandits, with $T=1$ .

For each $t\in[T]$ , applying the Taylor expansion, we have

\widehat{V}_{t}(\hat{\beta},\hat{\phi})=V_{t}(\beta^{*})+\text{$\mathbb{E}$}_{% \mu}{[\frac{\partial\widehat{V}_{t}}{\partial\beta}]^{\top}}(\hat{\beta}-\beta% ^{*})+\text{$\mathbb{E}$}_{\mu}{[\frac{\partial\widehat{V}_{t}}{\partial\phi}]% ^{\top}}(\hat{\phi}-\phi)+o_{p}(n^{-1/2}),

where

V_{t}(\beta)=\frac{1}{n}\sum\limits_{i=1}^{n}\rho^{(i)}_{0:{t-1}}\bigl{[}\rho^% {(i)}_{t}[r_{t}^{(i)}-\widehat{Q}^{(i)}_{t}(\beta)]+\widehat{V}^{(i)}_{t}(% \beta)\bigr{]}

and $\hat{\beta}\xrightarrow{p}\beta^{*}$ for some $\beta^{*}$ . As $\hat{\mu}$ is correctly specified, we have

\text{$\mathbb{E}$}_{\mu}{[\frac{\partial\widehat{V}_{t}}{\partial\beta}]}=% \frac{\partial}{\partial\beta}\displaystyle\text{$\mathbb{E}$}_{\mu}{[\rho_{0:% t-1}[-\frac{\pi(a_{t}|x_{t})}{\mu(a_{t}|x_{t})}\widehat{Q}(x_{t},a_{t};\beta^{% *})+\widehat{V}(x_{t};\beta^{*})]]}=0.

For the following term,

\begin{split}\text{$\mathbb{E}$}_{\mu}{[\frac{\partial\widehat{V}_{t}}{% \partial\phi}]}&=\displaystyle\text{$\mathbb{E}$}_{\mu}{\bigl{[}\frac{\partial% \rho_{0:t-1}}{\partial\phi}[\rho_{t}(r_{t}-\widehat{Q}_{t}(x_{t},a_{t};\beta^{% *}))+\widehat{V}_{t}(x_{t},\beta^{*})]+\rho_{0:t-1}\frac{\partial\rho_{t}}{% \partial\phi}(r_{t}-\widehat{Q}_{t}(x_{t},a_{t};\beta^{*}))\bigr{]}}\\ &=\text{$\mathbb{E}$}_{\mu}\bigl{[}\frac{\partial\rho_{0:t-1}}{\partial\phi}% \text{$\mathbb{E}$}_{\mu}[\rho_{t}(r_{t}-\widehat{Q}_{t}(x_{t},a_{t};\beta^{*}% ))+\widehat{V}_{t}(x_{t},\beta^{*})|\mathcal{H}_{t-1}]\bigr{]}\\ &+\text{$\mathbb{E}$}_{\mu}\bigl{[}\rho_{0:t-1}\frac{\partial\rho_{t}}{% \partial\phi}(r_{t}-\widehat{Q}_{t}(x_{t},a_{t};\beta^{*}))\bigr{]}\\ &=\text{$\mathbb{E}$}_{\mu}\rho_{0:t-1}\text{$\mathbb{E}$}_{\mu}\bigl{[}\frac{% \partial\rho_{t}}{\partial\phi}(r_{t}-\widehat{Q}_{t}(x_{t},a_{t};\beta^{*}))|% \mathcal{H}_{t-1}\bigr{]}\text{ (the first term equals to zero)}\\ &=-\text{$\mathbb{E}$}_{\mu}\rho_{0:t-1}\text{$\mathbb{E}$}_{\pi}\bigl{[}\frac% {\dot{\hat{\mu}}(a_{t}|x_{t};\phi)}{\mu(a_{t}|x_{t})}(Q^{t}(x_{t},a_{t})-% \widehat{Q}^{t}(x_{t},a_{t};\beta^{*}))\bigr{]}:=\Gamma_{t}(\beta^{*}).\end{split}

For $\hat{\phi}-\phi$ , from $U_{n}(\hat{\phi})=0$ we have $0=U(\phi)+\text{$\mathbb{E}$}{[\dot{U}(\phi)]}(\hat{\phi}-\phi)+o_{p}(n^{-1/2}),$ where

\displaystyle\sum\limits_{t=0}^{T-1}\sum\limits_{a\in\mathcal{A}}\text{$% \mathbb{E}$}_{\mu}{[\frac{1}{\mu(a|x_{t})^{2}}\dot{\hat{\mu}}(a|x_{t};\phi)% \dot{\hat{\mu}}(a|x_{t};\phi)^{\top}]}:=\Sigma_{\phi\phi}

and

\hat{\phi}-\phi=-\Sigma_{\phi\phi}^{-1}U(\phi)+o_{p}(n^{-1/2}).

\widehat{V}_{t}(\hat{\beta},\hat{\phi})=V_{t}(\beta^{*})+\Gamma_{t}(\beta^{*})% ^{\top}\Sigma_{\phi\phi}^{-1}U_{n}(\phi)+o_{p}(n^{-1/2}).

Plugging in the results above and denoting $c(\beta)=\displaystyle\sum\limits_{t=0}^{T-1}\gamma^{t}\Sigma_{\phi\phi}^{-1}% \Gamma_{t}(\beta)$ , we have

\begin{split}\widehat{V}^{\text{DR}}(\hat{\beta},\hat{\phi})=\sum\limits_{t=0}% ^{T-1}\gamma^{t}\widehat{V}_{t}(\hat{\beta},\hat{\phi})&=V(\beta^{*})+c(\beta^% {*})^{\top}U_{n}(\phi)+o_{p}(n^{-1/2})\\ &=\frac{1}{n}\sum\limits_{i=1}^{n}\eta^{(i)}(\beta^{*},c(\beta^{*}))+o_{p}(n^{% -1/2}),\end{split}

for $V(\beta)=\sum\limits_{t=0}^{T-1}\gamma^{t}V_{t}(\beta)$ and $\eta^{(i)}(\beta,c)=\displaystyle\sum\limits_{t=0}^{T-1}\gamma^{t}\eta^{(i)}_{% t}(\beta,c)$ with

\eta_{t}^{(i)}(\beta,c)=\frac{1}{\mu(a_{t}^{(i)}|x_{t}^{(i)})}\bigl{[}\rho_{0:% t-1}^{(i)}\pi(a_{t}^{(i)}|x_{t}^{(i)})r_{t}^{(i)}-F_{t}(x_{t}^{(i)},a_{t}^{(i)% };\beta,c,\phi)\bigr{]}+\sum\limits_{a\in\mathcal{A}}F_{t}(x_{t}^{(i)},a;\beta% ,c,\phi)

∎

A.4 Proof of Proposition 5

Proof.

Section 4.1 of Jiang and Li [2016] introduces an inductive definition for a DR OPE estimator with a known logging policy and a fixed value function model. The Theorem 1 from the same paper provides the variance of this DR OPE estimator, presented as follows:>

\begin{split}n\mathrm{Var}[\widehat{V}]=\sum\limits_{t=0}^{T-1}\gamma^{2t}% \bigl{[}\text{$\mathbb{E}$}_{\mu}\bigl{[}\mathrm{Var}_{t}\left[\rho_{0:t-1}V^{% t}(x_{t})\right]+\mathrm{Var}_{t+1}[\rho_{0:t}r_{t}]\bigr{]}+\text{$\mathbb{E}% $}_{\mu}\bigl{[}\mathrm{Var}_{t}[\rho_{0:t}(Q^{t}(x_{t},a_{t})-\widehat{Q}(x_{% t},a_{t}))|x_{t}]\bigr{]}\bigr{]},\end{split}

where $\mathrm{Var}_{t}$ refers to the conditional variance $\mathrm{Var}(\cdot|x_{0},a_{0},\dots,x_{t-1},a_{t-1})$ . The proof of the theorem still holds for a more general class of $\widehat{Q}$ , which takes the state-action trajectory $(x_{0},a_{0},\dots,x_{t},a_{t})$ and the time step $t$ as arguments. Therefore, following the same approach as the proof of Theorem 1, we obtain the following representation of our proposed DRUnknown for RL:

\begin{split}n\mathrm{Var}[\widehat{V}^{\text{DR}}]&=\sum\limits_{t=0}^{T-1}% \gamma^{2t}\bigl{[}\text{$\mathbb{E}$}_{\mu}\bigl{[}\mathrm{Var}_{t}\left[\rho% _{0:t-1}V^{t}(x_{t})\right]+\mathrm{Var}_{t+1}[\rho_{0:t}r_{t}]\bigr{]}\\ &+\text{$\mathbb{E}$}_{\mu}\bigl{[}\mathrm{Var}_{t}[\rho_{0:t}Q^{t}(x_{t},a_{t% })-\mu(a_{t}|x_{t})^{-1}F_{t}(x_{t},a_{t},\beta,c,\phi)|x_{t}]\bigr{]}\bigr{]}% .\end{split}

The first term does not depend on the parameters $\beta$ and $c$ , so we denote it as a constant

C_{T}=\sum\limits_{t=0}^{T-1}\gamma^{2t}\text{$\mathbb{E}$}\bigl{[}\mathrm{Var% }_{t}\left[\rho_{0:t-1}V^{t}(x_{t})\right]+\mathrm{Var}_{t+1}[\rho_{0:t}r_{t}]% \bigr{]}.

The variance in the second term is conditioned on $(x_{0},a_{0},\dots,x_{t-1},a_{t-1},x_{t})$ , and thus the randomness of this variable only incurs from the action $a_{t}$ at time step $t$ . Therefore, it can be calculated similarly to the proof of Proposition 2, and can be represented as a stochastic semi-norm as below, and we have the desired result.

\begin{split}&\mathrm{Var}_{t}[\rho_{0:t}Q^{t}(x_{t},a_{t})-\mu(a_{t}|x_{t})^{% -1}F_{t}(x_{t},a_{t},\beta,c,\phi)|x_{t}]\\ &=\mathrm{Var}_{t}[\sum\limits_{a\in\mathcal{A}}\Delta_{a}^{t}[\rho_{0:t-1}% \frac{\pi(a|x_{t})}{\mu(a|x_{t})}Q^{t}(x_{t},a)-\mu(a|x_{t})^{-1}F_{t}(x_{t},% \beta,c,\phi)|x_{t}]]\\ &=\left\|\vec{F}_{t}(x_{t};\beta,c,\phi)-\rho_{0:t-1}\pi\vec{Q}^{t}(x_{t})% \right\|^{2}_{M_{t,\mu}}\end{split}

∎

A.5 Proof of Proposition 7

Proof.

Theorem 3 of Kallus and Uehara [2020] states that in the OPE problem, the DR estimator with the true logging policy $\mu$ and value function $Q$ , given by

\widehat{V}^{\text{opt}}=\frac{1}{n}\sum\limits_{i=1}^{n}\sum\limits_{t=0}^{T-% 1}\gamma^{t}\rho_{0:{t-1}}^{(i)}\bigl{[}\rho_{t}^{(i)}[r_{t}^{(i)}-Q^{t}(x_{t}% ^{(i)},a_{t}^{(i)})]+V^{t}(x_{t}^{(i)})\bigr{]},

achieves the smallest asymptotic variance, reaching the semiparametric lower bound, for the case with a discount factor $\gamma=1$ . When the value function model is also correctly specified, the proposed DRUnknown is asymptotically equivalent to $\widehat{V}^{\text{opt}}$ , as the $\Gamma_{t}(\beta^{*})=0$ for all $t$ and $c(\beta^{*})=0$ . For the general problem with $\gamma<1$ , we can modify the MDP by fixing the discount factor to 1, incorporating the time step $t$ into the state variable $x$ , and changing the reward function $R(x,a)$ to $\gamma^{t}R(x,a)$ . ∎

A.6 Proof of Proposition 8

Proof.

The IPW Horvitz and Thompson [1952], DR [Dudík et al., 2011, Jiang and Li, 2016], and MRDR [Farajtabar et al., 2018] estimators are originally designed for the OPE problem with a known logging policy $\mu$ . Therefore, for comparison, we assume that the logging policy model is estimated by MLE, as same as in our proposed estimator. We can observe that all three estimators are asymptotically equivalent to $\widetilde{V}(\beta,c)$ for some value of $\beta$ and fixed $c=0$ . As all three estimators cannot take the estimation effect of $\hat{\phi}$ into account, they all have $c=0$ .

The IPW sets the value of $\beta$ such that the value function model $\widehat{Q}$ becomes zero, the DR estimator finds $\beta$ by minimizing the least-squares error for the value function. MRDR minimizes the variance of the estimator only with respect to $\beta$ , with $c$ fixed to zero. The class of estimators contains all these estimators, and the proposed DRUnknown achieves the smallest asymptotic variance among them, being at least more efficient. ∎

Appendix B Estimation of $\hat{\phi}$ with General Estimating Equation

The maximum likelihood estimator used to estimate $\hat{\phi}$ in this work is most efficient in many situations. However, the theoretical results in this paper are applicable to other estimating equations as given below,

U_{n}(\phi)=\sum\limits_{i=1}^{n}\sum\limits_{t=0}^{T-1}\sum\limits_{a\in% \mathcal{A}}(\Delta_{a,t}^{(i)}-\mu(a|x_{t}^{(i)}))h(x_{t}^{(i)},a;\phi)=0,

for any smooth function $h$ . The equation $U_{n}(\phi)=0$ is an unbiased estimating equation as long as the logging policy model is correctly specified. A possible choice is to assign more weight to the state-action pair with high probability in $\pi$ . This estimator remains consistent, satisfies the theoretical properties, and may be particularly useful for scenarios with a small-sized finite sample.

Appendix C Descriptions on Experimental Settings

C.1 Simulation Data

To build the synthetic dataset for the simulation experiment, we generate elements for the context vectors $x$ and the coefficient vector $\beta$ from the uniform distribution $U(-1/\sqrt{d},1/\sqrt{d})$ . The reward mean is determined by the nonlinear function $\exp(x^{t}\beta)$ , making the linear value function model incorrectly specified.

For the logging policy $\mu$ and the target policy $\pi$ , we generate the coefficients $\phi_{\mu}$ and $\phi_{\pi}$ from the uniform distribution $U(-1/\sqrt{d},1/\sqrt{d})$ and $U(-2/\sqrt{d},2/\sqrt{d})$ , respectively.

C.2 UCI Dataset

The six datasets used in these experiments were initially designed for the classification problem. To transform the problem into a bandit setting, we interpret the assignment of the class label as the selection of an arm in a bandit, with a reward of 1 if the class is correct and 0 if incorrect. We train the classifier $\mu_{0}$ , which returns the probability of each class label given a context vector $x$ . We treat $\mu_{0}$ as a policy and combine it with a uniform random policy with rates $\alpha$ and $1-\alpha$ . The value function model is constant and can be regarded as an intercept value.

C.3 ModelWin and ModelFail

C.3.1 ModelWin

ModelWin consists of three states, starting from state $s_{1}$ . When choosing action $a_{1}$ , the agent moves to $s_{2}$ with a probability of 0.6 and to $s_{3}$ with a probability of 0.4. Conversely, action $a_{2}$ leads to $s_{2}$ with a probability of 0.4 and to $s_{3}$ with a probability of 0.6. In states $s_{2}$ and $s_{3}$ , both actions return the agent to $s_{1}$ with a probability of 1. If the agent visits $s_{2}$ or $s_{3}$ , it receives rewards of 1 and -1, respectively. The horizon is fixed at $T=20$ .

C.3.2 ModelFail

ModelFail is an MDP with 4 states, but the learner cannot observe the current state of the agent. Starting from $s_{1}$ , action $a_{1}$ leads to the upper middle state $s_{2}$ , and $a_{2}$ to the lower middle state $s_{3}$ . From both, any action moves to the terminal state $s_{4}$ . If the transition is from the upper state, a reward of 1 is received; otherwise, a reward of -1. The horizon is always $T=2$ .

For both environments, the target policy at $s_{1}$ selects action $a_{1}$ with a probability of 0.7 and $a_{2}$ with a probability of 0.3. The logging policy chooses action $a_{1}$ with a probability of 0.75 and $a_{2}$ with a probability of 0.25. The parameter of the logging policy model $\hat{\mu}$ for both environments is the probability of choosing $a_{1}$ , and the value function model $\widehat{Q}$ is given by the linear model with intercepts.

Appendix D Limiations

This paper primarily focuses on the asymptotic properties of the proposed estimator for the OPE problem with an unknown logging policy. We do not extensively explore the estimator’s behavior with a finite sample. This aligns with similar statistical works, such as Cao et al. [2009], which addresses estimated missing mechanisms without providing finite-sample theory. Additionally, current DR OPE methods do not address scenarios with an unknown logging policy.

Doubly-Robust Off-Policy Evaluation with Estimated Logging Policy

Abstract

1 Introduction

2 Problem Setup

2.1 Markov Decision Processes

2.2 Off-Policy Evaluation Problem

Assumption 1 (Absolute Continuity).

Assumption 2 (Square Integrable).

3 Previous Methods for Off-Policy Evaluation

3.1 Standard Methods for Off-Policy Evaluation

3.2 Doubly-Robust Off-Policy Evaluation with Unknown Logging Policy

4 Proposed Method: DRUnknown

4.1 DRUnknown for Contextual Bandits

Proposition 1 (Asymptotic Equivalence of V^^𝑉\widehat{V}over^ start_ARG italic_V end_ARG for CB).

Proposition 2 (Variance of V~DRsuperscript~𝑉DR\widetilde{V}^{\text{DR}}over~ start_ARG italic_V end_ARG start_POSTSUPERSCRIPT DR end_POSTSUPERSCRIPT).

Proposition 3 (Doubly-Robustness).

4.1.1 Intuitive Understanding of DRUnknown for CB: extended function class

4.2 DRUnknown for Reinforcement Learning

Proposition 4 (Asymptotic Equivalence of V^^𝑉\widehat{V}over^ start_ARG italic_V end_ARG for RL).

Proposition 5 (Variance of V~~𝑉\widetilde{V}over~ start_ARG italic_V end_ARG for RL).

5 Theoretical Properties

Theorem 6 (Asymptotic distribution).

Proposition 7 (Local Efficiency).

Proposition 8 (Intrinsic Efficiency).

Corollary 9 (Comparison to Existing Algorithms).

6 Experimental Results

6.1 Contextual Bandits

6.1.1 Simulation Data

6.1.2 UCI Machine Learning Repository Dataset

6.2 Reinforcement Learning

6.3 Conclusion

7 Broader Impact

References

Appendix A Missing Proofs

A.1 Proof of Proposition 1

Proof.

A.2 Proof of Proposition 2

Proof.

A.3 Proof of Proposition 4

Proof.

A.4 Proof of Proposition 5

Proof.

A.5 Proof of Proposition 7

Proof.

A.6 Proof of Proposition 8

Proof.

Appendix B Estimation of ϕ^^italic-ϕ\hat{\phi}over^ start_ARG italic_ϕ end_ARG with General Estimating Equation

Appendix C Descriptions on Experimental Settings

C.1 Simulation Data

C.2 UCI Dataset

C.3 ModelWin and ModelFail

C.3.1 ModelWin

C.3.2 ModelFail

Appendix D Limiations

Proposition 1 (Asymptotic Equivalence of $\widehat{V}$ for CB).

Proposition 2 (Variance of $\widetilde{V}^{\text{DR}}$ ).

Proposition 4 (Asymptotic Equivalence of $\widehat{V}$ for RL).

Proposition 5 (Variance of $\widetilde{V}$ for RL).

Appendix B Estimation of $\hat{\phi}$ with General Estimating Equation