Doubly-Robust Off-Policy Evaluation with Estimated Logging Policy
Abstract
We introduce a novel doubly-robust (DR) off-policy evaluation (OPE) estimator for Markov decision processes, DRUnknown, designed for situations where both the logging policy and the value function are unknown. The proposed estimator first estimates the logging policy and then estimates the value function model by minimizing the asymptotic variance of the estimator, accounting for the effect of estimating the logging policy. When the logging policy model is correctly specified, DRUnknown achieves the smallest asymptotic variance within the class containing existing OPE estimators. When the value function model is also correctly specified, DRUnknown is optimal, as its asymptotic variance reaches the semiparametric lower bound. We present experimental results in contextual bandits and reinforcement learning that compare the performance of DRUnknown with that of existing methods.
1 Introduction
In various decision-making problems, estimating the value of a policy, that is, its expected reward, is a crucial question. Online evaluation, which requires deploying each policy to measure its value, can be expensive and may not be feasible for multiple target policies.
Alternatively, off-policy evaluation (OPE) refers to a technique that estimates the value of a target policy by utilizing log data generated from a different logging policy. This approach has attracted considerable interest in the domains of contextual bandits (CB) (Dudík et al., 2011; Swaminathan et al., 2017) and reinforcement learning (RL) (Precup, 2000; Mahmood et al., 2014; Jiang and Li, 2016).
Several off-policy evaluation algorithms (Dudík et al., 2011; Thomas and Brunskill, 2016; Wang et al., 2017; Farajtabar et al., 2018; Su et al., 2020) currently in use rely on having complete knowledge of the logging policy in order to utilize inverse probability weighting (IPW). However, in situations where information about the data logger is not available, such as when using human decision data, it becomes necessary to estimate the logging policy. Applying existing estimators directly to cases where the logging policy is unknown can compromise the desired properties of these methods, as they fail to take into account the impact of logging policy estimation.
In this paper, we present a novel efficient estimator called DRUnknown, which simultaneously estimates both the logging policy model and the value function model. Our proposed approach effectively captures the interdependency between the two models without the need for external data.
In the field of conventional statistics, Cao et al. (2009) employ the influence function to estimate parameters in the doubly-robust (DR) estimator for population means when there are missing observations. Their study focuses on situations where data are incomplete and the missingness mechanism is unknown and must be estimated. Building on this concept, we use the influence function approach to address the OPE problem. Here, we treat the rewards of unselected actions as missing values and estimate the unknown logging policy. This enables us to derive the asymptotic variance of the OPE estimator and estimate the parameters that minimize it.
The proposed OPE estimator estimates both the logging policy model and the value function model, and is referred to as doubly-robust because it is consistent if either the model for the logging policy or the model for the value function is correctly specified. When the logging policy is correctly specified, the proposed estimator is the most efficient among the class of DR OPE estimators with an estimated logging policy, and is at least as efficient as existing methods. Moreover, if the value function model is also correctly specified, the estimator is locally efficient: it achieves the semiparametric lower bound for asymptotic variance and is asymptotically optimal. To demonstrate the effectiveness of the proposed estimator, we conduct simulation experiments comparing its performance with previous methods on contextual bandit and reinforcement learning problems.
The main contributions of this paper are as follows:
- We present a new DR OPE estimator for Markov decision processes, for the case where both the logging policy and the value function are unknown. The proposed estimator is consistent when either the logging policy model or the value function model is correctly specified.
- We propose a method to estimate the logging policy model parameter and the value function model parameter simultaneously. We use the maximum likelihood estimator (MLE) for the logging policy parameter and estimate the value function parameter by minimizing the asymptotic variance, accounting for the impact of estimating the logging policy parameter.
- The proposed estimator has the smallest asymptotic variance among estimators using the MLE for the logging policy parameter when the logging policy model is correctly specified. When the value function model is also correctly specified, the proposed estimator is optimal, as its asymptotic variance reaches the semiparametric lower bound.
- Simulations on contextual bandit and reinforcement learning problems show that the proposed estimator consistently attains smaller mean squared errors than the benchmark methods.
2 Problem Setup
In this paper, we model the decision-making problem as a reinforcement learning framework in which the learner’s interaction with the system is represented as a Markov Decision Process (MDP). In this section, we provide definitions of MDP and the corresponding off-policy evaluation problem for our study.
2.1 Markov Decision Processes
An MDP is defined as a tuple (, , ), where and represent the state and the finite action spaces with , denotes the distribution of the random variable of the bounded immediate reward of taking action in state , is the transition probability distribution. is the initial state distribution, and is the discount factor. The learner utilizes a policy , a stochastic mapping from states to actions. Here, represents the probability of taking action in state .
Let = represent a -step trajectory containing state-action-rewards generated by policy . Denote as the sum of discounted return of from time step to . Our goal is to estimate the policy value, the expected return of trajectories generated by policy , i.e., . Here denotes the expectation over the trajectory sampled from . We also define the value function of a policy at state and at state-action pair from time step by and , respectively.
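As an illustration of the discounted return defined above, the following minimal Python sketch sums the rewards of one trajectory with geometric discounting. The function name `discounted_return` and the convention that the discount exponent is counted from the starting time step are assumptions made for this example; the exact indexing used in the paper's notation may differ.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float, start: int = 0) -> float:
    """Sum of discounted rewards from time step `start` to the end of the trajectory.

    Assumption: the discount exponent is counted from `start`, so the reward
    observed at `start` receives weight 1.
    """
    total = 0.0
    for k, r in enumerate(rewards[start:]):
        total += (gamma ** k) * r
    return total
```

The policy value is then the expectation of this quantity over trajectories generated by the policy, which an on-policy Monte Carlo estimate would approximate by averaging over sampled trajectories.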
Throughout this paper, we fix the trajectory length as . We also omit the from and to simplify them as and , respectively, since we are only interested in the value of . Note that, unlike the case with , where the and value functions remain unchanged, the value functions and can vary for each time step in general.
The contextual bandit (CB) is a special case of the MDP in which the trajectory length equals , and there are no transition dynamics. In CB, the state is called the context, and the action is called the arm.
2.2 Off-Policy Evaluation Problem
In an OPE problem, we are given a dataset comprising independent -step trajectories drawn from a logging policy . We use the superscript to refer to the data from the trajectory, denoted as . The objective is to estimate the policy value of a separate target policy . We define the importance ratio , and the cumulative importance ratio from time step to between and as . In cases where , we define to be 1. Similarly, we denote the estimated importance ratio as , where the probability of the logging policy is replaced with its estimated version for the logging policy model .
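The cumulative importance ratio described above is a running product of per-step ratios between the target and logging policies. The sketch below, with the hypothetical helper name `cumulative_importance_ratios`, computes it for one trajectory; passing probabilities from an estimated logging policy model instead of the true one yields the estimated ratios.

```python
import numpy as np


def cumulative_importance_ratios(target_probs, logging_probs):
    """Return the cumulative ratio for every step of one trajectory.

    target_probs[t]  : probability the target policy assigns to the logged action at step t
    logging_probs[t] : probability the (true or estimated) logging policy assigns to it
    """
    target_probs = np.asarray(target_probs, dtype=float)
    logging_probs = np.asarray(logging_probs, dtype=float)
    if np.any(logging_probs <= 0.0):
        # A logged action must have positive probability under the logging policy.
        raise ValueError("logging probabilities of logged actions must be positive")
    step_ratios = target_probs / logging_probs
    return np.cumprod(step_ratios)  # entry t is the product of ratios up to step t
```

The empty product (the ratio before the first step) is 1 by convention and is not returned by this helper.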
An estimator of the policy value is a point estimator, and there are various metrics to assess its quality. One commonly used metric is the mean squared error (MSE), defined as with the expectation taken with respect to the data sampled from . MSE quantifies the performance of a given estimator in finite samples. Another measure is the asymptotic variance, which assesses the efficiency of an estimator with a large number of samples. The estimator is called asymptotically linear if there exists a function of trajectory such that
with and . The quantity is denoted as the asymptotic variance of , and the function is referred to as the influence function. In our study, we derive the specific form of the influence function for OPE estimators and conduct an analysis to minimize the asymptotic variance.
For our analysis, we use the following standard regularity assumptions for the OPE problem.
Assumption 1 (Absolute Continuity).
For all state-action pairs , if then .
Assumption 2 (Square Integrable).
for all .
3 Previous Methods for Off-Policy Evaluation
The doubly-robust type estimator for the OPE problem was first proposed by Dudík et al. (2011). The estimator depends on a regression model for value function that is estimated using a separate independent dataset, which may not always be available.
The works of Thomas and Brunskill (2016) and Wang et al. (2017) introduce estimators that improve efficiency by modifying the IPW component of the DR OPE estimator. These methods still require a value function estimated independently from external data. Farajtabar et al. (2018) employ the idea of Rubin and van der Laan (2008), which estimates the value function model by minimizing the estimation variance without an external dataset. All these methods assume that the true logging policy is known and therefore do not estimate it.
Several studies (Li et al., 2015; Raghu et al., 2018; Xie et al., 2019; Hanna et al., 2021) have proposed IPW estimators for OPE that use an estimated logging policy and have examined the theoretical and numerical properties of these estimators. However, these methods do not consider the value function, so they are not doubly-robust and are suboptimal in terms of asymptotic variance.
3.1 Standard Methods for Off-Policy Evaluation
There are three standard approaches to estimating in the OPE problem, and we provide a brief overview of these methods.
The first is the direct method (DM) (Dudík et al., 2011; Rothe, 2016), which directly approximates the state-action value function of the target policy. The DM does not use information from the logging policy and relies heavily on the accuracy of the value function prediction. While DM estimators often exhibit low variance, they are not guaranteed to be unbiased.
The second is the inverse probability weighting (IPW) estimator (Horvitz and Thompson, 1952), which uses the importance ratio to correct the discrepancy between the target and logging policies. The IPW estimator is unbiased when the logging policy is known, but it can exhibit a large variance when the logging and target policies differ significantly and the importance ratio takes large values. Assumption 1 is required for the IPW estimator to be well-defined: one cannot learn about a pair never explored by the logging policy.
The third is the doubly-robust (DR) estimator (Cassel et al., 1976; Robins and Rotnitzky, 1995; Dudík et al., 2011; Jiang and Li, 2016), which combines the DM and IPW, leveraging their respective strengths to obtain favorable characteristics. The IPW estimator is a special case of DR with the value function fixed to zero. Our work focuses on DR-type estimators, as our main objective is to introduce an efficient method for simultaneously estimating the logging policy and the value function model.
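To make the relationship between the three approaches concrete, the following sketch computes DM, IPW, and DR estimates for a logged contextual bandit dataset with a known logging policy. The function name `cb_estimators` and the array layout are assumptions for illustration; this is the textbook form of the three estimators, not the DRUnknown estimator.

```python
import numpy as np


def cb_estimators(actions, rewards, logging_probs, target_policy, q_hat):
    """Direct method, IPW, and DR value estimates for a contextual bandit log.

    actions       : (n,) int array of logged arms
    rewards       : (n,) float array of observed rewards
    logging_probs : (n,) probability the logging policy gave to each logged arm
    target_policy : (n, K) target-policy probabilities for every arm
    q_hat         : (n, K) predicted reward for every arm (value function model)
    """
    actions = np.asarray(actions)
    rewards = np.asarray(rewards, dtype=float)
    logging_probs = np.asarray(logging_probs, dtype=float)
    target_policy = np.asarray(target_policy, dtype=float)
    q_hat = np.asarray(q_hat, dtype=float)
    idx = np.arange(len(actions))
    rho = target_policy[idx, actions] / logging_probs   # importance ratio
    dm_terms = np.sum(target_policy * q_hat, axis=1)    # model value per context
    dm = dm_terms.mean()
    ipw = np.mean(rho * rewards)
    # DR: model prediction plus importance-weighted residual correction.
    dr = np.mean(dm_terms + rho * (rewards - q_hat[idx, actions]))
    return dm, ipw, dr
```

Setting `q_hat` to zero makes the DR estimate coincide with the IPW estimate, which is the special case noted above.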
3.2 Doubly-Robust Off-Policy Evaluation with Unknown Logging Policy
The standard DR OPE estimator with known logging policy is given by
(1) |
where is a value function model of the state-action value function parameterized by , and
Here, depends solely on the state-action pair. However, as mentioned in Section 2.1, the may vary for different time steps, even if the state-action pair remains the same when the horizon length is finite. Hence, we cannot ensure that the model can express the for all in the OPE problem with a finite . To address this issue, we utilize the function class , which incorporates as an input parameter in practice. Also, to ensure that the class of DR OPE estimator contains the IPW estimator, we assume that the class of contains the constant functions. This can be easily satisfied by bringing the parameters to extend the function class by .
The DR OPE estimator (1) utilizes the importance ratio using the true probability of the logging policy. When the logging policy is unknown we use the logging policy model to approximate the unknown . We say that the model is correctly specified if there exists a value of such that for all and . The correct specification of is necessary for the consistent estimation of the policy value , as correctly specifying the value function model is difficult in many OPE problems.
In the following sections, our focus will be on the DR OPE estimator, which estimates both the logging policy parameter and the value function parameter . Our goal is to build an efficient OPE estimator that exhibits lower asymptotic variance compared to existing methods.
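For readers who prefer code, the sketch below implements a commonly used step-wise form of the DR estimator for finite-horizon trajectories, in which the value function model takes the time step as an input. It is a minimal illustration under stated assumptions: the function name `stepwise_dr`, the callable interfaces, and the time indexing are hypothetical and may not match equation (1) term by term. Supplying an estimated logging policy in place of the true one gives the plug-in version discussed in this section.

```python
import numpy as np


def stepwise_dr(trajectories, gamma, target_policy, logging_policy, q_model):
    """Step-wise doubly-robust value estimate averaged over trajectories.

    trajectories   : list of trajectories, each a list of (state, action, reward)
    target_policy  : function (t, state) -> array of action probabilities
    logging_policy : function (t, state) -> array of action probabilities
    q_model        : function (t, state) -> array of predicted Q-values per action
    """
    estimates = []
    for traj in trajectories:
        rho_prev = 1.0  # cumulative ratio before the first step (empty product)
        total = 0.0
        for t, (s, a, r) in enumerate(traj):
            pi_e = np.asarray(target_policy(t, s), dtype=float)
            pi_b = np.asarray(logging_policy(t, s), dtype=float)
            q = np.asarray(q_model(t, s), dtype=float)
            rho = rho_prev * pi_e[a] / pi_b[a]
            v = float(pi_e @ q)  # model value of the state under the target policy
            # Importance-weighted residual plus the model-based baseline.
            total += gamma ** t * (rho * (r - q[a]) + rho_prev * v)
            rho_prev = rho
        estimates.append(total)
    return float(np.mean(estimates))
```

With a value function model that is identically zero, this reduces to the step-wise IPW estimator.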
4 Proposed Method: DRUnknown
In this section, we present our class of DRUnknown estimators. The central concept of DRUnknown is to first estimate the unknown logging policy parameter for the IPW component of the DR estimator, and then learn the value function parameter to minimize the asymptotic variance of the estimator. We utilize the influence function of the estimator to derive a feasible objective function to minimize the asymptotic variance of DRUnknown.
4.1 DRUnknown for Contextual Bandits
We first present DRUnknown for the contextual bandit problem with . Consider a logging policy model parameterized by . Let represent the indicator for an action chosen in trajectory , and denote the partial derivative of with respect to as . The maximum likelihood estimator (MLE) is the solution of the estimating equation
If the logging policy model is correctly specified, the estimating equation is unbiased, and is consistent for .
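As a concrete instance of solving the score equation, the sketch below fits a multinomial logistic (softmax) logging policy by gradient ascent on the log-likelihood and returns the mean score at the solution, which should be close to zero. The softmax parameterization, the function names, and the optimizer settings are assumptions for this example; the paper only requires some parametric logging policy model.

```python
import numpy as np


def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # stabilise the exponentials
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def fit_logging_policy(contexts, actions, n_actions, lr=0.5, n_iter=2000):
    """Gradient ascent on the multinomial-logistic log-likelihood.

    Assumed model: pi(a | x) = softmax(x @ W)[a], with W of shape (d, K).
    Returns the fitted W and the mean score evaluated at W.
    """
    n, d = contexts.shape
    W = np.zeros((d, n_actions))
    onehot = np.eye(n_actions)[actions]  # indicator of the chosen action
    for _ in range(n_iter):
        probs = softmax(contexts @ W)
        score = contexts.T @ (onehot - probs) / n  # gradient of the mean log-likelihood
        W += lr * score
    probs = softmax(contexts @ W)
    score = contexts.T @ (onehot - probs) / n
    return W, score
```

The estimated probabilities `softmax(contexts @ W)` can then replace the unknown logging probabilities in the importance ratio.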
Plugging in the estimated value of , we obtain the following class of DR OPE estimator
where the notations and denote the , , and respectively. Henceforth in the paper, if there is no ambiguity, we simplify terms containing and by using only the superscript .
The bias and variance of the estimator do not have simple expressions in general, even for a fixed . The estimator is not a mean of independent and identically distributed (i.i.d.) variables, because of the estimated in the denominator. Consequently, estimating the regression parameter to minimize the MSE of the proposed estimator poses challenges. Hence, we instead aim to find minimizing the asymptotic variance of the estimator.
To attain this, we derive the influence function asymptotically equivalent to , through a Taylor expansion with respect to and . Denote the new regression function , the product of the target policy and the with an additional estimating effect term of , given by
for and The influence function of with arbitrary estimator can be formulated employing , as described in the following proposition.
Proposition 1 (Asymptotic Equivalence of for CB).
Consider an estimator converging to some in probability. If the model is correctly specified, the DR OPE estimator is asymptotically linear with influence function :
where is a vector that solely depends on , and
As the value of the vector is unknown, we estimate both and that minimize the variance of . Denote the vector and , and the gradient matrix By the law of total variance, the variance of can be expressed as the square of a stochastic seminorm plus a constant independent of and .
Proposition 2 (Variance of ).
The variance of is given by
where and the matrix of ones.
Proposition 2 tells us that minimizing the squared seminorm
(2) |
with respect to and is equivalent to minimizing the variance of the . Denote the minimizer of (2) by , the zero of its gradient
We can observe that the solution satisfies , so that the variance of is minimized at . Therefore, by Proposition 1, the smallest asymptotic variance of the DR OPE estimator under the correctly specified is achieved by converging in probability to this . The following tractable estimating equation jointly solves for , with its solution consistent for :
(3) |
for and the pseudo reward . Plugging in the solution of as , we obtain our proposed estimator, DRUnknown. The serves as an auxiliary parameter that adjusts for the effect of estimating , and does not have a direct role in the final estimator .
The proposed DRUnknown is doubly-robust within the statistical context, as it remains consistent if either the logging policy model or the value function model is correctly specified. We have addressed the scenario with the correct logging policy model above. Regarding the second scenario, the minimizer of (2) is since , where is the true value of the parameter satisfying Consequently, the solution of (3) converges to , establishing the consistency of DRUnknown. This observation can be summarized as the following proposition.
Proposition 3 (Doubly-Robustness).
DRUnknown for contextual bandits is doubly robust: it converges to in probability if either the logging policy model or the value function model is correctly specified.
4.1.1 Intuitive Understanding of DRUnknown for CB: extended function class
For an intuitive understanding of DRUnknown for contextual bandits, suppose that the target policy always has a nonzero value. Then, can be rewritten as for the extended function class
The objective function (2) we aim to minimize in DRUnknown can then be expressed as
The additional linear term from can be seen as the projection of onto the inverse probability tangent space, the linear space spanned by the score function of .
Hence, the proposed estimator can be seen as employing a linear parameter within the expanded function class to remove the effect of estimating , while seeking the optimal for that minimizes the asymptotic variance.
4.2 DRUnknown for Reinforcement Learning
The proposed DRUnknown for RL is constructed similarly to the case of CB. However, its analysis is more complex, as the estimator is expressed as a weighted sum of terms observed from time step to . Below, we introduce the construction of DRUnknown for RL, commencing with the estimation of .
The MLE of for RL is given by the solution of
where is the indicator variable for selected action at time step from trajectory . Substituting the MLE into the DR estimator (1), the estimator is given by with
The estimator of the value at each time step , , takes the same form of DRUnknown for CB, weighted by . However, additional analysis is needed in the case of RL, as the for each are stochastically correlated. Now, as done in CB, we define the value regression function for each , given by
and we derive the influence function of the proposed estimator as stated in the following proposition.
Proposition 4 (Asymptotic Equivalence of for RL).
Consider an estimator converging to some in probability. If the logging policy model is correctly specified, then the proposed DR OPE estimator is asymptotically linear, with influence function :
(4) |
where is a vector that solely depends on , and
Now, to derive the asymptotic variance of the proposed estimator for RL, we denote the vector and the gradient matrix
The asymptotic variance of the DR OPE estimator for RL can be expressed with these notations as in the following proposition.
Proposition 5 (Variance of for RL).
The variance of is given by
where and
with the variance conditioned on the history up to time step , .
Similar to the case of CB, the proposition implies that minimizing the weighted sum of seminorms with respect to and is equivalent to minimizing the asymptotic variance of . The solution of the following estimating equation is consistent for , minimizing the asymptotic variance of by .
(5) |
with the pseudo-reward for and .
5 Theoretical Properties
We now present the theoretical properties of the proposed estimator, with a focus on its asymptotic distribution. Combining Propositions 1, 2, 4, and 5, DRUnknown has the following asymptotic normality for both contextual bandit and reinforcement learning problems.
Theorem 6 (Asymptotic distribution).
When the logging policy model is correctly specified, the proposed DRUnknown estimator is asymptotically normal, given by
for
To calculate the confidence interval of , one needs to estimate the unknown variance . As we know that
we have is a consistent estimator for , with the average of and are the solutions of (5) and the MLE . By Slutsky’s theorem, we have
and combining the result with the theorem we have the confidence interval as
with the standard Gaussian quantile.
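The interval construction above can be sketched in a few lines. The example assumes that per-trajectory values of the estimated influence function are already available and that the variance estimate is their empirical second moment, which relies on the influence function having mean zero; the function name `normal_confidence_interval` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist, fmean


def normal_confidence_interval(estimate, influence_values, alpha=0.05):
    """Two-sided (1 - alpha) interval: estimate +/- z * sqrt(sigma2_hat / n).

    influence_values : per-trajectory values of the estimated influence function
    Assumption: sigma2_hat is their empirical second moment (mean-zero influence).
    """
    n = len(influence_values)
    sigma2_hat = fmean([v * v for v in influence_values])
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # standard Gaussian quantile
    half_width = z * sqrt(sigma2_hat / n)
    return estimate - half_width, estimate + half_width
```

The interval width shrinks at the usual root-n rate, so halving the width requires roughly four times as many trajectories.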
When the value function model is also correctly specified, the proposed DRUnknown is asymptotically equivalent to the DR OPE estimator using the true logging policy and the value function, and is locally efficient.
Proposition 7 (Local Efficiency).
When both the logging policy model and the value function model are correctly specified, the asymptotic variance of the proposed estimator achieves the semiparametric lower bound and is asymptotically optimal.
Proposition 7 states that DRUnknown can be asymptotically optimal when the model is properly chosen. Also, for any choice of value function model, whether it is correct or not, the proposed estimator is asymptotically more efficient compared to existing OPE algorithms, including IPW, DR, and MRDR, in the OPE problem with an unknown logging policy.
Proposition 8 (Intrinsic Efficiency).
The proposed DR OPE estimator has the smallest asymptotic variance among the class of DR OPE estimators using the MLE for , when the logging policy model is correctly specified. The same holds for an arbitrary consistent estimator of .
Corollary 9 (Comparison to Existing Algorithms).
When the logging policy model is correctly specified, the asymptotic variance of the proposed estimator is no larger than that of IPW, DR, and MRDR utilizing the same estimator for .
6 Experimental Results
In this section, we compare the performance of four estimators: (i) IPW (Horvitz and Thompson, 1952), (ii) MLIPW (Xie et al., 2019), (iii) MRDR (Farajtabar et al., 2018), and (iv) the proposed DRUnknown, on CB and RL problems. IPW requires knowledge of the true logging policy, which is unavailable in the setting we study. It therefore serves as a baseline for the other three estimators, and we report the MSE of each estimator relative to that of IPW.
6.1 Contextual Bandits
6.1.1 Simulation Data
We use the simulation environments described as follows. We generate test datasets with size , given by . The context for sample from dataset contains dimensional vectors randomly sampled from uniform distribution . The rewards are generated from the Gaussian distribution with mean and variance 1, where is a fixed coefficient also sampled from the uniform distribution .
The logging policy and target policy follow the linear logistic model with random coefficients and as and The logging policy model also follows the linear logistic model, while the value function model is defined by the linear regression model. Consequently, only the logging policy model is correctly specified.
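A simulation of this kind can be sketched as follows. All concrete choices in the code are assumptions made for illustration and are not taken from the paper: one feature vector per arm, the number of arms, the feature dimension, the uniform range, and the names `softmax_policy` and `simulate_cb_log`.

```python
import numpy as np


def softmax_policy(contexts, coef):
    """Linear logistic policy: probabilities proportional to exp(x_a . coef)."""
    logits = contexts @ coef  # shape (n, K)
    logits = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)


def simulate_cb_log(n, n_arms=5, dim=3, seed=0):
    """Draw one logged contextual-bandit dataset (sizes and ranges are assumed)."""
    rng = np.random.default_rng(seed)
    contexts = rng.uniform(-1.0, 1.0, size=(n, n_arms, dim))  # one vector per arm
    reward_coef = rng.uniform(-1.0, 1.0, size=dim)
    logging_coef = rng.uniform(-1.0, 1.0, size=dim)
    pi_b = softmax_policy(contexts, logging_coef)
    # Sample one arm per row from the logging policy by inverting the CDF.
    u = rng.uniform(size=(n, 1))
    actions = (u > np.cumsum(pi_b, axis=1)).sum(axis=1)
    actions = np.minimum(actions, n_arms - 1)  # guard against rounding in the CDF
    mean_reward = contexts[np.arange(n), actions] @ reward_coef
    rewards = rng.normal(loc=mean_reward, scale=1.0)  # Gaussian reward, variance 1
    return contexts, actions, rewards, pi_b
```

Repeating the draw with different seeds produces the collection of datasets over which an MSE can be computed.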
Table 1 and Figure 1 report the relative MSE values of MLIPW, MRDR, and the proposed DRUnknown, calculated by dividing their MSE by the MSE of IPW, for different dataset sizes. The proposed estimator achieves the lowest relative MSE. Figure 2 displays boxplots of the estimated values of the four methods, each computed with repeats and a dataset size of .
sample size | MLIPW | MRDR | DRUnknown |
---|---|---|---|
 | 0.8911 | 0.8126 | 0.8038 |
 | 0.8416 | 0.7957 | 0.7731 |
 | 0.8366 | 0.8043 | 0.7620 |
 | 0.8234 | 0.8073 | 0.7560 |
 | 0.8222 | 0.8200 | 0.7662 |
 | 0.7575 | 0.7369 | 0.6792 |
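The relative MSE reported in the table is a ratio of two Monte Carlo MSE estimates. A minimal sketch, assuming one estimate per simulated dataset and a known true policy value (the function name `relative_mse` is hypothetical):

```python
import numpy as np


def relative_mse(estimates, baseline_estimates, true_value):
    """MSE of `estimates` divided by the MSE of `baseline_estimates`.

    Each array holds one estimate per simulated dataset; in the experiments
    described here the baseline would be IPW with the true logging policy.
    """
    estimates = np.asarray(estimates, dtype=float)
    baseline_estimates = np.asarray(baseline_estimates, dtype=float)
    mse = np.mean((estimates - true_value) ** 2)
    baseline_mse = np.mean((baseline_estimates - true_value) ** 2)
    return mse / baseline_mse
```

A value below 1 means the estimator has a smaller MSE than the baseline.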
6.1.2 UCI Machine Learning Repository Dataset
For the real data experiment, we transform six classification datasets from the UCI Machine Learning Repository into contextual bandit problems: glass, letter, zoo, image, iris, and handwritten (German, 1987; Slate, 1991; Forsyth, 1990; mis, 1990; Fisher, 1988; Alpaydin and Kaynak, 1998). Assigning a data point to a class is treated as pulling an arm in the bandit. The reward is 1 when the class is correct and 0 otherwise.
We construct the logging policy as follows: we train the policy using a linear logistic model on a separate dataset, and mix the policy with a random policy as , for . The target policy is constructed as in the simulation data experiment. The logging policy class is defined as , where serves as the parameter , and we use the constant value function model for .
We generate datasets, each with a size of , by randomly sampling contexts from the original dataset and selecting arms using the logging policy . Table 2 displays the relative MSE values of MLIPW, MRDR, and our proposed DRUnknown. Additionally, Figure 3 depicts the log-relative MSE of each estimator on the glass dataset as a function of the logging policy parameter . DRUnknown consistently attains the smallest relative MSE across all six datasets.
dataset | MLIPW | MRDR | DRUnknown |
---|---|---|---|
glass | 0.9548 | 0.8308 | 0.8125 |
letter | 0.9677 | 0.9323 | 0.9265 |
zoo | 0.9987 | 0.9842 | 0.9583 |
image | 0.8250 | 0.8245 | 0.8062 |
iris | 0.6116 | 0.5715 | 0.5454 |
handwritten | 0.9521 | 0.9432 | 0.9407 |
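The conversion from a classification dataset to logged bandit feedback can be sketched as below. The mixing form used in `mix_with_uniform` (a convex combination of the trained policy and the uniform policy with weight `alpha`) is an assumption, since the exact expression is not reproduced in the text; both function names are hypothetical.

```python
import numpy as np


def mix_with_uniform(policy_probs, alpha):
    """Convex mixture of a trained policy and the uniform random policy.

    Assumption: alpha * trained + (1 - alpha) * uniform.
    """
    n_arms = policy_probs.shape[1]
    return alpha * policy_probs + (1.0 - alpha) / n_arms


def classification_to_bandit_log(labels, logging_probs, rng):
    """Turn labelled examples into bandit feedback.

    The logging policy picks a class (arm) for each example; the reward is 1
    when the chosen class equals the true label and 0 otherwise.
    """
    n, n_arms = logging_probs.shape
    actions = np.array([rng.choice(n_arms, p=p) for p in logging_probs])
    rewards = (actions == labels).astype(float)
    chosen_probs = logging_probs[np.arange(n), actions]
    return actions, rewards, chosen_probs
```

Mixing with the uniform policy keeps every arm's logging probability strictly positive whenever `alpha` is below 1, so the importance ratio stays well-defined.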
6.2 Reinforcement Learning
In this section, we provide the experimental results of OPE in the context of RL. We conducted the experiments on the ModelWin and ModelFail domains introduced in Thomas and Brunskill (2016). Detailed descriptions of these environments are available in the Appendix. As done in Section 6.1.2, we utilize a mixture of the separately trained optimal policy and the uniform random policy with a rate as the logging policy . We employ a linear logistic policy for the target policy and a linear model as the value function model .
Table 3 displays the relative MSE of each estimator in ModelWin with and ModelFail environments. The proposed DRUnknown attains the lowest relative MSE in most scenarios, provided the number of samples is large enough. Figure 4 illustrates the cumulative distribution function (CDF) of squared error values (larger is better) on the ModelWin domain with and . DRUnknown exhibits the highest CDF values, indicating that the estimated values are more concentrated around the true value with high probability.
6.3 Conclusion
In conclusion, our study has introduced DRUnknown, a novel doubly-robust off-policy evaluation estimator for contextual bandit and reinforcement learning problems where both the logging policy and the value function are unknown. Through a two-step estimation process, DRUnknown first estimates the logging policy using the maximum likelihood estimator and subsequently estimates the value function model, aiming to minimize the asymptotic variance of the estimator while accounting for the impact of the logging policy estimation.
When the logging policy model is correctly specified, DRUnknown achieves the smallest asymptotic variance within the class that encompasses existing OPE estimators. Furthermore, if the value function model is also correctly specified, DRUnknown attains optimality by reaching the semiparametric lower bound for asymptotic variance.
To show the effectiveness of DRUnknown, we conducted experiments in both contextual bandits and reinforcement learning settings. The empirical results demonstrate the superior performance of DRUnknown compared to existing methods.
7 Broader Impact
This paper addresses the off-policy evaluation problem, aiming to contribute to the field of Machine Learning. It includes theoretical contributions and numerical experimental results, with no apparent societal or ethical issues that require further discussion.
ModelWin | MLIPW | MRDR | DRUnknown |
---|---|---|---|
 | 1.1604 | 1.9963 | 0.8916 |
 | 1.0187 | 2.5577 | 0.8273 |
 | 1.1064 | 1.7349 | 0.8529 |
 | 1.0676 | 1.5270 | 0.9117 |
ModelFail | MLIPW | MRDR | DRUnknown |
---|---|---|---|
 | 0.7794 | 0.7963 | 0.8476 |
 | 0.5608 | 0.5591 | 0.2951 |
 | 0.5532 | 0.5605 | 0.3154 |
 | 0.5273 | 0.5285 | 0.1451 |
References
- Dudík et al. (2011) Miroslav Dudík, John Langford, and Lihong Li. Doubly robust policy evaluation and learning. In Proceedings of the 28th International Conference on International Conference on Machine Learning, pages 1097–1104, 2011.
- Swaminathan et al. (2017) Adith Swaminathan, Akshay Krishnamurthy, Alekh Agarwal, Miro Dudik, John Langford, Damien Jose, and Imed Zitouni. Off-policy evaluation for slate recommendation. Advances in Neural Information Processing Systems, 30, 2017.
- Precup (2000) Doina Precup. Eligibility traces for off-policy policy evaluation. Computer Science Department Faculty Publication Series, page 80, 2000.
- Mahmood et al. (2014) A Rupam Mahmood, Hado P Van Hasselt, and Richard S Sutton. Weighted importance sampling for off-policy learning with linear function approximation. Advances in neural information processing systems, 27, 2014.
- Jiang and Li (2016) Nan Jiang and Lihong Li. Doubly robust off-policy value evaluation for reinforcement learning. In International Conference on Machine Learning, pages 652–661. PMLR, 2016.
- Thomas and Brunskill (2016) Philip Thomas and Emma Brunskill. Data-efficient off-policy policy evaluation for reinforcement learning. In International Conference on Machine Learning, pages 2139–2148. PMLR, 2016.
- Wang et al. (2017) Yu-Xiang Wang, Alekh Agarwal, and Miroslav Dudík. Optimal and adaptive off-policy evaluation in contextual bandits. In International Conference on Machine Learning, pages 3589–3597. PMLR, 2017.
- Farajtabar et al. (2018) Mehrdad Farajtabar, Yinlam Chow, and Mohammad Ghavamzadeh. More robust doubly robust off-policy evaluation. In International Conference on Machine Learning, pages 1447–1456. PMLR, 2018.
- Su et al. (2020) Yi Su, Maria Dimakopoulou, Akshay Krishnamurthy, and Miroslav Dudík. Doubly robust off-policy evaluation with shrinkage. In International Conference on Machine Learning, pages 9167–9176. PMLR, 2020.
- Cao et al. (2009) Weihua Cao, Anastasios A Tsiatis, and Marie Davidian. Improving efficiency and robustness of the doubly robust estimator for a population mean with incomplete data. Biometrika, 96(3):723–734, 2009.
- Rubin and van der Laan (2008) Daniel B Rubin and Mark J van der Laan. Empirical efficiency maximization: improved locally efficient covariate adjustment in randomized experiments and survival analysis. The International Journal of Biostatistics, 4(1), 2008.
- Li et al. (2015) Lihong Li, Rémi Munos, and Csaba Szepesvári. Toward minimax off-policy value estimation. In Artificial Intelligence and Statistics, pages 608–616. PMLR, 2015.
- Raghu et al. (2018) Aniruddh Raghu, Omer Gottesman, Yao Liu, Matthieu Komorowski, Aldo Faisal, Finale Doshi-Velez, and Emma Brunskill. Behaviour policy estimation in off-policy policy evaluation: Calibration matters. arXiv preprint arXiv:1807.01066, 2018.
- Xie et al. (2019) Yuan Xie, Boyi Liu, Qiang Liu, Zhaoran Wang, Yuan Zhou, and Jian Peng. Off-policy evaluation and learning from logged bandit feedback: Error reduction via surrogate policy. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019. URL https://openreview.net/forum?id=HklKui0ct7.
- Hanna et al. (2021) Josiah P Hanna, Scott Niekum, and Peter Stone. Importance sampling in reinforcement learning with an estimated behavior policy. Machine Learning, 110(6):1267–1317, 2021.
- Rothe (2016) Christoph Rothe. The value of knowing the propensity score for estimating average treatment effects. Available at SSRN 2797560, 2016.
- Horvitz and Thompson (1952) Daniel G Horvitz and Donovan J Thompson. A generalization of sampling without replacement from a finite universe. Journal of the American statistical Association, 47(260):663–685, 1952.
- Cassel et al. (1976) Claes M Cassel, Carl E Särndal, and Jan H Wretman. Some results on generalized difference estimation and generalized regression estimation for finite populations. Biometrika, 63(3):615–620, 1976.
- Robins and Rotnitzky (1995) James M Robins and Andrea Rotnitzky. Semiparametric efficiency in multivariate regression models with missing data. Journal of the American Statistical Association, 90(429):122–129, 1995.
- German (1987) B. German. Glass Identification. UCI Machine Learning Repository, 1987. DOI: https://doi.org/10.24432/C5WW2P.
- Slate (1991) David Slate. Letter Recognition. UCI Machine Learning Repository, 1991. DOI: https://doi.org/10.24432/C5ZP40.
- Forsyth (1990) Richard Forsyth. Zoo. UCI Machine Learning Repository, 1990. DOI: https://doi.org/10.24432/C5R59V.
- Image Segmentation (1990) Image Segmentation. UCI Machine Learning Repository, 1990. DOI: https://doi.org/10.24432/C5GP4N.
- Fisher (1988) R. A. Fisher. Iris. UCI Machine Learning Repository, 1988. DOI: https://doi.org/10.24432/C56C76.
- Alpaydin and Kaynak (1998) E. Alpaydin and C. Kaynak. Optical Recognition of Handwritten Digits. UCI Machine Learning Repository, 1998. DOI: https://doi.org/10.24432/C50P49.
- Kallus and Uehara (2020) Nathan Kallus and Masatoshi Uehara. Double reinforcement learning for efficient and robust off-policy evaluation. In International Conference on Machine Learning, pages 5078–5088. PMLR, 2020.
Appendix A Missing Proofs
A.1 Proof of Proposition 1
Proof.
By the Taylor expansion,
where
and for some . As is correctly specified, we have
For the following term,
For , from we have where
and
Plugging in the results above and denoting , we have
for , since ∎
A.2 Proof of Proposition 2
Proof.
By the law of total variance, we decompose the variance of with , given contexts and rewards :
For the first term, we have and for the second term,
and since , we have
∎
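For reference, the law of total variance used in the proof above can be written in generic notation as follows. The symbols Z and W are placeholders introduced here for illustration and are not the paper's notation: Z stands for the estimator and W for the conditioning variables (the contexts and rewards), so that the randomness remaining after conditioning comes from the actions.

```latex
\[
\operatorname{Var}(Z)
  = \mathbb{E}\bigl[\operatorname{Var}(Z \mid W)\bigr]
  + \operatorname{Var}\bigl(\mathbb{E}[Z \mid W]\bigr).
\]
```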
A.3 Proof of Proposition 4
Proof.
The proof is similar to that of Proposition 1 for contextual bandits, with .
For each , applying the Taylor expansion, we have
where
and for some . As is correctly specified, we have
For the following term,
For , from we have where
and
Plugging in the results above and denoting , we have
for and with
∎
A.4 Proof of Proposition 5
Proof.
Section 4.1 of Jiang and Li [2016] introduces an inductive definition for a DR OPE estimator with a known logging policy and a fixed value function model. Theorem 1 of the same paper provides the variance of this DR OPE estimator, as follows:
where refers to the conditional variance . The proof of the theorem still holds for a more general class of , which takes the state-action trajectory and the time step as arguments. Therefore, following the same approach as the proof of Theorem 1, we obtain the following representation of our proposed DRUnknown for RL:
The first term does not depend on the parameters and , so we denote it as a constant
The variance in the second term is conditioned on , so the only remaining randomness comes from the action at time step . It can therefore be computed as in the proof of Proposition 2 and represented as a stochastic semi-norm as below, which gives the desired result.
∎
A.5 Proof of Proposition 7
Proof.
Theorem 3 of Kallus and Uehara [2020] states that in the OPE problem, the DR estimator with the true logging policy and value function , given by
achieves the smallest asymptotic variance, reaching the semiparametric lower bound, for the case with a discount factor . When the value function model is also correctly specified, the proposed DRUnknown is asymptotically equivalent to , as the for all and . For the general problem with , we can modify the MDP by fixing the discount factor to 1, incorporating the time step into the state variable , and changing the reward function to . ∎
A.6 Proof of Proposition 8
Proof.
The IPW [Horvitz and Thompson, 1952], DR [Dudík et al., 2011, Jiang and Li, 2016], and MRDR [Farajtabar et al., 2018] estimators were originally designed for the OPE problem with a known logging policy . Therefore, for comparison, we assume that the logging policy model is estimated by MLE, the same as in our proposed estimator. We can observe that all three estimators are asymptotically equivalent to for some value of and fixed . As none of the three estimators takes the estimation effect of into account, they all have .
IPW sets the value of such that the value function model becomes zero, whereas the DR estimator finds by minimizing the least-squares error for the value function. MRDR minimizes the variance of the estimator only with respect to , with fixed to zero. The class of estimators contains all these estimators, and the proposed DRUnknown achieves the smallest asymptotic variance within this class, so it is at least as efficient as each of them. ∎
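To make the relationship between the baseline estimators concrete, the following is a minimal sketch of the standard IPW and DR estimators for a contextual bandit with an estimated logging policy. It is not the proposed DRUnknown. The function names, the toy data, and the context-free logging policy estimated by empirical action frequencies are all assumptions made for illustration. The sketch shows that DR with a value function model fixed at zero coincides with IPW.

```python
import numpy as np

def ipw_estimate(pi_e, pi_b_hat, actions, rewards):
    """Standard IPW estimate: mean of importance-weighted rewards."""
    idx = np.arange(len(actions))
    weights = pi_e[idx, actions] / pi_b_hat[idx, actions]
    return float(np.mean(weights * rewards))

def dr_estimate(pi_e, pi_b_hat, actions, rewards, q_hat):
    """Standard DR estimate: model baseline plus weighted residual."""
    idx = np.arange(len(actions))
    weights = pi_e[idx, actions] / pi_b_hat[idx, actions]
    baseline = np.sum(pi_e * q_hat, axis=1)            # E_{a ~ pi_e}[q_hat(x, a)]
    correction = weights * (rewards - q_hat[idx, actions])
    return float(np.mean(baseline + correction))

# Hypothetical toy problem: 3 actions, policies that ignore the context.
rng = np.random.default_rng(0)
n, k = 1000, 3
pi_b = np.tile([0.5, 0.3, 0.2], (n, 1))                # true logging policy
pi_e = np.tile([0.2, 0.3, 0.5], (n, 1))                # target policy
actions = rng.choice(k, size=n, p=pi_b[0])
reward_means = np.array([0.2, 0.5, 0.8])
rewards = rng.binomial(1, reward_means[actions]).astype(float)

# MLE of a context-free logging policy is the empirical action frequency.
pi_b_hat = np.tile(np.bincount(actions, minlength=k) / n, (n, 1))

ipw_value = ipw_estimate(pi_e, pi_b_hat, actions, rewards)
dr_zero_model = dr_estimate(pi_e, pi_b_hat, actions, rewards, np.zeros((n, k)))
```

Passing a fitted reward model as `q_hat` instead of zeros gives the usual DR estimate; the choice of that model is exactly what distinguishes the estimators discussed in the proof.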
Appendix B Estimation of with General Estimating Equation
The maximum likelihood estimator used to estimate in this work is the most efficient choice in many situations. However, the theoretical results in this paper also apply to other estimating equations of the form given below,
for any smooth function . The equation is an unbiased estimating equation as long as the logging policy model is correctly specified. One possible choice assigns more weight to state-action pairs with high probability in . The resulting estimator remains consistent, satisfies the theoretical properties, and may be particularly useful when the sample size is small.
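As an illustration of the maximum likelihood special case only, the sketch below fits a logging policy model by solving the score equation, which is the estimating equation that MLE corresponds to. The softmax-linear policy model, the data-generating ranges, the step size, and all names are assumptions made for this example; the general weighted estimating equation of this appendix is not reproduced here.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def score(phi, X, actions):
    """Average score: gradient of the mean log-likelihood w.r.t. phi (k x d)."""
    k = phi.shape[0]
    probs = softmax(X @ phi.T)            # n x k action probabilities
    onehot = np.eye(k)[actions]           # n x k indicators of logged actions
    return (onehot - probs).T @ X / X.shape[0]

def fit_mle(X, actions, k, lr=1.0, n_iter=5000):
    """Gradient ascent on the concave log-likelihood until the score vanishes."""
    phi = np.zeros((k, X.shape[1]))
    for _ in range(n_iter):
        phi += lr * score(phi, X, actions)
    return phi

# Hypothetical logged data from a softmax-linear logging policy.
rng = np.random.default_rng(0)
n, d, k = 500, 3, 3
X = rng.uniform(-1.0, 1.0, size=(n, d))
phi_true = rng.uniform(-1.0, 1.0, size=(k, d))
probs_true = softmax(X @ phi_true.T)
actions = np.array([rng.choice(k, p=p) for p in probs_true])

phi_hat = fit_mle(X, actions, k)
score_norm = float(np.linalg.norm(score(phi_hat, X, actions)))  # close to zero at the MLE
pi_b_hat = softmax(X @ phi_hat.T)                                # estimated logging policy
```

The fitted probabilities `pi_b_hat` can then be plugged into any importance-weighted estimator in place of the unknown logging policy.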
Appendix C Descriptions on Experimental Settings
C.1 Simulation Data
To build the synthetic dataset for the simulation experiment, we generate elements for the context vectors and the coefficient vector from the uniform distribution . The reward mean is determined by the nonlinear function , making the linear value function model incorrectly specified.
For the logging policy and the target policy , we generate the coefficients and from the uniform distribution and , respectively.
C.2 UCI Dataset
The six datasets used in these experiments were initially designed for the classification problem. To transform the problem into a bandit setting, we interpret the assignment of the class label as the selection of an arm in a bandit, with a reward of 1 if the class is correct and 0 if incorrect. We train the classifier , which returns the probability of each class label given a context vector . We treat as a policy and combine it with a uniform random policy with rates and . The value function model is constant and can be regarded as an intercept value.
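The classification-to-bandit conversion described above can be sketched as follows. The mixing rate `alpha`, the random stand-in for classifier outputs, and all function names are hypothetical; the actual rates and classifier used in the experiments are not specified here.

```python
import numpy as np

def mix_with_uniform(class_probs, alpha):
    """Policy that follows the classifier with rate alpha, uniform otherwise."""
    k = class_probs.shape[1]
    return alpha * class_probs + (1.0 - alpha) / k

def log_bandit_feedback(policy, labels, rng):
    """Sample one arm per context; reward is 1 if the arm is the true label."""
    k = policy.shape[1]
    actions = np.array([rng.choice(k, p=p) for p in policy])
    rewards = (actions == labels).astype(float)
    return actions, rewards

def policy_value(policy, labels):
    """Exact value on this dataset: mean probability of choosing the true label."""
    return float(np.mean(policy[np.arange(len(labels)), labels]))

# Hypothetical stand-in for classifier outputs on n contexts with k classes.
rng = np.random.default_rng(0)
n, k = 300, 4
class_probs = rng.dirichlet(np.ones(k), size=n)
labels = rng.integers(k, size=n)

logging_policy = mix_with_uniform(class_probs, alpha=0.3)   # assumed rate
target_policy = mix_with_uniform(class_probs, alpha=0.9)    # assumed rate
actions, rewards = log_bandit_feedback(logging_policy, labels, rng)
target_value = policy_value(target_policy, labels)
```

Because the true label of every context is known, `policy_value` gives a ground-truth value against which an OPE estimate computed from `actions` and `rewards` can be compared.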
C.3 ModelWin and ModelFail
C.3.1 ModelWin
ModelWin consists of three states, starting from state . When choosing action , the agent moves to with a probability of 0.6 and to with a probability of 0.4. Conversely, action leads to with a probability of 0.4 and to with a probability of 0.6. In states and , both actions return the agent to with a probability of 1. If the agent visits or , it receives rewards of 1 and -1, respectively. The horizon is fixed at .
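A minimal simulation sketch of the ModelWin dynamics described above is given below. The state indices (0 for the start state, 1 and 2 for the two successor states), the action indices, the assumption that the reward is received on arrival at a successor state, and the horizon value are all illustrative choices; the horizon is left as a parameter.

```python
import numpy as np

# Probability of moving from the start state to the +1-reward state,
# for the first action (index 0) and the second action (index 1).
P_TO_REWARD_STATE = {0: 0.6, 1: 0.4}

def expected_visit_reward(p_first_action):
    """Expected reward of one move out of the start state under a policy
    that picks the first action with probability p_first_action."""
    q = p_first_action * 0.6 + (1.0 - p_first_action) * 0.4
    return q * 1.0 + (1.0 - q) * (-1.0)

def simulate_modelwin(p_first_action, horizon, rng):
    """Roll out one episode and return the undiscounted total reward."""
    state, total = 0, 0.0
    for _ in range(horizon):
        action = 0 if rng.random() < p_first_action else 1
        if state == 0:
            state = 1 if rng.random() < P_TO_REWARD_STATE[action] else 2
            total += 1.0 if state == 1 else -1.0
        else:
            state = 0          # both actions return to the start state
    return total

rng = np.random.default_rng(0)
per_visit = expected_visit_reward(0.7)       # approximately 0.08
episode_return = simulate_modelwin(0.7, horizon=20, rng=rng)
```

With a policy that picks the first action with probability 0.7, the agent reaches the +1 state with probability 0.54, so each departure from the start state is worth about 0.08 in expectation.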
C.3.2 ModelFail
ModelFail is an MDP with 4 states, but the learner cannot observe the current state of the agent. Starting from , action leads to the upper middle state , and to the lower middle state . From both, any action moves to the terminal state . If the transition is from the upper state, a reward of 1 is received; otherwise, a reward of -1. The horizon is always .
For both environments, the target policy at selects action with a probability of 0.7 and with a probability of 0.3. The logging policy chooses action with a probability of 0.75 and with a probability of 0.25. The parameter of the logging policy model for both environments is the probability of choosing , and the value function model is given by the linear model with intercepts.
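For ModelFail, only the first action affects the reward, so the environment behaves like a two-armed bandit with deterministic rewards. The sketch below uses that simplification to compare standard IPW under the true logging policy with IPW under the logging policy estimated by MLE (the empirical frequency of the first action). It is an illustration of the effect of estimating the logging policy, not the proposed DRUnknown; the action indexing, sample size, and names are assumptions.

```python
import numpy as np

PI_E = np.array([0.7, 0.3])      # target policy at the start state
PI_B = np.array([0.75, 0.25])    # true logging policy at the start state
REWARD = np.array([1.0, -1.0])   # first action -> upper state (+1), second -> lower (-1)

def ipw(actions, pi_b):
    """Standard IPW estimate with logging probabilities pi_b."""
    weights = PI_E[actions] / pi_b[actions]
    return float(np.mean(weights * REWARD[actions]))

rng = np.random.default_rng(0)
n = 200
actions = rng.choice(2, size=n, p=PI_B)

# MLE of the single logging-policy parameter: empirical action frequencies.
pi_b_hat = np.bincount(actions, minlength=2) / n

true_value = float(PI_E @ REWARD)        # 0.7 * 1 + 0.3 * (-1) = 0.4
ipw_known = ipw(actions, PI_B)           # varies with the sample
ipw_estimated = ipw(actions, pi_b_hat)   # equals 0.4 whenever both actions are logged
```

In this tiny example the estimated-policy weights cancel the sampling noise in the action counts exactly, so `ipw_estimated` recovers the true value while `ipw_known` fluctuates around it. The exact cancellation is specific to deterministic rewards and a context-free policy, but it illustrates why accounting for logging policy estimation can reduce variance.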
Appendix D Limitations
This paper primarily focuses on the asymptotic properties of the proposed estimator for the OPE problem with an unknown logging policy. We do not extensively explore the estimator’s behavior with a finite sample. This aligns with similar statistical works, such as Cao et al. [2009], which addresses estimated missing mechanisms without providing finite-sample theory. Additionally, current DR OPE methods do not address scenarios with an unknown logging policy.