-
Off-Policy Evaluation of Ranking Policies under Diverse User Behavior
Authors:
Haruka Kiyohara,
Masatoshi Uehara,
Yusuke Narita,
Nobuyuki Shimizu,
Yasuo Yamamoto,
Yuta Saito
Abstract:
Ranking interfaces are everywhere in online platforms. There is thus an ever growing interest in their Off-Policy Evaluation (OPE), aiming towards an accurate performance evaluation of ranking policies using logged data. A de-facto approach for OPE is Inverse Propensity Scoring (IPS), which provides an unbiased and consistent value estimate. However, it becomes extremely inaccurate in the ranking setup due to its high variance under large action spaces. To deal with this problem, previous studies assume either independent or cascade user behavior, resulting in some ranking versions of IPS. While these estimators are somewhat effective in reducing the variance, all existing estimators apply a single universal assumption to every user, causing excessive bias and variance. Therefore, this work explores a far more general formulation where user behavior is diverse and can vary depending on the user context. We show that the resulting estimator, which we call Adaptive IPS (AIPS), can be unbiased under any complex user behavior. Moreover, AIPS achieves the minimum variance among all unbiased estimators based on IPS. We further develop a procedure to identify the appropriate user behavior model to minimize the mean squared error (MSE) of AIPS in a data-driven fashion. Extensive experiments demonstrate that the empirical accuracy improvement can be significant, enabling effective OPE of ranking systems even under diverse user behavior.
Submitted 26 June, 2023;
originally announced June 2023.
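As background for the IPS estimator that the abstract above builds on, here is a minimal sketch of plain IPS in a context-free bandit with two actions. It is not the ranking setting and not the AIPS estimator from the paper. The policies, reward means, and function names (`ips_estimate`, `simulate_logs`) are hypothetical and chosen only for illustration.

```python
import random


def ips_estimate(rewards, logging_probs, target_probs):
    """Average of rewards reweighted by target/logging action probabilities."""
    weighted = [r * pe / pb for r, pb, pe in zip(rewards, logging_probs, target_probs)]
    return sum(weighted) / len(weighted)


def simulate_logs(n, pi_b, pi_e, reward_means, seed=0):
    """Draw n logged rounds from the logging policy pi_b (assumed toy setup)."""
    rng = random.Random(seed)
    rewards, logging_probs, target_probs = [], [], []
    for _ in range(n):
        action = 0 if rng.random() < pi_b[0] else 1
        reward = 1.0 if rng.random() < reward_means[action] else 0.0
        rewards.append(reward)
        logging_probs.append(pi_b[action])   # probability the logger chose this action
        target_probs.append(pi_e[action])    # probability the new policy would choose it
    return rewards, logging_probs, target_probs


# Hypothetical two-action problem: uniform logging policy, skewed target policy.
pi_b = [0.5, 0.5]
pi_e = [0.1, 0.9]
reward_means = [0.2, 0.8]

rewards, pb, pe = simulate_logs(20000, pi_b, pi_e, reward_means)
estimate = ips_estimate(rewards, pb, pe)
true_value = sum(p * m for p, m in zip(pi_e, reward_means))
print(f"IPS estimate: {estimate:.3f}, true value: {true_value:.3f}")
```

The weights `pe / pb` grow as the logging policy assigns less probability to the logged action. In a ranking setup the action is a whole ordered list, so these probabilities become very small, which is the source of the variance problem the abstract describes.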
-
Counterfactual Learning with General Data-generating Policies
Authors:
Yusuke Narita,
Kyohei Okumura,
Akihiro Shimizu,
Kohei Yata
Abstract:
Off-policy evaluation (OPE) attempts to predict the performance of counterfactual policies using log data from a different policy. We extend its applicability by developing an OPE method for a class of both full support and deficient support logging policies in contextual-bandit settings. This class includes deterministic bandit algorithms (such as Upper Confidence Bound) as well as deterministic decision-making based on supervised and unsupervised learning. We prove that our method's prediction converges in probability to the true performance of a counterfactual policy as the sample size increases. We validate our method with experiments on partly and entirely deterministic logging policies. Finally, we apply it to evaluate coupon targeting policies by a major online platform and show how to improve the existing policy.
Submitted 4 December, 2022;
originally announced December 2022.
-
Incorporating Participants' Welfare into Sequential Multiple Assignment Randomized Trials
Authors:
Xinru Wang,
Nina Deliu,
Yusuke Narita,
Bibhas Chakraborty
Abstract:
Dynamic treatment regimes (DTRs) are sequences of decision rules that recommend treatments based on patients' time-varying clinical conditions. The sequential multiple assignment randomized trial (SMART) is an experimental design that can provide high-quality evidence for constructing optimal DTRs. In a conventional SMART, participants are randomized to available treatments at multiple stages with balanced randomization probabilities. Despite its relative simplicity of implementation and desirable performance in comparing embedded DTRs, the conventional SMART faces inevitable ethical issues including assigning many participants to the empirically inferior treatment or the treatment they dislike, which might slow down the recruitment procedure and lead to higher attrition rates, ultimately leading to poor internal and external validities of the trial results. In this context, we propose a SMART under the Experiment-as-Market framework (SMART-EXAM), a novel SMART design that holds the potential to improve participants' welfare by incorporating their preferences and predicted treatment effects into the randomization procedure. We describe the steps of conducting a SMART-EXAM and evaluate its performance compared to the conventional SMART. The results indicate that the SMART-EXAM can improve the welfare of the participants enrolled in the trial, while also achieving a desirable ability to construct an optimal DTR when the experimental parameters are suitably specified. We finally illustrate the practical potential of the SMART-EXAM design using data from a SMART for children with attention-deficit/hyperactivity disorder (ADHD).
Submitted 19 September, 2023; v1 submitted 28 October, 2022;
originally announced October 2022.
-
Doubly Robust Off-Policy Evaluation for Ranking Policies under the Cascade Behavior Model
Authors:
Haruka Kiyohara,
Yuta Saito,
Tatsuya Matsuhiro,
Yusuke Narita,
Nobuyuki Shimizu,
Yasuo Yamamoto
Abstract:
In real-world recommender systems and search engines, optimizing ranking decisions to present a ranked list of relevant items is critical. Off-policy evaluation (OPE) for ranking policies is thus gaining a growing interest because it enables performance estimation of new ranking policies using only logged data. Although OPE in contextual bandits has been studied extensively, its naive application to the ranking setting faces a critical variance issue due to the huge item space. To tackle this problem, previous studies introduce some assumptions on user behavior to make the combinatorial item space tractable. However, an unrealistic assumption may, in turn, cause serious bias. Therefore, appropriately controlling the bias-variance tradeoff by imposing a reasonable assumption is the key for success in OPE of ranking policies. To achieve a well-balanced bias-variance tradeoff, we propose the Cascade Doubly Robust estimator building on the cascade assumption, which assumes that a user interacts with items sequentially from the top position in a ranking. We show that the proposed estimator is unbiased in more cases compared to existing estimators that make stronger assumptions. Furthermore, compared to a previous estimator based on the same cascade assumption, the proposed estimator reduces the variance by leveraging a control variate. Comprehensive experiments on both synthetic and real-world data demonstrate that our estimator leads to more accurate OPE than existing estimators in a variety of settings.
Submitted 3 February, 2022;
originally announced February 2022.
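The control-variate idea mentioned in the abstract above can be illustrated with the standard doubly robust (DR) estimator for a context-free bandit. This is a simplified sketch, not the Cascade Doubly Robust estimator from the paper: there is no ranking and no cascade assumption here. The function name `dr_estimate` and the reward model `q_hat` are hypothetical.

```python
def dr_estimate(actions, rewards, logging_probs, target_policy, q_hat):
    """Standard doubly robust estimate for a context-free bandit.

    actions        logged action indices
    rewards        observed rewards for the logged actions
    logging_probs  probability the logging policy gave to each logged action
    target_policy  target_policy[a] = probability of action a under the new policy
    q_hat          q_hat[a] = estimated expected reward of action a (assumed given)
    """
    # Model-based baseline: expected reward of the target policy under q_hat.
    baseline = sum(p * q for p, q in zip(target_policy, q_hat))
    total = 0.0
    for a, r, pb in zip(actions, rewards, logging_probs):
        weight = target_policy[a] / pb
        # Importance-weighted residual corrects the baseline where q_hat is wrong.
        total += baseline + weight * (r - q_hat[a])
    return total / len(actions)


# Tiny hand-checkable log with two rounds (hypothetical numbers).
actions = [1, 0]
rewards = [1.0, 0.0]
logging_probs = [0.5, 0.5]
target_policy = [0.1, 0.9]

with_model = dr_estimate(actions, rewards, logging_probs, target_policy, [0.2, 0.8])
without_model = dr_estimate(actions, rewards, logging_probs, target_policy, [0.0, 0.0])
print(with_model, without_model)
```

With `q_hat` set to zero the estimator reduces to plain IPS. A reasonable `q_hat` shrinks the residual `r - q_hat[a]` that gets multiplied by the importance weight, which is how the control variate lowers variance.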
-
Evaluating the Robustness of Off-Policy Evaluation
Authors:
Yuta Saito,
Takuma Udagawa,
Haruka Kiyohara,
Kazuki Mogi,
Yusuke Narita,
Kei Tateno
Abstract:
Off-policy Evaluation (OPE), or offline evaluation in general, evaluates the performance of hypothetical policies leveraging only offline log data. It is particularly useful in applications where online interaction is high-stakes and expensive, such as precision medicine and recommender systems. Since many OPE estimators have been proposed and some of them have hyperparameters to be tuned, there is an emerging challenge for practitioners to select and tune OPE estimators for their specific application. Unfortunately, identifying a reliable estimator from results reported in research papers is often difficult because the current experimental procedure evaluates and compares the estimators' performance on a narrow set of hyperparameters and evaluation policies. Therefore, it is difficult to know which estimator is safe and reliable to use. In this work, we develop Interpretable Evaluation for Offline Evaluation (IEOE), an experimental procedure to evaluate OPE estimators' robustness to changes in hyperparameters and/or evaluation policies in an interpretable manner. Then, using the IEOE procedure, we perform extensive evaluation of a wide variety of existing estimators on Open Bandit Dataset, a large-scale public real-world dataset for OPE. We demonstrate that our procedure can evaluate the estimators' robustness to the hyperparameter choice, helping us avoid using unsafe estimators. Finally, we apply IEOE to real-world e-commerce platform data and demonstrate how to use our protocol in practice.
Submitted 31 August, 2021;
originally announced August 2021.
-
Algorithm as Experiment: Machine Learning, Market Design, and Policy Eligibility Rules
Authors:
Yusuke Narita,
Kohei Yata
Abstract:
Algorithms make a growing portion of policy and business decisions. We develop a treatment-effect estimator using algorithmic decisions as instruments for a class of stochastic and deterministic algorithms. Our estimator is consistent and asymptotically normal for well-defined causal effects. A special case of our setup is multidimensional regression discontinuity designs with complex boundaries. We apply our estimator to evaluate the Coronavirus Aid, Relief, and Economic Security Act, which allocated many billions of dollars worth of relief funding to hospitals via an algorithmic rule. The funding is shown to have little effect on COVID-19-related hospital activities. Naive estimates exhibit selection bias.
Submitted 5 December, 2023; v1 submitted 26 April, 2021;
originally announced April 2021.
-
Curse of Democracy: Evidence from the 21st Century
Authors:
Yusuke Narita,
Ayumi Sudo
Abstract:
Democracy is widely believed to contribute to economic growth and public health in the 20th and earlier centuries. We find that this conventional wisdom is reversed in this century, i.e., democracy has persistent negative impacts on GDP growth during 2001-2020. This finding emerges from five different instrumental variable strategies. Our analysis suggests that democracies cause slower growth through less investment and trade. For 2020, democracy is also found to cause more deaths from Covid-19.
Submitted 26 September, 2021; v1 submitted 15 April, 2021;
originally announced April 2021.
-
Breaking Ties: Regression Discontinuity Design Meets Market Design
Authors:
Atila Abdulkadiroglu,
Joshua D. Angrist,
Yusuke Narita,
Parag Pathak
Abstract:
Many schools in large urban districts have more applicants than seats. Centralized school assignment algorithms ration seats at over-subscribed schools using randomly assigned lottery numbers, non-lottery tie-breakers like test scores, or both. The New York City public high school match illustrates the latter, using test scores and other criteria to rank applicants at ``screened'' schools, combined with lottery tie-breaking at unscreened ``lottery'' schools. We show how to identify causal effects of school attendance in such settings. Our approach generalizes regression discontinuity methods to allow for multiple treatments and multiple running variables, some of which are randomly assigned. The key to this generalization is a local propensity score that quantifies the school assignment probabilities induced by lottery and non-lottery tie-breakers. The local propensity score is applied in an empirical assessment of the predictive value of New York City's school report cards. Schools that receive a high grade indeed improve SAT math scores and increase graduation rates, though by much less than OLS estimates suggest. Selection bias in OLS estimates is egregious for screened schools.
Submitted 31 December, 2020;
originally announced January 2021.
-
Open Bandit Dataset and Pipeline: Towards Realistic and Reproducible Off-Policy Evaluation
Authors:
Yuta Saito,
Shunsuke Aihara,
Megumi Matsutani,
Yusuke Narita
Abstract:
Off-policy evaluation (OPE) aims to estimate the performance of hypothetical policies using data generated by a different policy. Because of its huge potential impact in practice, there has been growing research interest in this field. There is, however, no real-world public dataset that enables the evaluation of OPE, making its experimental studies unrealistic and irreproducible. With the goal of enabling realistic and reproducible OPE research, we present Open Bandit Dataset, a public logged bandit dataset collected on a large-scale fashion e-commerce platform, ZOZOTOWN. Our dataset is unique in that it contains a set of multiple logged bandit datasets collected by running different policies on the same platform. This enables experimental comparisons of different OPE estimators for the first time. We also develop Python software called Open Bandit Pipeline to streamline and standardize the implementation of batch bandit algorithms and OPE. Our open data and software will contribute to fair and transparent OPE research and help the community identify fruitful research directions. We provide extensive benchmark experiments of existing OPE estimators using our dataset and software. The results open up essential challenges and new avenues for future OPE research.
Submitted 26 October, 2021; v1 submitted 17 August, 2020;
originally announced August 2020.
-
Debiased Off-Policy Evaluation for Recommendation Systems
Authors:
Yusuke Narita,
Shota Yasui,
Kohei Yata
Abstract:
Efficient methods to evaluate new algorithms are critical for improving interactive bandit and reinforcement learning systems such as recommendation systems. A/B tests are reliable, but are time- and money-consuming, and entail a risk of failure. In this paper, we develop an alternative method, which predicts the performance of algorithms given historical data that may have been generated by a different algorithm. Our estimator has the property that its prediction converges in probability to the true performance of a counterfactual algorithm at a rate of $\sqrt{N}$, as the sample size $N$ increases. We also show a correct way to estimate the variance of our prediction, thus allowing the analyst to quantify the uncertainty in the prediction. These properties hold even when the analyst does not know which among a large number of potentially important state variables are actually important. We validate our method by a simulation experiment about reinforcement learning. We finally apply it to improve advertisement design by a major advertisement company. We find that our method produces smaller mean squared errors than state-of-the-art methods.
Submitted 2 August, 2021; v1 submitted 19 February, 2020;
originally announced February 2020.
-
Efficient Adaptive Experimental Design for Average Treatment Effect Estimation
Authors:
Masahiro Kato,
Takuya Ishihara,
Junya Honda,
Yusuke Narita
Abstract:
The goal of many scientific experiments including A/B testing is to estimate the average treatment effect (ATE), which is defined as the difference between the expected outcomes of two or more treatments. In this paper, we consider a situation where an experimenter can assign a treatment to research subjects sequentially. In adaptive experimental design, the experimenter is allowed to change the probability of assigning a treatment using past observations for estimating the ATE efficiently. However, with this approach, it is difficult to apply a standard statistical method to construct an estimator because the observations are not independent and identically distributed. We thus propose an algorithm for efficient experiments with estimators constructed from dependent samples. We also introduce a sequential testing framework using the proposed estimator. To justify our proposed approach, we provide finite and infinite sample analyses. Finally, we experimentally show that the proposed algorithm exhibits preferable performance.
Submitted 26 October, 2021; v1 submitted 12 February, 2020;
originally announced February 2020.
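For reference, the ATE described in words in the abstract above is usually written as follows in the two-treatment case. The potential-outcome notation $Y(1)$, $Y(0)$ and the symbol $\theta$ are assumptions for illustration and may differ from the paper's own notation.

```latex
\theta \;=\; \mathbb{E}\big[\,Y(1)\,\big] \;-\; \mathbb{E}\big[\,Y(0)\,\big]
```

Here $Y(1)$ and $Y(0)$ denote the outcomes a subject would have under treatment and under control, respectively.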
-
Efficient Counterfactual Learning from Bandit Feedback
Authors:
Yusuke Narita,
Shota Yasui,
Kohei Yata
Abstract:
What is the most statistically efficient way to do off-policy evaluation and optimization with batch data from bandit feedback? For log data generated by contextual bandit algorithms, we consider offline estimators for the expected reward from a counterfactual policy. Our estimators are shown to have lowest variance in a wide class of estimators, achieving variance reduction relative to standard estimators. We then apply our estimators to improve advertisement design by a major advertisement company. Consistent with the theoretical result, our estimators allow us to improve on the existing bandit algorithm with more statistical confidence compared to a state-of-the-art benchmark.
Submitted 5 December, 2018; v1 submitted 9 September, 2018;
originally announced September 2018.