-
Ranking by Lifts: A Cost-Benefit Approach to Large-Scale A/B Tests
Authors:
Pallavi Basu,
Ron Berman
Abstract:
A/B testers conducting large-scale tests prioritize lifts and want to be able to control false rejections of the null. This work develops a decision-theoretic framework for maximizing profits subject to false discovery rate (FDR) control. We build an empirical Bayes solution for the problem via the greedy knapsack approach. We derive an oracle rule based on ranking the ratio of expected lifts and the cost of wrong rejections using the local false discovery rate (lfdr) statistic. Our oracle decision rule is valid and optimal for large-scale tests. Further, we establish asymptotic validity for the data-driven procedure and demonstrate finite-sample validity in experimental studies. We also demonstrate the merit of the proposed method over other FDR control methods. Finally, we discuss an application to actual Optimizely experiments.
Submitted 1 July, 2024;
originally announced July 2024.
-
Measuring Discrepancies in Airbnb Guest Acceptance Rates Using Anonymized Demographic Data
Authors:
Siddhartha Basu,
Ruthie Berman,
Adam Bloomston,
John Campbell,
Anne Diaz,
Nanako Era,
Benjamin Evans,
Sukhada Palkar,
Skyler Wharton
Abstract:
In order to make technological systems and platforms more equitable, organizations must be able to measure the scale of potential inequities as well as the efficacy of proposed solutions. In this paper, we present a system that measures discrepancies in platform user experience that are attributable to perceived race (experience gaps) using anonymized data. This allows for progress to be made in this area while limiting any potential privacy risk. Specifically, the system enforces the privacy model of p-sensitive k-anonymity to conduct measurement without ever storing or having access to a 1:1 mapping between user identifiers and perceived race. We test this system in the context of the Airbnb guest booking experience. Our simulation-based power analysis shows that the system can measure the efficacy of proposed platform-wide interventions with comparable precision to non-anonymized data. Our work establishes that measurement of experience gaps with anonymized data is feasible and can be used to guide the development of policies to promote equitable outcomes for users of Airbnb as well as other technology platforms.
Submitted 25 April, 2022;
originally announced April 2022.
-
Latent Stratification for Incrementality Experiments
Authors:
Ron Berman,
Elea McDonnell Feit
Abstract:
Incrementality experiments compare customers exposed to a marketing action designed to increase sales to those randomly assigned to a control group. These experiments suffer from noisy responses which make precise estimation of the average treatment effect (ATE) and marketing ROI difficult. We develop a model that improves the precision by estimating separate treatment effects for three latent strata defined by potential outcomes in the experiment -- customers who would buy regardless of ad exposure, those who would buy only if exposed to ads and those who would not buy regardless. The overall ATE is estimated by averaging the strata-level effects, and this produces a more precise estimator of the ATE over a wide range of conditions typical of marketing experiments. Analytical results and simulations show that the method decreases the sampling variance of the ATE most when (1) there are large differences in the treatment effect between latent strata and (2) the model used to estimate the strata-level effects is well-identified. Applying the procedure to 5 catalog experiments shows a reduction of 30-60% in the variance of the overall ATE. This leads to a substantial decrease in decision errors when the estimator is used to determine whether ads should be continued or discontinued.
Submitted 9 May, 2023; v1 submitted 19 November, 2019;
originally announced November 2019.
-
Test & Roll: Profit-Maximizing A/B Tests
Authors:
Elea McDonnell Feit,
Ron Berman
Abstract:
Marketers often use A/B testing as a tool to compare marketing treatments in a test stage and then deploy the better-performing treatment to the remainder of the consumer population. While these tests have traditionally been analyzed using hypothesis testing, we re-frame them as an explicit trade-off between the opportunity cost of the test (where some customers receive a sub-optimal treatment) and the potential losses associated with deploying a sub-optimal treatment to the remainder of the population.
We derive a closed-form expression for the profit-maximizing test size and show that it is substantially smaller than typically recommended for a hypothesis test, particularly when the response is noisy or when the total population is small. The common practice of using small holdout groups can be rationalized by asymmetric priors. The proposed test design achieves nearly the same expected regret as the flexible, yet harder-to-implement multi-armed bandit under a wide range of conditions.
We demonstrate the benefits of the method in three different marketing contexts -- website design, display advertising and catalog tests -- in which we estimate priors from past data. In all three cases, the optimal sample sizes are substantially smaller than for a traditional hypothesis test, resulting in higher profit.
Submitted 21 May, 2019; v1 submitted 1 November, 2018;
originally announced November 2018.