-
Some Statistical and Data Challenges When Building Early-Stage Digital Experimentation and Measurement Capabilities
Authors:
C. H. Bryan Liu
Abstract:
Digital experimentation and measurement (DEM) capabilities -- the knowledge and tools necessary to run experiments with digital products, services, or experiences and measure their impact -- are fast becoming part of the standard toolkit of digital/data-driven organisations in guiding business decisions. Many large technology companies report having mature DEM capabilities, and several businesses have been established purely to manage experiments for others. Given the growing evidence that data-driven organisations tend to outperform their non-data-driven counterparts, there has never been a greater need for organisations to build/acquire DEM capabilities to thrive in the current digital era.
This thesis presents several novel approaches to statistical and data challenges for organisations building DEM capabilities. We focus on the fundamentals associated with building DEM capabilities, which lead to a richer understanding of the underlying assumptions and thus enable us to develop more appropriate capabilities. We address why one should engage in DEM by quantifying the benefits and risks of acquiring DEM capabilities. This is done using a ranking under lower uncertainty model, enabling one to construct a business case. We also examine what ingredients are necessary to run digital experiments. In addition to clarifying the existing literature around statistical tests, datasets, and methods in experimental design and causal inference, we construct an additional dataset and detailed case studies on applying state-of-the-art methods. Finally, we investigate when a digital experiment design would outperform another, leading to an evaluation framework that compares competing designs' data efficiency.
Submitted 6 May, 2024;
originally announced May 2024.
-
Measuring e-Commerce Metric Changes in Online Experiments
Authors:
C. H. Bryan Liu,
Emma J. McCoy
Abstract:
Digital technology organizations routinely use online experiments (e.g. A/B tests) to guide their product and business decisions. In e-commerce, we often measure changes to transaction- or item-based business metrics such as Average Basket Value (ABV), Average Basket Size (ABS), and Average Selling Price (ASP); yet it remains a common pitfall to ignore the dependency between the value/size of transactions/items during experiment design and analysis. We present empirical evidence on such dependency, its impact on measurement uncertainty, and practical implications on A/B test outcomes if left unmitigated. By making the evidence available, we hope to drive awareness of the pitfall among experimenters in e-commerce and hence encourage the adoption of established mitigation approaches. We also share lessons learned when incorporating selected mitigation approaches into our experimentation analysis platform currently in production.
Submitted 17 April, 2023; v1 submitted 31 October, 2022;
originally announced October 2022.
-
Datasets for Online Controlled Experiments
Authors:
C. H. Bryan Liu,
Ângelo Cardoso,
Paul Couturier,
Emma J. McCoy
Abstract:
Online Controlled Experiments (OCE) are the gold standard to measure impact and guide decisions for digital products and services. Despite many methodological advances in this area, the scarcity of public datasets and the lack of a systematic review and categorization hinder its development. We present the first survey and taxonomy for OCE datasets, which highlight the lack of a public dataset to support the design and running of experiments with adaptive stopping, an increasingly popular approach to enable quickly deploying improvements or rolling back degrading changes. We release the first such dataset, containing daily checkpoints of decision metrics from multiple, real experiments run on a global e-commerce platform. The dataset design is guided by a broader discussion on data requirements for common statistical tests used in digital experimentation. We demonstrate how to use the dataset in the adaptive stopping scenario using sequential and Bayesian hypothesis tests and learn the relevant parameters for each approach.
Submitted 14 January, 2022; v1 submitted 19 November, 2021;
originally announced November 2021.
-
An Evaluation Framework for Personalization Strategy Experiment Designs
Authors:
C. H. Bryan Liu,
Emma J. McCoy
Abstract:
Online Controlled Experiments (OCEs) are the gold standard in evaluating the effectiveness of changes to websites. An important type of OCE evaluates different personalization strategies, which present challenges in low test power and lack of full control in group assignment. We argue that getting the right experiment setup -- the allocation of users to treatment/analysis groups -- should take precedence over post-hoc variance reduction techniques in order to enable the scaling of the number of experiments. We present an evaluation framework that, along with a few simple rules of thumb, allows experimenters to quickly compare which experiment setup will lead to the highest probability of detecting a treatment effect under their particular circumstance.
Submitted 9 May, 2023; v1 submitted 22 July, 2020;
originally announced July 2020.
-
What is the value of experimentation & measurement?
Authors:
C. H. Bryan Liu,
Benjamin Paul Chamberlain
Abstract:
Experimentation and Measurement (E&M) capabilities allow organizations to accurately assess the impact of new propositions and to experiment with many variants of existing products. However, until now, the question of measuring the measurer, or valuing the contribution of an E&M capability to organizational success has not been addressed. We tackle this problem by analyzing how, by decreasing estimation uncertainty, E&M platforms allow for better prioritization. We quantify this benefit in terms of expected relative improvement in the performance of all new propositions and provide guidance for how much an E&M capability is worth and when organizations should invest in one.
Submitted 8 September, 2019;
originally announced September 2019.
-
A Recurrent Neural Network Survival Model: Predicting Web User Return Time
Authors:
Georg L. Grob,
Ângelo Cardoso,
C. H. Bryan Liu,
Duncan A. Little,
Benjamin Paul Chamberlain
Abstract:
The size of a website's active user base directly affects its value. Thus, it is important to monitor and influence a user's likelihood to return to a site. Essential to this is predicting when a user will return. Current state of the art approaches to solve this problem come in two flavors: (1) Recurrent Neural Network (RNN) based solutions and (2) survival analysis methods. We observe that both techniques are severely limited when applied to this problem. Survival models can only incorporate aggregate representations of users instead of automatically learning a representation directly from a raw time series of user actions. RNNs can automatically learn features, but can not be directly trained with examples of non-returning users who have no target value for their return time. We develop a novel RNN survival model that removes the limitations of the state of the art methods. We demonstrate that this model can successfully be applied to return time prediction on a large e-commerce dataset with a superior ability to discriminate between returning and non-returning users than either method applied in isolation.
Submitted 11 July, 2018;
originally announced July 2018.
-
Designing Experiments to Measure Incrementality on Facebook
Authors:
C. H. Bryan Liu,
Elaine M. Bettaney,
Benjamin Paul Chamberlain
Abstract:
The importance of Facebook advertising has risen dramatically in recent years, with the platform accounting for almost 20% of the global online ad spend in 2017. An important consideration in advertising is incrementality: how much of the change in an experimental metric is an advertising campaign responsible for. To measure incrementality, Facebook provide lift studies. As Facebook lift studies differ from standard A/B tests, the online experimentation literature does not describe how to calculate parameters such as power and minimum sample size. Facebook also offer multi-cell lift tests, which can be used to compare campaigns that don't have statistically identical audiences. In this case, there is no literature describing how to measure the significance of the difference in incrementality between cells, or how to estimate the power or minimum sample size. We fill these gaps in the literature by providing the statistical power and required sample size calculation for Facebook lift studies. We then generalise the statistical significance, power, and required sample size calculation to multi-cell lift studies. We represent our results theoretically in terms of the distributions of test metrics and in practical terms relating to the metrics used by practitioners, making all of our code publicly available.
Submitted 11 July, 2018; v1 submitted 7 June, 2018;
originally announced June 2018.
-
Online Controlled Experiments for Personalised e-Commerce Strategies: Design, Challenges, and Pitfalls
Authors:
C. H. Bryan Liu,
Benjamin Paul Chamberlain
Abstract:
Online controlled experiments are the primary tool for measuring the causal impact of product changes in digital businesses. It is increasingly common for digital products and services to interact with customers in a personalised way. Using online controlled experiments to optimise personalised interaction strategies is challenging because the usual assumption of statistically equivalent user groups is violated. Additionally, challenges are introduced by users qualifying for strategies based on dynamic, stochastic attributes. Traditional A/B tests can salvage statistical equivalence by pre-allocating users to control and exposed groups, but this dilutes the experimental metrics and reduces the test power. We present a stacked incrementality test framework that addresses problems with running online experiments for personalised user strategies. We derive bounds that show that our framework is superior to the best simple A/B test given enough users and that this condition is easily met for large scale online experiments. In addition, we provide a test power calculator and describe a selection of pitfalls and lessons learnt from our experience using it.
Submitted 1 July, 2021; v1 submitted 16 March, 2018;
originally announced March 2018.
-
Generalising Random Forest Parameter Optimisation to Include Stability and Cost
Authors:
C. H. Bryan Liu,
Benjamin Paul Chamberlain,
Duncan A. Little,
Angelo Cardoso
Abstract:
Random forests are among the most popular classification and regression methods used in industrial applications. To be effective, the parameters of random forests must be carefully tuned. This is usually done by choosing values that minimize the prediction error on a held out dataset. We argue that error reduction is only one of several metrics that must be considered when optimizing random forest parameters for commercial applications. We propose a novel metric that captures the stability of random forests predictions, which we argue is key for scenarios that require successive predictions. We motivate the need for multi-criteria optimization by showing that in practical applications, simply choosing the parameters that lead to the lowest error can introduce unnecessary costs and produce predictions that are not stable across independent runs. To optimize this multi-criteria trade-off, we present a new framework that efficiently finds a principled balance between these three considerations using Bayesian optimisation. The pitfalls of optimising forest parameters purely for error reduction are demonstrated using two publicly available real world datasets. We show that our framework leads to parameter settings that are markedly different from the values discovered by error reduction metrics.
Submitted 13 July, 2017; v1 submitted 29 June, 2017;
originally announced June 2017.
-
Customer Lifetime Value Prediction Using Embeddings
Authors:
Benjamin Paul Chamberlain,
Angelo Cardoso,
C. H. Bryan Liu,
Roberto Pagliari,
Marc Peter Deisenroth
Abstract:
We describe the Customer LifeTime Value (CLTV) prediction system deployed at ASOS.com, a global online fashion retailer. CLTV prediction is an important problem in e-commerce where an accurate estimate of future value allows retailers to effectively allocate marketing spend, identify and nurture high value customers and mitigate exposure to losses. The system at ASOS provides daily estimates of the future value of every customer and is one of the cornerstones of the personalised shopping experience. The state of the art in this domain uses large numbers of handcrafted features and ensemble regressors to forecast value, predict churn and evaluate customer loyalty. Recently, domains including language, vision and speech have shown dramatic advances by replacing handcrafted features with features that are learned automatically from data. We detail the system deployed at ASOS and show that learning feature representations is a promising extension to the state of the art in CLTV modelling. We propose a novel way to generate embeddings of customers, which addresses the issue of the ever changing product catalogue and obtain a significant improvement over an exhaustive set of handcrafted features.
Submitted 6 July, 2017; v1 submitted 7 March, 2017;
originally announced March 2017.