Search | arXiv e-print repository

doi 10.25560/110307

Some Statistical and Data Challenges When Building Early-Stage Digital Experimentation and Measurement Capabilities

Abstract: Digital experimentation and measurement (DEM) capabilities -- the knowledge and tools necessary to run experiments with digital products, services, or experiences and measure their impact -- are fast becoming part of the standard toolkit of digital/data-driven organisations in guiding business decisions. Many large technology companies report having mature DEM capabilities, and several businesses have been established purely to manage experiments for others. Given the growing evidence that data-driven organisations tend to outperform their non-data-driven counterparts, there has never been a greater need for organisations to build/acquire DEM capabilities to thrive in the current digital era. This thesis presents several novel approaches to statistical and data challenges for organisations building DEM capabilities. We focus on the fundamentals associated with building DEM capabilities, which lead to a richer understanding of the underlying assumptions and thus enable us to develop more appropriate capabilities. We address why one should engage in DEM by quantifying the benefits and risks of acquiring DEM capabilities. This is done using a ranking under lower uncertainty model, enabling one to construct a business case. We also examine what ingredients are necessary to run digital experiments. In addition to clarifying the existing literature around statistical tests, datasets, and methods in experimental design and causal inference, we construct an additional dataset and detailed case studies on applying state-of-the-art methods. Finally, we investigate when a digital experiment design would outperform another, leading to an evaluation framework that compares competing designs' data efficiency.

Submitted 6 May, 2024; originally announced May 2024.

Comments: PhD thesis. Imperial College London. Official library version available on: https://spiral.imperial.ac.uk/handle/10044/1/110307

arXiv:2210.17187

doi 10.1145/3543873.3584654

Measuring e-Commerce Metric Changes in Online Experiments

Authors: C. H. Bryan Liu, Emma J. McCoy

Abstract: Digital technology organizations routinely use online experiments (e.g. A/B tests) to guide their product and business decisions. In e-commerce, we often measure changes to transaction- or item-based business metrics such as Average Basket Value (ABV), Average Basket Size (ABS), and Average Selling Price (ASP); yet it remains a common pitfall to ignore the dependency between the value/size of transactions/items during experiment design and analysis. We present empirical evidence on such dependency, its impact on measurement uncertainty, and practical implications on A/B test outcomes if left unmitigated. By making the evidence available, we hope to drive awareness of the pitfall among experimenters in e-commerce and hence encourage the adoption of established mitigation approaches. We also share lessons learned when incorporating selected mitigation approaches into our experimentation analysis platform currently in production.

Submitted 17 April, 2023; v1 submitted 31 October, 2022; originally announced October 2022.

Comments: To appear in WWW '23 Companion. 5 pages, 4 figures, 2 tables. The experiment code and results on the two publicly available datasets are available on GitHub/Zenodo: https://doi.org/10.5281/zenodo.7659092. This version supersedes a previous working paper with a different title

arXiv:2111.10198

Datasets for Online Controlled Experiments

Authors: C. H. Bryan Liu, Ângelo Cardoso, Paul Couturier, Emma J. McCoy

Abstract: Online Controlled Experiments (OCE) are the gold standard to measure impact and guide decisions for digital products and services. Despite many methodological advances in this area, the scarcity of public datasets and the lack of a systematic review and categorization hinder its development. We present the first survey and taxonomy for OCE datasets, which highlight the lack of a public dataset to support the design and running of experiments with adaptive stopping, an increasingly popular approach to enable quickly deploying improvements or rolling back degrading changes. We release the first such dataset, containing daily checkpoints of decision metrics from multiple, real experiments run on a global e-commerce platform. The dataset design is guided by a broader discussion on data requirements for common statistical tests used in digital experimentation. We demonstrate how to use the dataset in the adaptive stopping scenario using sequential and Bayesian hypothesis tests and learn the relevant parameters for each approach.

Submitted 14 January, 2022; v1 submitted 19 November, 2021; originally announced November 2021.

Comments: 35th Conference on Neural Information Processing Systems (NeurIPS 2021) Track on Datasets and Benchmarks. 17 pages, 2 figures, 2 tables. Dataset available on Open Science Framework: https://osf.io/64jsb/

arXiv:2007.11638

An Evaluation Framework for Personalization Strategy Experiment Designs

Authors: C. H. Bryan Liu, Emma J. McCoy

Abstract: Online Controlled Experiments (OCEs) are the gold standard in evaluating the effectiveness of changes to websites. An important type of OCE evaluates different personalization strategies, which present challenges in low test power and lack of full control in group assignment. We argue that getting the right experiment setup -- the allocation of users to treatment/analysis groups -- should take precedence of post-hoc variance reduction techniques in order to enable the scaling of the number of experiments. We present an evaluation framework that, along with a few simple rule of thumbs, allow experimenters to quickly compare which experiment setup will lead to the highest probability of detecting a treatment effect under their particular circumstance.

Submitted 9 May, 2023; v1 submitted 22 July, 2020; originally announced July 2020.

Comments: Presented in the AdKDD 2020 workshop, in conjunction with The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD 2020. Main paper: 7 pages, 2 figures, 2 tables, Supplementary document: 6 pages. Fixed minor typos in Eqs. (17) and (18), and Expr. (27a)

arXiv:1909.03457

What is the value of experimentation & measurement?

Authors: C. H. Bryan Liu, Benjamin Paul Chamberlain

Abstract: Experimentation and Measurement (E&M) capabilities allow organizations to accurately assess the impact of new propositions and to experiment with many variants of existing products. However, until now, the question of measuring the measurer, or valuing the contribution of an E&M capability to organizational success has not been addressed. We tackle this problem by analyzing how, by decreasing estimation uncertainty, E&M platforms allow for better prioritization. We quantify this benefit in terms of expected relative improvement in the performance of all new propositions and provide guidance for how much an E&M capability is worth and when organizations should invest in one.

Submitted 8 September, 2019; originally announced September 2019.

Comments: Accepted into IEEE International Conference on Data Mining (ICDM) 2019. Main paper: 6 pages, 3 figures; Supplementary document: 7 pages, 2 figures. Code available on: https://github.com/liuchbryan/value_of_experimentation

arXiv:1807.04098

doi 10.1007/978-3-030-10997-4_10

A Recurrent Neural Network Survival Model: Predicting Web User Return Time

Authors: Georg L. Grob, Ângelo Cardoso, C. H. Bryan Liu, Duncan A. Little, Benjamin Paul Chamberlain

Abstract: The size of a website's active user base directly affects its value. Thus, it is important to monitor and influence a user's likelihood to return to a site. Essential to this is predicting when a user will return. Current state of the art approaches to solve this problem come in two flavors: (1) Recurrent Neural Network (RNN) based solutions and (2) survival analysis methods. We observe that both techniques are severely limited when applied to this problem. Survival models can only incorporate aggregate representations of users instead of automatically learning a representation directly from a raw time series of user actions. RNNs can automatically learn features, but can not be directly trained with examples of non-returning users who have no target value for their return time. We develop a novel RNN survival model that removes the limitations of the state of the art methods. We demonstrate that this model can successfully be applied to return time prediction on a large e-commerce dataset with a superior ability to discriminate between returning and non-returning users than either method applied in isolation.

Submitted 11 July, 2018; originally announced July 2018.

Comments: Accepted into ECML PKDD 2018; 8 figures and 1 table

Journal ref: Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2018. Lecture Notes in Computer Science, vol 11053. pp 152-168

arXiv:1806.02588

Designing Experiments to Measure Incrementality on Facebook

Authors: C. H. Bryan Liu, Elaine M. Bettaney, Benjamin Paul Chamberlain

Abstract: The importance of Facebook advertising has risen dramatically in recent years, with the platform accounting for almost 20% of the global online ad spend in 2017. An important consideration in advertising is incrementality: how much of the change in an experimental metric is an advertising campaign responsible for. To measure incrementality, Facebook provide lift studies. As Facebook lift studies differ from standard A/B tests, the online experimentation literature does not describe how to calculate parameters such as power and minimum sample size. Facebook also offer multi-cell lift tests, which can be used to compare campaigns that don't have statistically identical audiences. In this case, there is no literature describing how to measure the significance of the difference in incrementality between cells, or how to estimate the power or minimum sample size. We fill these gaps in the literature by providing the statistical power and required sample size calculation for Facebook lift studies. We then generalise the statistical significance, power, and required sample size calculation to multi-cell lift studies. We represent our results theoretically in terms of the distributions of test metrics and in practical terms relating to the metrics used by practitioners, making all of our code publicly available.

Submitted 11 July, 2018; v1 submitted 7 June, 2018; originally announced June 2018.

Comments: Accepted into 2018 AdKDD & TargetAd Workshop in conjunction with KDD 2018; 6 pages, 4 figures, and 2 tables

arXiv:1803.06258

Online Controlled Experiments for Personalised e-Commerce Strategies: Design, Challenges, and Pitfalls

Authors: C. H. Bryan Liu, Benjamin Paul Chamberlain

Abstract: Online controlled experiments are the primary tool for measuring the causal impact of product changes in digital businesses. It is increasingly common for digital products and services to interact with customers in a personalised way. Using online controlled experiments to optimise personalised interaction strategies is challenging because the usual assumption of statistically equivalent user groups is violated. Additionally, challenges are introduced by users qualifying for strategies based on dynamic, stochastic attributes. Traditional A/B tests can salvage statistical equivalence by pre-allocating users to control and exposed groups, but this dilutes the experimental metrics and reduces the test power. We present a stacked incrementality test framework that addresses problems with running online experiments for personalised user strategies. We derive bounds that show that our framework is superior to the best simple A/B test given enough users and that this condition is easily met for large scale online experiments. In addition, we provide a test power calculator and describe a selection of pitfalls and lessons learnt from our experience using it.

Submitted 1 July, 2021; v1 submitted 16 March, 2018; originally announced March 2018.

Comments: Not peer-reviewed but retained for historic interest. Removed an erroneous statement on Welch's t-test assumptions in Section 3.2. 9 pages, 7 figures

arXiv:1706.09865

doi 10.1007/978-3-319-71273-4_9

Generalising Random Forest Parameter Optimisation to Include Stability and Cost

Authors: C. H. Bryan Liu, Benjamin Paul Chamberlain, Duncan A. Little, Angelo Cardoso

Abstract: Random forests are among the most popular classification and regression methods used in industrial applications. To be effective, the parameters of random forests must be carefully tuned. This is usually done by choosing values that minimize the prediction error on a held out dataset. We argue that error reduction is only one of several metrics that must be considered when optimizing random forest parameters for commercial applications. We propose a novel metric that captures the stability of random forests predictions, which we argue is key for scenarios that require successive predictions. We motivate the need for multi-criteria optimization by showing that in practical applications, simply choosing the parameters that lead to the lowest error can introduce unnecessary costs and produce predictions that are not stable across independent runs. To optimize this multi-criteria trade-off, we present a new framework that efficiently finds a principled balance between these three considerations using Bayesian optimisation. The pitfalls of optimising forest parameters purely for error reduction are demonstrated using two publicly available real world datasets. We show that our framework leads to parameter settings that are markedly different from the values discovered by error reduction metrics.

Submitted 13 July, 2017; v1 submitted 29 June, 2017; originally announced June 2017.

Comments: To appear in ECML-PKDD 2017

Journal ref: Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2017. LNCS vol 10536, pp. 102-113 (2017)

arXiv:1703.02596

doi 10.1145/3097983.3098123

Customer Lifetime Value Prediction Using Embeddings

Authors: Benjamin Paul Chamberlain, Angelo Cardoso, C. H. Bryan Liu, Roberto Pagliari, Marc Peter Deisenroth

Abstract: We describe the Customer LifeTime Value (CLTV) prediction system deployed at ASOS.com, a global online fashion retailer. CLTV prediction is an important problem in e-commerce where an accurate estimate of future value allows retailers to effectively allocate marketing spend, identify and nurture high value customers and mitigate exposure to losses. The system at ASOS provides daily estimates of the future value of every customer and is one of the cornerstones of the personalised shopping experience. The state of the art in this domain uses large numbers of handcrafted features and ensemble regressors to forecast value, predict churn and evaluate customer loyalty. Recently, domains including language, vision and speech have shown dramatic advances by replacing handcrafted features with features that are learned automatically from data. We detail the system deployed at ASOS and show that learning feature representations is a promising extension to the state of the art in CLTV modelling. We propose a novel way to generate embeddings of customers, which addresses the issue of the ever changing product catalogue and obtain a significant improvement over an exhaustive set of handcrafted features.

Submitted 6 July, 2017; v1 submitted 7 March, 2017; originally announced March 2017.

Comments: 10 pages, 11 figures

Journal ref: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining Pages 1753-1762, 2017

Showing 1–10 of 10 results for author: Liu, C H B