-
A Study of Symbiosis Bias in A/B Tests of Recommendation Algorithms
Authors:
David Holtz,
Jennifer Brennan,
Jean Pouget-Abadie
Abstract:
One assumption underlying the unbiasedness of global treatment effect estimates from randomized experiments is the stable unit treatment value assumption (SUTVA). Many experiments that compare the efficacy of different recommendation algorithms violate SUTVA, because each algorithm is trained on a pool of shared data, often coming from a mixture of recommendation algorithms in the experiment. We explore, through simulation, cluster randomized and data-diverted solutions to mitigating this bias, which we call "symbiosis bias."
Submitted 14 September, 2023; v1 submitted 13 September, 2023;
originally announced September 2023.
-
More Reviews May Not Help: Evidence from Incentivized First Reviews on Airbnb
Authors:
Andrey Fradkin,
David Holtz
Abstract:
Online reviews are typically written by volunteers and, as a consequence, information about seller quality may be under-provided in digital marketplaces. We study the extent of this under-provision in a large-scale randomized experiment conducted by Airbnb. In this experiment, buyers are offered a coupon to review listings that have no prior reviews. The treatment induces additional reviews and these reviews tend to be more negative than reviews in the control group, consistent with selection bias in reviewing. Reviews induced by the treatment result in a temporary increase in transactions but these transactions are for fewer nights, on average. The effects on transactions and nights per transaction cancel out so that there is no detectable effect on total nights sold and revenue. Measures of transaction quality in the treatment group fall, suggesting that incentivized reviews do not improve matching. We show how market conditions and the design of the reputation system can explain our findings.
Submitted 17 December, 2021;
originally announced December 2021.
-
How Work From Home Affects Collaboration: A Large-Scale Study of Information Workers in a Natural Experiment During COVID-19
Authors:
Longqi Yang,
Sonia Jaffe,
David Holtz,
Siddharth Suri,
Shilpi Sinha,
Jeffrey Weston,
Connor Joyce,
Neha Shah,
Kevin Sherman,
CJ Lee,
Brent Hecht,
Jaime Teevan
Abstract:
The COVID-19 pandemic has had a wide-ranging impact on information workers, such as higher stress levels, increased workloads, new workstreams, and more caregiving responsibilities during lockdown. COVID-19 also caused the overwhelming majority of information workers to rapidly shift to working from home (WFH). The central question this work addresses is: can we isolate the effects of WFH on information workers' collaboration activities from all other factors, especially the other effects of COVID-19? This is important because in the future, WFH is likely to be more common than it was prior to the pandemic.
We use difference-in-differences (DiD), a causal identification strategy commonly used in the social sciences, to control for unobserved confounding factors and estimate the causal effect of WFH. Our analysis relies on measuring the difference in changes between those who WFH prior to COVID-19 and those who did not. Our preliminary results suggest that on average, people spent more time on collaboration in April (Post WFH mandate) than in February (Pre WFH mandate), but this is primarily due to factors other than WFH, such as lockdowns during the pandemic. The change attributable to WFH specifically is in the opposite direction: less time on collaboration and more focus time. This reversal shows the importance of using causal inference: a simple analysis would have resulted in the wrong conclusion. We further find that the effect of WFH is moderated by individual remote collaboration experience prior to WFH. Meanwhile, the medium for collaboration has also shifted due to WFH: instant messages were used more, whereas scheduled meetings were used less. We discuss design implications -- how future WFH may affect focused work, collaborative work, and creative work.
Submitted 30 July, 2020;
originally announced July 2020.
-
Reducing Interference Bias in Online Marketplace Pricing Experiments
Authors:
David Holtz,
Ruben Lobel,
Inessa Liskovich,
Sinan Aral
Abstract:
Online marketplace designers frequently run A/B tests to measure the impact of proposed product changes. However, given that marketplaces are inherently connected, total average treatment effect estimates obtained through Bernoulli randomized experiments are often biased due to violations of the stable unit treatment value assumption. This can be particularly problematic for experiments that impact sellers' strategic choices, affect buyers' preferences over items in their consideration set, or change buyers' consideration sets altogether. In this work, we measure and reduce bias due to interference in online marketplace experiments by using observational data to create clusters of similar listings, and then using those clusters to conduct cluster-randomized field experiments. We provide a lower bound on the magnitude of bias due to interference by conducting a meta-experiment that randomizes over two experiment designs: one Bernoulli randomized, one cluster randomized. In both meta-experiment arms, treatment sellers are subject to a different platform fee policy than control sellers, resulting in different prices for buyers. By conducting a joint analysis of the two meta-experiment arms, we find a large and statistically significant difference between the total average treatment effect estimates obtained with the two designs, and estimate that 32.60% of the Bernoulli-randomized treatment effect estimate is due to interference bias. We also find weak evidence that the magnitude and/or direction of interference bias depends on the extent to which a marketplace is supply- or demand-constrained, and analyze a second meta-experiment to highlight the difficulty of detecting interference bias when treatment interventions require intention-to-treat analysis.
Submitted 26 April, 2020;
originally announced April 2020.
-
Limiting Bias from Test-Control Interference in Online Marketplace Experiments
Authors:
David Holtz,
Sinan Aral
Abstract:
In an A/B test, the typical objective is to measure the total average treatment effect (TATE), which measures the difference between the average outcome if all users were treated and the average outcome if all users were untreated. However, a simple difference-in-means estimator will give a biased estimate of the TATE when outcomes of control units depend on the outcomes of treatment units, an issue we refer to as test-control interference. Using a simulation built on top of data from Airbnb, this paper considers the use of methods from the network interference literature for online marketplace experimentation. We model the marketplace as a network in which an edge exists between two sellers if their goods substitute for one another. We then simulate seller outcomes, specifically considering a "status quo" context and "treatment" context that forces all sellers to lower their prices. We use the same simulation framework to approximate TATE distributions produced by using blocked graph cluster randomization, exposure modeling, and the Hajek estimator for the difference in means. We find that while blocked graph cluster randomization reduces the bias of the naive difference-in-means estimator by as much as 62%, it also significantly increases the variance of the estimator. On the other hand, the use of more sophisticated estimators produces mixed results. While some provide (small) additional reductions in bias and small reductions in variance, others lead to increased bias and variance. Overall, our results suggest that experiment design and analysis techniques from the network experimentation literature are promising tools for reducing bias due to test-control interference in marketplace experiments.
Submitted 25 April, 2020;
originally announced April 2020.
-
The Engagement-Diversity Connection: Evidence from a Field Experiment on Spotify
Authors:
David Holtz,
Benjamin Carterette,
Praveen Chandar,
Zahra Nazari,
Henriette Cramer,
Sinan Aral
Abstract:
It remains unknown whether personalized recommendations increase or decrease the diversity of content people consume. We present results from a randomized field experiment on Spotify testing the effect of personalized recommendations on consumption diversity. In the experiment, both control and treatment users were given podcast recommendations, with the sole aim of increasing podcast consumption. Treatment users' recommendations were personalized based on their music listening history, whereas control users were recommended popular podcasts among users in their demographic group. We find that, on average, the treatment increased podcast streams by 28.90%. However, the treatment also decreased the average individual-level diversity of podcast streams by 11.51%, and increased the aggregate diversity of podcast streams by 5.96%, indicating that personalized recommendations have the potential to create patterns of consumption that are homogeneous within and diverse across users, a pattern reflecting Balkanization. Our results provide evidence of an "engagement-diversity trade-off" when recommendations are optimized solely to drive consumption: while personalized recommendations increase user engagement, they also affect the diversity of consumed content. This shift in consumption diversity can affect user retention and lifetime value, and impact the optimal strategy for content producers. We also observe evidence that our treatment affected streams from sections of Spotify's app not directly affected by the experiment, suggesting that exposure to personalized recommendations can affect the content that users consume organically. We believe these findings highlight the need for academics and practitioners to continue investing in personalization methods that explicitly take into account the diversity of content recommended.
Submitted 17 March, 2020;
originally announced March 2020.
-
The Atacama Cosmology Telescope: Cosmological parameters from three seasons of data
Authors:
Jonathan L. Sievers,
Renée A. Hlozek,
Michael R. Nolta,
Viviana Acquaviva,
Graeme E. Addison,
Peter A. R. Ade,
Paula Aguirre,
Mandana Amiri,
John William Appel,
L. Felipe Barrientos,
Elia S. Battistelli,
Nick Battaglia,
J. Richard Bond,
Ben Brown,
Bryce Burger,
Erminia Calabrese,
Jay Chervenak,
Devin Crichton,
Sudeep Das,
Mark J. Devlin,
Simon R. Dicker,
W. Bertrand Doriese,
Joanna Dunkley,
Rolando Dünner,
Thomas Essinger-Hileman, et al. (68 additional authors not shown)
Abstract:
We present constraints on cosmological and astrophysical parameters from high-resolution microwave background maps at 148 GHz and 218 GHz made by the Atacama Cosmology Telescope (ACT) in three seasons of observations from 2008 to 2010. A model of primary cosmological and secondary foreground parameters is fit to the map power spectra and lensing deflection power spectrum, including contributions from both the thermal Sunyaev-Zeldovich (tSZ) effect and the kinematic Sunyaev-Zeldovich (kSZ) effect, Poisson and correlated anisotropy from unresolved infrared sources, radio sources, and the correlation between the tSZ effect and infrared sources. The power ell^2 C_ell/2pi of the thermal SZ power spectrum at 148 GHz is measured to be 3.4 +/- 1.4 muK^2 at ell=3000, while the corresponding amplitude of the kinematic SZ power spectrum has a 95% confidence level upper limit of 8.6 muK^2. Combining ACT power spectra with the WMAP 7-year temperature and polarization power spectra, we find excellent consistency with the LCDM model. We constrain the number of effective relativistic degrees of freedom in the early universe to be Neff=2.79 +/- 0.56, in agreement with the canonical value of Neff=3.046 for three massless neutrinos. We constrain the sum of the neutrino masses to be Sigma m_nu < 0.39 eV at 95% confidence when combining ACT and WMAP 7-year data with BAO and Hubble constant measurements. We constrain the amount of primordial helium to be Yp = 0.225 +/- 0.034, and measure no variation in the fine structure constant alpha since recombination, with alpha/alpha0 = 1.004 +/- 0.005. We also find no evidence for any running of the scalar spectral index, dns/dlnk = -0.004 +/- 0.012.
Submitted 11 October, 2013; v1 submitted 4 January, 2013;
originally announced January 2013.
-
A Study of the Residual 39Ar Content in Argon from Underground Sources
Authors:
J. Xu,
F. Calaprice,
C. Galbiati,
A. Goretti,
G. Guray,
T. Hohman,
D. Holtz,
A. Ianni,
M. Laubenstein,
B. Loer,
C. Love,
C. J. Martoff,
D. Montanari,
S. Mukhopadhyay,
A. Nelson,
S. D. Rountree,
R. B. Vogelaar,
A. Wright
Abstract:
The discovery of argon from underground sources with significantly less 39Ar than atmospheric argon was an important step in the development of direct-detection dark matter experiments using argon as the active target. We report on the design and operation of a low background detector with a single phase liquid argon target that was built to study the 39Ar content of the underground argon. Underground argon from the Kinder Morgan CO2 plant in Cortez, Colorado was determined to have less than 0.65% of the 39Ar activity in atmospheric argon.
Submitted 26 April, 2012;
originally announced April 2012.
-
The Atacama Cosmology Telescope: Cosmology from Galaxy Clusters Detected via the Sunyaev-Zel'dovich Effect
Authors:
Neelima Sehgal,
Hy Trac,
Viviana Acquaviva,
Peter A. R. Ade,
Paula Aguirre,
Mandana Amiri,
John W. Appel,
L. Felipe Barrientos,
Elia S. Battistelli,
J. Richard Bond,
Ben Brown,
Bryce Burger,
Jay Chervenak,
Sudeep Das,
Mark J. Devlin,
Simon R. Dicker,
W. Bertrand Doriese,
Joanna Dunkley,
Rolando Dünner,
Thomas Essinger-Hileman,
Ryan P. Fisher,
Joseph W. Fowler,
Amir Hajian,
Mark Halpern,
Matthew Hasselfield, et al. (44 additional authors not shown)
Abstract:
We present constraints on cosmological parameters based on a sample of Sunyaev-Zel'dovich-selected galaxy clusters detected in a millimeter-wave survey by the Atacama Cosmology Telescope. The cluster sample used in this analysis consists of 9 optically-confirmed high-mass clusters comprising the high-significance end of the total cluster sample identified in 455 square degrees of sky surveyed during 2008 at 148 GHz. We focus on the most massive systems to reduce the degeneracy between unknown cluster astrophysics and cosmology derived from SZ surveys. We describe the scaling relation between cluster mass and SZ signal with a 4-parameter fit. Marginalizing over the values of the parameters in this fit with conservative priors gives sigma_8 = 0.851 +/- 0.115 and w = -1.14 +/- 0.35 for a spatially-flat wCDM cosmological model with WMAP 7-year priors on cosmological parameters. This gives a modest improvement in statistical uncertainty over WMAP 7-year constraints alone. Fixing the scaling relation between cluster mass and SZ signal to a fiducial relation obtained from numerical simulations and calibrated by X-ray observations, we find sigma_8 = 0.821 +/- 0.044 and w = -1.05 +/- 0.20. These results are consistent with constraints from WMAP 7 plus baryon acoustic oscillations plus type Ia supernovae, which give sigma_8 = 0.802 +/- 0.038 and w = -0.98 +/- 0.053. A stacking analysis of the clusters in this sample compared to clusters simulated assuming the fiducial model also shows good agreement. These results suggest that, given the sample of clusters used here, both the astrophysics of massive clusters and the cosmological parameters derived from them are broadly consistent with current models.
Submitted 5 October, 2010;
originally announced October 2010.