-
Empirical Bayes mean estimation with nonparametric errors via order statistic regression on replicated data
Authors:
Nikolaos Ignatiadis,
Sujayam Saha,
Dennis L. Sun,
Omkar Muralidharan
Abstract:
We study empirical Bayes estimation of the effect sizes of $N$ units from $K$ noisy observations on each unit. We show that it is possible to achieve near-Bayes optimal mean squared error, without any assumptions or knowledge about the effect size distribution or the noise. The noise distribution can be heteroskedastic and vary arbitrarily from unit to unit. Our proposal, which we call Aurora, leverages the replication inherent in the $K$ observations per unit and recasts the effect size estimation problem as a general regression problem. Aurora with linear regression provably matches the performance of a wide array of estimators including the sample mean, the trimmed mean, the sample median, as well as James-Stein shrunk versions thereof. Aurora automates effect size estimation for Internet-scale datasets, as we demonstrate on data from a large technology firm.
Submitted 10 August, 2021; v1 submitted 14 November, 2019;
originally announced November 2019.
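
The abstract above describes recasting effect size estimation as a regression problem on replicated data. A minimal sketch of one way to read that idea follows: hold out one replicate per unit as the response, sort the remaining replicates into order statistics, fit a linear regression across units, and average the fitted values over the choice of held-out replicate. The function name `aurora_linear`, the plain least-squares fit, and the averaging step are illustrative assumptions, not the authors' reference implementation.

```python
import numpy as np

def aurora_linear(Z):
    """Sketch of order statistic regression on replicated data (assumed form).

    Z is an (N, K) array holding K noisy replicates for each of N units.
    Returns an (N,) array of estimated effect sizes.
    """
    Z = np.asarray(Z, dtype=float)
    N, K = Z.shape
    preds = np.zeros(N)
    for j in range(K):
        y = Z[:, j]                                    # held-out replicate is the response
        X = np.sort(np.delete(Z, j, axis=1), axis=1)   # order statistics of the other K-1
        X1 = np.column_stack([np.ones(N), X])          # add an intercept column
        coef, *_ = np.linalg.lstsq(X1, y, rcond=None)  # ordinary least squares across units
        preds += X1 @ coef
    return preds / K                                   # average over held-out choices
```

Because the held-out replicate is independent of the others given the unit, its regression on the order statistics targets the unit's mean. A linear fit can put equal weight on every order statistic (a shrunk sample mean) or concentrate weight on the middle ones (median-like or trimmed-mean-like estimators), which is consistent with the abstract's list of estimators the method matches.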
-
Second Order Calibration: A Simple Way to Get Approximate Posteriors
Authors:
Omkar Muralidharan,
Amir Najmi
Abstract:
Many large-scale machine learning problems involve estimating an unknown parameter $θ_{i}$ for each of many items. For example, a key problem in sponsored search is to estimate the click-through rate (CTR) of each of billions of query-ad pairs. Most common methods, though, only give a point estimate of each $θ_{i}$. A posterior distribution for each $θ_{i}$ is usually more useful but harder to get.
We present a simple post-processing technique that takes point estimates or scores $t_{i}$ (from any method) and estimates an approximate posterior for each $θ_{i}$. We build on the idea of calibration, a common post-processing technique that estimates $\mathrm{E}\left(θ_{i} \mid t_{i}\right)$. Our method, second order calibration, uses empirical Bayes methods to estimate the distribution of $θ_{i} \mid t_{i}$ and uses the estimated distribution as an approximation to the posterior distribution of $θ_{i}$. We show that this can yield improved point estimates and useful accuracy estimates. The method scales to large problems: our motivating example is a CTR estimation problem involving tens of billions of query-ad pairs.
Submitted 28 October, 2015;
originally announced October 2015.
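
The abstract above builds on ordinary (first order) calibration, which estimates $\mathrm{E}(θ_{i} \mid t_{i})$ from scores $t_{i}$. As a rough illustration of that baseline step only, the sketch below bins items by score quantile and replaces each score with the average observed outcome in its bin. The function name `binned_calibration`, the equal-frequency binning, and the use of one binary outcome per item are assumptions made for the example; the second order step in the paper, which estimates the whole distribution of $θ_{i} \mid t_{i}$, is not reproduced here.

```python
import numpy as np

def binned_calibration(scores, outcomes, n_bins=10):
    """First order calibration by quantile binning (illustrative sketch).

    scores:   (N,) array of point estimates t_i from any method.
    outcomes: (N,) array of observed outcomes (e.g. 0/1 clicks).
    Returns an (N,) array estimating E(theta_i | t_i) by the bin average.
    """
    scores = np.asarray(scores, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    # Equal-frequency bin edges taken from the score quantiles.
    edges = np.quantile(scores, np.linspace(0.0, 1.0, n_bins + 1))
    idx = np.clip(np.searchsorted(edges, scores, side="right") - 1, 0, n_bins - 1)
    sums = np.bincount(idx, weights=outcomes, minlength=n_bins)
    counts = np.bincount(idx, minlength=n_bins)
    # Average outcome per bin; empty bins are left as NaN.
    bin_means = np.divide(sums, counts, out=np.full(n_bins, np.nan), where=counts > 0)
    return bin_means[idx]
```

Binning is only one of several ways to estimate the calibration curve; a smoother such as isotonic regression would serve the same purpose in this sketch.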
-
Teaching Statistics at Google Scale
Authors:
Nicholas Chamandy,
Omkar Muralidharan,
Stefan Wager
Abstract:
Modern data and applications pose very different challenges from those of the 1950s or even the 1980s. Students contemplating a career in statistics or data science need to have the tools to tackle problems involving massive, heavy-tailed data, often interacting with live, complex systems. However, despite the deepening connections between engineering and modern data science, we argue that training in classical statistical concepts plays a central role in preparing students to solve Google-scale problems. To this end, we present three industrial applications where significant modern data challenges were overcome by statistical thinking.
Submitted 16 August, 2015; v1 submitted 6 August, 2015;
originally announced August 2015.
-
Feedback Detection for Live Predictors
Authors:
Stefan Wager,
Nick Chamandy,
Omkar Muralidharan,
Amir Najmi
Abstract:
A predictor that is deployed in a live production system may perturb the features it uses to make predictions. Such a feedback loop can occur, for example, when a model that predicts a certain type of behavior ends up causing the behavior it predicts, thus creating a self-fulfilling prophecy. In this paper we analyze predictor feedback detection as a causal inference problem, and introduce a local randomization scheme that can be used to detect non-linear feedback in real-world problems. We conduct a pilot study for our proposed methodology using a predictive system currently deployed as a part of a search engine.
Submitted 31 October, 2014; v1 submitted 10 October, 2013;
originally announced October 2013.
-
On Calibrated Predictions for Auction Selection Mechanisms
Authors:
H. Brendan McMahan,
Omkar Muralidharan
Abstract:
Calibration is a basic property for prediction systems, and algorithms for achieving it are well-studied in both statistics and machine learning. In many applications, however, the predictions are used to make decisions that select which observations are made. This makes calibration difficult, as adjusting predictions to achieve calibration changes future data. We focus on click-through-rate (CTR) prediction for search ad auctions. Here, CTR predictions are used by an auction that determines which ads are shown, and we want to maximize the value generated by the auction.
We show that certain natural notions of calibration can be impossible to achieve, depending on the details of the auction. We also show that it can be impossible to maximize auction efficiency while using calibrated predictions. Finally, we give conditions under which calibration is achievable and simultaneously maximizes auction efficiency: roughly speaking, bids and queries must not contain information about CTRs that is not already captured by the predictions.
Submitted 16 November, 2012;
originally announced November 2012.
-
Detecting mutations in mixed sample sequencing data using empirical Bayes
Authors:
Omkar Muralidharan,
Georges Natsoulis,
John Bell,
Hanlee Ji,
Nancy R. Zhang
Abstract:
We develop statistically based methods to detect single nucleotide DNA mutations in next generation sequencing data. Sequencing generates counts of the number of times each base was observed at hundreds of thousands to billions of genome positions in each sample. Using these counts to detect mutations is challenging because mutations may have very low prevalence and sequencing error rates vary dramatically by genome position. The discreteness of sequencing data also creates a difficult multiple testing problem: current false discovery rate methods are designed for continuous data, and work poorly, if at all, on discrete data. We show that a simple randomization technique lets us use continuous false discovery rate methods on discrete data. Our approach is a useful way to estimate false discovery rates for any collection of discrete test statistics, and is hence not limited to sequencing data. We then use an empirical Bayes model to capture different sources of variation in sequencing error rates. The resulting method outperforms existing detection approaches on example data sets.
Submitted 28 September, 2012;
originally announced September 2012.
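
The abstract above mentions a randomization technique that makes continuous false discovery rate methods usable on discrete data. One standard construction of this kind is the randomized p-value, which spreads the probability mass at the observed value uniformly so that the p-value is exactly uniform under the null. The sketch below shows it for an upper-tail binomial test; whether this matches the paper's exact construction is an assumption, and the names `binom_pmf` and `randomized_pvalue` are hypothetical.

```python
import math
import numpy as np

def binom_pmf(k, n, p):
    """Probability that a Binomial(n, p) variable equals k."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def randomized_pvalue(k, n, p0, rng):
    """Upper-tail randomized p-value for an observed count k under Binomial(n, p0).

    Computes P(X > k) + U * P(X = k) with U ~ Uniform(0, 1), which is
    uniformly distributed on (0, 1) when the null hypothesis holds.
    """
    tail_above = sum(binom_pmf(j, n, p0) for j in range(k + 1, n + 1))
    return tail_above + rng.uniform() * binom_pmf(k, n, p0)

# Usage: 3 non-reference reads out of 20 at an assumed error rate of 5%.
rng = np.random.default_rng(0)
p_value = randomized_pvalue(3, 20, 0.05, rng)
```

The resulting p-values can then be passed to a false discovery rate procedure designed for continuous test statistics.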
-
An empirical Bayes mixture method for effect size and false discovery rate estimation
Authors:
Omkar Muralidharan
Abstract:
Many statistical problems involve data from thousands of parallel cases. Each case has some associated effect size, and most cases will have no effect. It is often important to estimate the effect size and the local or tail-area false discovery rate for each case. Most current methods do this separately, and most are designed for normal data. This paper uses an empirical Bayes mixture model approach to estimate both quantities together for exponential family data. The proposed method yields simple, interpretable models that can still be used nonparametrically. It can also estimate an empirical null and incorporate it fully into the model. The method outperforms existing effect size and false discovery rate estimation procedures in normal data simulations; it nearly achieves the Bayes error for effect size estimation. The method is implemented in an R package (mixfdr), freely available from CRAN.
Submitted 7 October, 2010;
originally announced October 2010.