Optimized Regression Discontinuity Designs

Abstract: The increasing popularity of regression discontinuity methods for causal inference in observational studies has led to a proliferation of different estimating strategies, most of which involve first fitting non-parametric regression models on both sides of a treatment assignment boundary and then reporting plug-in estimates for the effect of interest. In applications, however, it is often difficult to tune the non-parametric regressions in a way that is well calibrated for the specific target of inference; for example, the model with the best global in-sample fit may provide poor estimates of the discontinuity parameter. In this paper, we propose an alternative method for estimation and statistical inference in regression discontinuity designs that uses numerical convex optimization to directly obtain the finite-sample-minimax linear estimator for the regression discontinuity parameter, subject to bounds on the second derivative of the conditional response function. Given a bound on the second derivative, our proposed method is fully data-driven, and provides uniform confidence intervals for the regression discontinuity parameter with both discrete and continuous running variables. The method also naturally extends to the case of multiple running variables.

Submitted 7 June, 2018; v1 submitted 3 May, 2017; originally announced May 2017.

Comments: Review of Economics and Statistics, forthcoming

arXiv:1702.02896 [pdf, other]

Policy Learning with Observational Data

Authors: Susan Athey, Stefan Wager

Abstract: In many areas, practitioners seek to use observational data to learn a treatment assignment policy that satisfies application-specific constraints, such as budget, fairness, simplicity, or other functional form constraints. For example, policies may be restricted to take the form of decision trees based on a limited set of easily observable individual characteristics. We propose a new approach to this problem motivated by the theory of semiparametrically efficient estimation. Our method can be used to optimize either binary treatments or infinitesimal nudges to continuous treatments, and can leverage observational data where causal effects are identified using a variety of strategies, including selection on observables and instrumental variables. Given a doubly robust estimator of the causal effect of assigning everyone to treatment, we develop an algorithm for choosing whom to treat, and establish strong guarantees for the asymptotic utilitarian regret of the resulting policy.

Submitted 4 September, 2020; v1 submitted 9 February, 2017; originally announced February 2017.

Comments: Forthcoming in Econometrica. Original title: Efficient Policy Learning

arXiv:1702.01250 [pdf, ps, other]

Estimating Average Treatment Effects: Supplementary Analyses and Remaining Challenges

Authors: Susan Athey, Guido Imbens, Thai Pham, Stefan Wager

Abstract: There is a large literature on semiparametric estimation of average treatment effects under unconfounded treatment assignment in settings with a fixed number of covariates. More recently attention has focused on settings with a large number of covariates. In this paper we extend lessons from the earlier literature to this new setting. We propose that in addition to reporting point estimates and standard errors, researchers report results from a number of supplementary analyses to assist in assessing the credibility of their estimates.

Submitted 4 February, 2017; originally announced February 2017.

Comments: 9 pages

arXiv:1610.01271 [pdf, other]

Generalized Random Forests

Authors: Susan Athey, Julie Tibshirani, Stefan Wager

Abstract: We propose generalized random forests, a method for non-parametric statistical estimation based on random forests (Breiman, 2001) that can be used to fit any quantity of interest identified as the solution to a set of local moment equations. Following the literature on local maximum likelihood estimation, our method considers a weighted set of nearby training examples; however, instead of using classical kernel weighting functions that are prone to a strong curse of dimensionality, we use an adaptive weighting function derived from a forest designed to express heterogeneity in the specified quantity of interest. We propose a flexible, computationally efficient algorithm for growing generalized random forests, develop a large sample theory for our method showing that our estimates are consistent and asymptotically Gaussian, and provide an estimator for their asymptotic variance that enables valid confidence intervals. We use our approach to develop new methods for three statistical tasks: non-parametric quantile regression, conditional average partial effect estimation, and heterogeneous treatment effect estimation via instrumental variables. A software implementation, grf for R and C++, is available from CRAN.

Submitted 5 April, 2018; v1 submitted 5 October, 2016; originally announced October 2016.

Comments: Forthcoming in the Annals of Statistics

arXiv:1607.06801 [pdf, other]

doi 10.1073/pnas.1614732113

High-dimensional regression adjustments in randomized experiments

Authors: Stefan Wager, Wenfei Du, Jonathan Taylor, Robert Tibshirani

Abstract: We study the problem of treatment effect estimation in randomized experiments with high-dimensional covariate information, and show that essentially any risk-consistent regression adjustment can be used to obtain efficient estimates of the average treatment effect. Our results considerably extend the range of settings where high-dimensional regression adjustments are guaranteed to provide valid inference about the population average treatment effect. We then propose cross-estimation, a simple method for obtaining finite-sample-unbiased treatment effect estimates that leverages high-dimensional regression adjustments. Our method can be used when the regression model is estimated using the lasso, the elastic net, subset selection, etc. Finally, we extend our analysis to allow for adaptive specification search via cross-validation, and flexible non-parametric regression adjustments with machine learning methods such as random forests or neural networks.

Submitted 27 October, 2016; v1 submitted 22 July, 2016; originally announced July 2016.

Comments: To appear in the Proceedings of the National Academy of Sciences. The present draft does not reflect final copyediting by the PNAS staff

arXiv:1604.07125 [pdf, other]

Approximate Residual Balancing: De-Biased Inference of Average Treatment Effects in High Dimensions

Authors: Susan Athey, Guido W. Imbens, Stefan Wager

Abstract: There are many settings where researchers are interested in estimating average treatment effects and are willing to rely on the unconfoundedness assumption, which requires that the treatment assignment be as good as random conditional on pre-treatment variables. The unconfoundedness assumption is often more plausible if a large number of pre-treatment variables are included in the analysis, but this can worsen the performance of standard approaches to treatment effect estimation. In this paper, we develop a method for de-biasing penalized regression adjustments to allow sparse regression methods like the lasso to be used for $\sqrt{n}$-consistent inference of average treatment effects in high-dimensional linear models. Given linearity, we do not need to assume that the treatment propensities are estimable, or that the average treatment effect is a sparse contrast of the outcome model parameters. Rather, in addition to standard assumptions used to make lasso regression on the outcome model consistent under 1-norm error, we only require overlap, i.e., that the propensity score be uniformly bounded away from 0 and 1. Procedurally, our method combines balancing weights with a regularized regression adjustment.

Submitted 31 January, 2018; v1 submitted 25 April, 2016; originally announced April 2016.

Comments: Forthcoming in the Journal of the Royal Statistical Society, Series B

arXiv:1603.06340 [pdf, other]

Data Augmentation via Levy Processes

Authors: Stefan Wager, William Fithian, Percy Liang

Abstract: If a document is about travel, we may expect that short snippets of the document should also be about travel. We introduce a general framework for incorporating these types of invariances into a discriminative classifier. The framework imagines data as being drawn from a slice of a Levy process. If we slice the Levy process at an earlier point in time, we obtain additional pseudo-examples, which can be used to train the classifier. We show that this scheme has two desirable properties: it preserves the Bayes decision boundary, and it is equivalent to fitting a generative model in the limit where we rewind time back to 0. Our construction captures popular schemes such as Gaussian feature noising and dropout training, as well as admitting new generalizations.

Submitted 21 March, 2016; originally announced March 2016.

arXiv:1602.01206 [pdf, other]

denoiseR: A Package for Low Rank Matrix Estimation

Authors: Julie Josse, Sylvain Sardy, Stefan Wager

Abstract: We introduce denoiseR, an R package that provides a unified implementation of several state-of-the-art proposals for regularized low rank matrix estimation, along with automatic selection of the regularization parameters. We also extend these methods to allow for missing values. The regularization schemes discussed in this paper are built around singular-value shrinkage and bootstrap-based stability arguments. We illustrate how to use our package by applying it to several real and simulated datasets, and highlight strengths and weaknesses of the different implemented methods.

Submitted 8 August, 2018; v1 submitted 3 February, 2016; originally announced February 2016.

arXiv:1510.04342 [pdf, other]

Estimation and Inference of Heterogeneous Treatment Effects using Random Forests

Authors: Stefan Wager, Susan Athey

Abstract: Many scientific and engineering challenges -- ranging from personalized medicine to customized marketing recommendations -- require an understanding of treatment effect heterogeneity. In this paper, we develop a non-parametric causal forest for estimating heterogeneous treatment effects that extends Breiman's widely used random forest algorithm. In the potential outcomes framework with unconfoundedness, we show that causal forests are pointwise consistent for the true treatment effect, and have an asymptotically Gaussian and centered sampling distribution. We also discuss a practical method for constructing asymptotic confidence intervals for the true treatment effect that are centered at the causal forest estimates. Our theoretical results rely on a generic Gaussian theory for a large family of random forest algorithms. To our knowledge, this is the first set of results that allows any type of random forest, including classification and regression forests, to be used for provably valid statistical inference. In experiments, we find causal forests to be substantially more powerful than classical methods based on nearest-neighbor matching, especially in the presence of irrelevant covariates.

Submitted 9 July, 2017; v1 submitted 14 October, 2015; originally announced October 2015.

Comments: To appear in the Journal of the American Statistical Association. Part of the results developed in this paper were made available as an earlier technical report "Asymptotic Theory for Random Forests", available at (arXiv:1405.0352)

arXiv:1508.01278 [pdf, other]

Teaching Statistics at Google Scale

Authors: Nicholas Chamandy, Omkar Muralidharan, Stefan Wager

Abstract: Modern data and applications pose very different challenges from those of the 1950s or even the 1980s. Students contemplating a career in statistics or data science need to have the tools to tackle problems involving massive, heavy-tailed data, often interacting with live, complex systems. However, despite the deepening connections between engineering and modern data science, we argue that training in classical statistical concepts plays a central role in preparing students to solve Google-scale problems. To this end, we present three industrial applications where significant modern data challenges were overcome by statistical thinking.

Submitted 16 August, 2015; v1 submitted 6 August, 2015; originally announced August 2015.

Comments: To appear in The American Statistician

arXiv:1507.03003 [pdf, other]

High-Dimensional Asymptotics of Prediction: Ridge Regression and Classification

Authors: Edgar Dobriban, Stefan Wager

Abstract: We provide a unified analysis of the predictive risk of ridge regression and regularized discriminant analysis in a dense random effects model. We work in a high-dimensional asymptotic regime where $p, n \to \infty$ and $p/n \to \gamma \in (0, \infty)$, and allow for arbitrary covariance among the features. For both methods, we provide an explicit and efficiently computable expression for the limiting predictive risk, which depends only on the spectrum of the feature-covariance matrix, the signal strength, and the aspect ratio $\gamma$. Especially in the case of regularized discriminant analysis, we find that predictive accuracy has a nuanced dependence on the eigenvalue distribution of the covariance matrix, suggesting that analyses based on the operator norm of the covariance matrix may not be sharp. Our results also uncover several qualitative insights about both methods: for example, with ridge regression, there is an exact inverse relation between the limiting predictive risk and the limiting estimation risk given a fixed signal strength. Our analysis builds on recent advances in random matrix theory.

Submitted 4 November, 2015; v1 submitted 10 July, 2015; originally announced July 2015.

Comments: Added a section on prediction versus estimation for ridge regression. Rewrote introduction. Other results unchanged

arXiv:1507.00832 [pdf, other]

The Efficiency of Density Deconvolution

Authors: Stefan Wager

Abstract: The density deconvolution problem involves recovering a target density g from a sample that has been corrupted by noise. From the perspective of Le Cam's local asymptotic normality theory, we show that non-parametric density deconvolution with Gaussian noise behaves similarly to a low-dimensional parametric problem that can easily be solved by maximum likelihood. This framework allows us to give a simple account of the statistical efficiency of density deconvolution and to concisely describe the effect of Gaussian noise on our ability to estimate g, all while relying on classical maximum likelihood theory instead of the kernel estimators typically used to study density deconvolution.

Submitted 3 July, 2015; originally announced July 2015.

arXiv:1503.06388 [pdf, other]

Adaptive Concentration of Regression Trees, with Application to Random Forests

Authors: Stefan Wager, Guenther Walther

Abstract: We study the convergence of the predictive surface of regression trees and forests. To support our analysis we introduce a notion of adaptive concentration for regression trees. This approach breaks tree training into a model selection phase in which we pick the tree splits, followed by a model fitting phase where we find the best regression model consistent with these splits. We then show that the fitted regression tree concentrates around the optimal predictor with the same splits: as d and n get large, the discrepancy is with high probability bounded on the order of sqrt(log(d) log(n)/k) uniformly over the whole regression surface, where d is the dimension of the feature space, n is the number of training examples, and k is the minimum leaf size for each tree. We also provide rate-matching lower bounds for this adaptive concentration statement. From a practical perspective, our result enables us to prove consistency results for adaptively grown forests in high dimensions, and to carry out valid post-selection inference in the sense of Berk et al. [2013] for subgroups defined by tree leaves.

Submitted 30 April, 2016; v1 submitted 22 March, 2015; originally announced March 2015.

arXiv:1412.4182 [pdf, other]

The Statistics of Streaming Sparse Regression

Authors: Jacob Steinhardt, Stefan Wager, Percy Liang

Abstract: We present a sparse analogue to stochastic gradient descent that is guaranteed to perform well under similar conditions to the lasso. In the linear regression setup with irrepresentable noise features, our algorithm recovers the support set of the optimal parameter vector with high probability, and achieves a statistically quasi-optimal rate of convergence of Op(k log(d)/T), where k is the sparsity of the solution, d is the number of features, and T is the number of training examples. Meanwhile, our algorithm does not require any more computational resources than stochastic gradient descent. In our experiments, we find that our method substantially out-performs existing streaming algorithms on both real and simulated data.

Submitted 12 December, 2014; originally announced December 2014.

arXiv:1410.8275 [pdf, other]

Bootstrap-Based Regularization for Low-Rank Matrix Estimation

Authors: Julie Josse, Stefan Wager

Abstract: We develop a flexible framework for low-rank matrix estimation that allows us to transform noise models into regularization schemes via a simple bootstrap algorithm. Effectively, our procedure seeks an autoencoding basis for the observed matrix that is stable with respect to the specified noise model; we call the resulting procedure a stable autoencoder. In the simplest case, with an isotropic noise model, our method is equivalent to a classical singular value shrinkage estimator. For non-isotropic noise models, e.g., Poisson noise, the method does not reduce to singular value shrinkage, and instead yields new estimators that perform well in experiments. Moreover, by iterating our stable autoencoding scheme, we can automatically generate low-rank estimates without specifying the target rank as a tuning parameter.

Submitted 28 June, 2016; v1 submitted 30 October, 2014; originally announced October 2014.

Comments: To appear in the Journal of Machine Learning Research

arXiv:1407.7614 [pdf, other]

Confidence Areas for Fixed-Effects PCA

Authors: Julie Josse, Stefan Wager, François Husson

Abstract: PCA is often used to visualize data when the rows and the columns are both of interest. In such a setting there is a lack of inferential methods on the PCA output. We study the asymptotic variance of a fixed-effects model for PCA, and propose several approaches to assessing the variability of PCA estimates: a method based on a parametric bootstrap, a new cell-wise jackknife, as well as a computationally cheaper approximation to the jackknife. We visualize the confidence regions by Procrustes rotation. Using a simulation study, we compare the proposed methods and highlight the strengths and drawbacks of each method as we vary the number of rows, the number of columns, and the strength of the relationships between variables.

Submitted 28 July, 2014; originally announced July 2014.

arXiv:1407.3289 [pdf, other]

Altitude Training: Strong Bounds for Single-Layer Dropout

Authors: Stefan Wager, William Fithian, Sida Wang, Percy Liang

Abstract: Dropout training, originally designed for deep neural networks, has been successful on high-dimensional single-layer natural language tasks. This paper proposes a theoretical explanation for this phenomenon: we show that, under a generative Poisson topic model with long documents, dropout training improves the exponent in the generalization bound for empirical risk minimization. Dropout achieves this gain much like a marathon runner who practices at altitude: once a classifier learns to perform reasonably well on training examples that have been artificially corrupted by dropout, it will do very well on the uncorrupted test set. We also show that, under similar conditions, dropout preserves the Bayes decision boundary and should therefore induce minimal bias in high dimensions.

Submitted 31 October, 2014; v1 submitted 11 July, 2014; originally announced July 2014.

Comments: Advances in Neural Information Processing Systems (NIPS), 2014

arXiv:1405.0352 [pdf, other]

Asymptotic Theory for Random Forests

Authors: Stefan Wager

Abstract: Random forests have proven to be reliable predictive algorithms in many application areas. Not much is known, however, about the statistical properties of random forests. Several authors have established conditions under which their predictions are consistent, but these results do not provide practical estimates of random forest errors. In this paper, we analyze a random forest model based on subsampling, and show that random forest predictions are asymptotically normal provided that the subsample size s scales as s(n)/n = o(log(n)^{-d}), where n is the number of training examples and d is the number of features. Moreover, we show that the asymptotic variance can consistently be estimated using an infinitesimal jackknife for bagged ensembles recently proposed by Efron (2014). In other words, our results let us both characterize and estimate the error-distribution of random forest predictions, thus taking a step towards making random forests tools for statistical inference instead of just black-box predictive algorithms.

Submitted 3 May, 2016; v1 submitted 2 May, 2014; originally announced May 2014.

Comments: This manuscript is superseded by "Estimation and Inference of Heterogeneous Treatment Effects using Random Forests" by Wager and Athey (arXiv:1510.04342). The new paper extends the asymptotic theory developed here, and applies it to causal inference in the potential outcomes framework with unconfoundedness. The present version is maintained online for archival purposes only

arXiv:1311.4555 [pdf, other]

Confidence Intervals for Random Forests: The Jackknife and the Infinitesimal Jackknife

Authors: Stefan Wager, Trevor Hastie, Bradley Efron

Abstract: We study the variability of predictions made by bagged learners and random forests, and show how to estimate standard errors for these methods. Our work builds on variance estimates for bagging proposed by Efron (1992, 2012) that are based on the jackknife and the infinitesimal jackknife (IJ). In practice, bagged predictors are computed using a finite number B of bootstrap replicates, and working with a large B can be computationally expensive. Direct applications of jackknife and IJ estimators to bagging require B on the order of n^{1.5} bootstrap replicates to converge, where n is the size of the training set. We propose improved versions that only require B on the order of n replicates. Moreover, we show that the IJ estimator requires 1.7 times fewer bootstrap replicates than the jackknife to achieve a given accuracy. Finally, we study the sampling distributions of the jackknife and IJ variance estimates themselves. We illustrate our findings with multiple experiments and simulation studies.

Submitted 28 March, 2014; v1 submitted 18 November, 2013; originally announced November 2013.

Comments: To appear in Journal of Machine Learning Research (JMLR)

arXiv:1310.2931 [pdf, other]

Feedback Detection for Live Predictors

Authors: Stefan Wager, Nick Chamandy, Omkar Muralidharan, Amir Najmi

Abstract: A predictor that is deployed in a live production system may perturb the features it uses to make predictions. Such a feedback loop can occur, for example, when a model that predicts a certain type of behavior ends up causing the behavior it predicts, thus creating a self-fulfilling prophecy. In this paper we analyze predictor feedback detection as a causal inference problem, and introduce a local randomization scheme that can be used to detect non-linear feedback in real-world problems. We conduct a pilot study for our proposed methodology using a predictive system currently deployed as a part of a search engine.

Submitted 31 October, 2014; v1 submitted 10 October, 2013; originally announced October 2013.

Comments: Advances in Neural Information Processing Systems (NIPS), 2014

arXiv:1310.1363 [pdf, ps, other]

doi 10.1214/15-AOAS812

Weakly supervised clustering: Learning fine-grained signals from coarse labels

Authors: Stefan Wager, Alexander Blocker, Niall Cardin

Abstract: Consider a classification problem where we do not have access to labels for individual training examples, but only have average labels over subpopulations. We give practical examples of this setup and show how such a classification task can usefully be analyzed as a weakly supervised clustering problem. We propose three approaches to solving the weakly supervised clustering problem, including a latent variables model that performs well in our experiments. We illustrate our methods on an analysis of aggregated elections data and an industry data set that was the original motivation for this research.

Submitted 15 September, 2015; v1 submitted 4 October, 2013; originally announced October 2013.

Comments: Published at http://dx.doi.org/10.1214/15-AOAS812 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org)

Report number: IMS-AOAS-AOAS812

Journal ref: Annals of Applied Statistics 2015, Vol. 9, No. 2, 801-820

arXiv:1309.5352 [pdf, other]

Sequential Selection Procedures and False Discovery Rate Control

Authors: Max Grazier G'Sell, Stefan Wager, Alexandra Chouldechova, Robert Tibshirani

Abstract: We consider a multiple hypothesis testing setting where the hypotheses are ordered and one is only permitted to reject an initial contiguous block, H_1,\dots,H_k, of hypotheses. A rejection rule in this setting amounts to a procedure for choosing the stopping point k. This setting is inspired by the sequential nature of many model selection problems, where choosing a stopping point or a model is equivalent to rejecting all hypotheses up to that point and none thereafter. We propose two new testing procedures, and prove that they control the false discovery rate in the ordered testing setting. We also show how the methods can be applied to model selection using recent results on p-values in sequential model selection settings.

Submitted 23 March, 2015; v1 submitted 20 September, 2013; originally announced September 2013.

Comments: 31 pages, 14 figures. Accepted to the Journal of the Royal Statistical Society: Series B
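
Illustrative sketch (not taken from the listing): one stopping rule of the kind described above, usually referred to as ForwardStop, rejects the first k hypotheses for the largest k at which the running average of -log(1 - p_i) stays at or below the target level alpha. The function name and the default alpha are assumptions chosen for this example.

```python
import numpy as np

def forward_stop(p_values, alpha=0.1):
    """Return the number k of leading hypotheses to reject (0 means none).

    p_values must be given in the pre-specified order of the hypotheses.
    """
    p = np.asarray(p_values, dtype=float)
    # Transform each p-value; small p-values map to values near zero.
    y = -np.log1p(-p)
    # Running average over the first k transformed p-values, k = 1..m.
    running_mean = np.cumsum(y) / np.arange(1, p.size + 1)
    # Take the LARGEST k satisfying the bound, not the first failure.
    passing = np.nonzero(running_mean <= alpha)[0]
    return int(passing[-1]) + 1 if passing.size else 0
```

Because the rule takes the largest qualifying k, an early moderately large p-value does not necessarily stop the procedure if later p-values pull the running average back under alpha.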

arXiv:1307.7830 [pdf, other]

Semiparametric Exponential Families for Heavy-Tailed Data

Authors: William Fithian, Stefan Wager

Abstract: We propose a semiparametric method for fitting the tail of a heavy-tailed population given a relatively small sample from that population and a larger sample from a related background population. We model the tail of the small sample as an exponential tilt of the better-observed large-sample tail, using a robust sufficient statistic motivated by extreme value theory. In particular, our method induces an estimator of the small-population mean, and we give theoretical and empirical evidence that this estimator outperforms methods that do not use the background sample. We demonstrate substantial efficiency gains over competing methods in simulation and on data from a large controlled experiment conducted by Facebook.

Submitted 19 October, 2014; v1 submitted 30 July, 2013; originally announced July 2013.

Comments: To appear in Biometrika

MSC Class: 62G32; 62G35 (Primary) 62G20 (Secondary)

arXiv:1307.1493 [pdf, other]

Dropout Training as Adaptive Regularization

Authors: Stefan Wager, Sida Wang, Percy Liang

Abstract: Dropout and other feature noising schemes control overfitting by artificially corrupting the training data. For generalized linear models, dropout performs a form of adaptive regularization. Using this viewpoint, we show that the dropout regularizer is first-order equivalent to an L2 regularizer applied after scaling the features by an estimate of the inverse diagonal Fisher information matrix. We also establish a connection to AdaGrad, an online learning algorithm, and find that a close relative of AdaGrad operates by repeatedly solving linear dropout-regularized problems. By casting dropout as regularization, we develop a natural semi-supervised algorithm that uses unlabeled data to create a better adaptive regularizer. We apply this idea to document classification tasks, and show that it consistently boosts the performance of dropout training, improving on state-of-the-art results on the IMDB reviews dataset.

Submitted 1 November, 2013; v1 submitted 4 July, 2013; originally announced July 2013.

Comments: 11 pages. Advances in Neural Information Processing Systems (NIPS), 2013

arXiv:1204.0316 [pdf, other]

Subsampling Extremes: From Block Maxima to Smooth Tail Estimation

Authors: Stefan Wager

Abstract: We study a new estimator for the tail index of a distribution in the Fréchet domain of attraction that arises naturally by computing subsample maxima. This estimator is equivalent to taking a U-statistic over a Hill estimator with two order statistics. The estimator presents multiple advantages over the Hill estimator. In particular, it has asymptotically smooth sample paths as a function of the threshold k, making it considerably more stable than the Hill estimator. The estimator also admits a simple and intuitive threshold selection rule that does not require fitting a second-order model. Journal of Multivariate Analysis, 130, 2014

Submitted 19 October, 2014; v1 submitted 2 April, 2012; originally announced April 2012.

Comments: Added references
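
Illustrative sketch (not taken from the listing): the classical Hill estimator, which the abstract above uses as its baseline, averages the log-excesses of the k largest observations over the (k+1)-th largest. This is only the standard baseline, not the subsampling estimator the paper proposes; the function name and error handling are assumptions made for illustration.

```python
import numpy as np

def hill_estimator(sample, k):
    """Classical Hill estimate based on the k largest observations.

    Returns an estimate of the extreme value index, i.e. the reciprocal
    of the Pareto-type tail exponent. Requires 1 <= k < len(sample).
    """
    x = np.sort(np.asarray(sample, dtype=float))
    n = x.size
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < n")
    threshold = x[n - k - 1]  # the (k+1)-th largest observation
    if threshold <= 0:
        raise ValueError("threshold order statistic must be positive")
    # Mean log-excess of the top k order statistics over the threshold.
    return float(np.mean(np.log(x[n - k:]) - np.log(threshold)))
```

Plotting this estimate against k typically gives a jagged path, since each change in k swaps the threshold order statistic; that instability is the behavior the abstract contrasts with its smoother estimator.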

Showing 51–75 of 75 results for author: Wager, S