-
Optimized Regression Discontinuity Designs
Authors:
Guido Imbens,
Stefan Wager
Abstract:
The increasing popularity of regression discontinuity methods for causal inference in observational studies has led to a proliferation of different estimating strategies, most of which involve first fitting non-parametric regression models on both sides of a treatment assignment boundary and then reporting plug-in estimates for the effect of interest. In applications, however, it is often difficult to tune the non-parametric regressions in a way that is well calibrated for the specific target of inference; for example, the model with the best global in-sample fit may provide poor estimates of the discontinuity parameter. In this paper, we propose an alternative method for estimation and statistical inference in regression discontinuity designs that uses numerical convex optimization to directly obtain the finite-sample-minimax linear estimator for the regression discontinuity parameter, subject to bounds on the second derivative of the conditional response function. Given a bound on the second derivative, our proposed method is fully data-driven, and provides uniform confidence intervals for the regression discontinuity parameter with both discrete and continuous running variables. The method also naturally extends to the case of multiple running variables.
Submitted 7 June, 2018; v1 submitted 3 May, 2017;
originally announced May 2017.
-
Policy Learning with Observational Data
Authors:
Susan Athey,
Stefan Wager
Abstract:
In many areas, practitioners seek to use observational data to learn a treatment assignment policy that satisfies application-specific constraints, such as budget, fairness, simplicity, or other functional form constraints. For example, policies may be restricted to take the form of decision trees based on a limited set of easily observable individual characteristics. We propose a new approach to this problem motivated by the theory of semiparametrically efficient estimation. Our method can be used to optimize either binary treatments or infinitesimal nudges to continuous treatments, and can leverage observational data where causal effects are identified using a variety of strategies, including selection on observables and instrumental variables. Given a doubly robust estimator of the causal effect of assigning everyone to treatment, we develop an algorithm for choosing whom to treat, and establish strong guarantees for the asymptotic utilitarian regret of the resulting policy.
Submitted 4 September, 2020; v1 submitted 9 February, 2017;
originally announced February 2017.
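The abstract above builds policies from a doubly robust estimate of the treatment effect. A minimal sketch of that idea follows, under several assumptions that are not in the abstract: the nuisance estimates (`e_hat`, `mu0_hat`, `mu1_hat`) are taken as given, the scores use the standard augmented inverse-propensity-weighted form for a binary treatment, and the policy class is a hypothetical one-covariate threshold rule rather than the decision trees the abstract mentions. All function names are illustrative.

```python
from typing import List, Sequence, Tuple


def aipw_scores(
    y: Sequence[float],
    w: Sequence[int],
    e_hat: Sequence[float],
    mu0_hat: Sequence[float],
    mu1_hat: Sequence[float],
) -> List[float]:
    """Doubly robust (AIPW) score for each unit.

    score_i = mu1 - mu0 + w*(y - mu1)/e - (1 - w)*(y - mu0)/(1 - e)
    """
    scores = []
    for yi, wi, ei, m0, m1 in zip(y, w, e_hat, mu0_hat, mu1_hat):
        correction = wi * (yi - m1) / ei - (1 - wi) * (yi - m0) / (1 - ei)
        scores.append(m1 - m0 + correction)
    return scores


def best_threshold_policy(
    x: Sequence[float], scores: Sequence[float]
) -> Tuple[float, float]:
    """Pick c maximizing (1/n) * sum_i (2*pi(x_i) - 1) * score_i
    over the illustrative policy class pi(x) = 1 if x >= c else 0.

    Returns (c, objective value); c = inf means "treat no one".
    """
    n = len(x)
    best_c, best_total = float("inf"), -sum(scores)  # baseline: treat no one
    for c in sorted(set(x)):
        total = sum(s if xi >= c else -s for xi, s in zip(x, scores))
        if total > best_total:
            best_c, best_total = c, total
    return best_c, best_total / n


# Toy usage with hand-picked scores (not real data).
threshold, value = best_threshold_policy(
    [0.1, 0.4, 0.6, 0.9], [-1.0, -0.5, 0.5, 2.0]
)
```

The exhaustive search over thresholds is only practical for this toy policy class; richer constrained classes would need a dedicated optimizer.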
-
Estimating Average Treatment Effects: Supplementary Analyses and Remaining Challenges
Authors:
Susan Athey,
Guido Imbens,
Thai Pham,
Stefan Wager
Abstract:
There is a large literature on semiparametric estimation of average treatment effects under unconfounded treatment assignment in settings with a fixed number of covariates. More recently attention has focused on settings with a large number of covariates. In this paper we extend lessons from the earlier literature to this new setting. We propose that in addition to reporting point estimates and standard errors, researchers report results from a number of supplementary analyses to assist in assessing the credibility of their estimates.
Submitted 4 February, 2017;
originally announced February 2017.
-
Generalized Random Forests
Authors:
Susan Athey,
Julie Tibshirani,
Stefan Wager
Abstract:
We propose generalized random forests, a method for non-parametric statistical estimation based on random forests (Breiman, 2001) that can be used to fit any quantity of interest identified as the solution to a set of local moment equations. Following the literature on local maximum likelihood estimation, our method considers a weighted set of nearby training examples; however, instead of using classical kernel weighting functions that are prone to a strong curse of dimensionality, we use an adaptive weighting function derived from a forest designed to express heterogeneity in the specified quantity of interest. We propose a flexible, computationally efficient algorithm for growing generalized random forests, develop a large sample theory for our method showing that our estimates are consistent and asymptotically Gaussian, and provide an estimator for their asymptotic variance that enables valid confidence intervals. We use our approach to develop new methods for three statistical tasks: non-parametric quantile regression, conditional average partial effect estimation, and heterogeneous treatment effect estimation via instrumental variables. A software implementation, grf for R and C++, is available from CRAN.
Submitted 5 April, 2018; v1 submitted 5 October, 2016;
originally announced October 2016.
-
High-dimensional regression adjustments in randomized experiments
Authors:
Stefan Wager,
Wenfei Du,
Jonathan Taylor,
Robert Tibshirani
Abstract:
We study the problem of treatment effect estimation in randomized experiments with high-dimensional covariate information, and show that essentially any risk-consistent regression adjustment can be used to obtain efficient estimates of the average treatment effect. Our results considerably extend the range of settings where high-dimensional regression adjustments are guaranteed to provide valid inference about the population average treatment effect. We then propose cross-estimation, a simple method for obtaining finite-sample-unbiased treatment effect estimates that leverages high-dimensional regression adjustments. Our method can be used when the regression model is estimated using the lasso, the elastic net, subset selection, etc. Finally, we extend our analysis to allow for adaptive specification search via cross-validation, and flexible non-parametric regression adjustments with machine learning methods such as random forests or neural networks.
Submitted 27 October, 2016; v1 submitted 22 July, 2016;
originally announced July 2016.
-
Approximate Residual Balancing: De-Biased Inference of Average Treatment Effects in High Dimensions
Authors:
Susan Athey,
Guido W. Imbens,
Stefan Wager
Abstract:
There are many settings where researchers are interested in estimating average treatment effects and are willing to rely on the unconfoundedness assumption, which requires that the treatment assignment be as good as random conditional on pre-treatment variables. The unconfoundedness assumption is often more plausible if a large number of pre-treatment variables are included in the analysis, but this can worsen the performance of standard approaches to treatment effect estimation. In this paper, we develop a method for de-biasing penalized regression adjustments to allow sparse regression methods like the lasso to be used for $\sqrt{n}$-consistent inference of average treatment effects in high-dimensional linear models. Given linearity, we do not need to assume that the treatment propensities are estimable, or that the average treatment effect is a sparse contrast of the outcome model parameters. Rather, in addition to standard assumptions used to make lasso regression on the outcome model consistent under 1-norm error, we only require overlap, i.e., that the propensity score be uniformly bounded away from 0 and 1. Procedurally, our method combines balancing weights with a regularized regression adjustment.
Submitted 31 January, 2018; v1 submitted 25 April, 2016;
originally announced April 2016.
-
Data Augmentation via Levy Processes
Authors:
Stefan Wager,
William Fithian,
Percy Liang
Abstract:
If a document is about travel, we may expect that short snippets of the document should also be about travel. We introduce a general framework for incorporating these types of invariances into a discriminative classifier. The framework imagines data as being drawn from a slice of a Levy process. If we slice the Levy process at an earlier point in time, we obtain additional pseudo-examples, which can be used to train the classifier. We show that this scheme has two desirable properties: it preserves the Bayes decision boundary, and it is equivalent to fitting a generative model in the limit where we rewind time back to 0. Our construction captures popular schemes such as Gaussian feature noising and dropout training, as well as admitting new generalizations.
Submitted 21 March, 2016;
originally announced March 2016.
-
denoiseR: A Package for Low Rank Matrix Estimation
Authors:
Julie Josse,
Sylvain Sardy,
Stefan Wager
Abstract:
We introduce denoiseR, an R package that provides a unified implementation of several state-of-the-art proposals for regularized low rank matrix estimation, along with automatic selection of the regularization parameters. We also extend these methods to allow for missing values. The regularization schemes discussed in this paper are built around singular-value shrinkage and bootstrap-based stability arguments. We illustrate how to use our package by applying it to several real and simulated datasets, and highlight strengths and weaknesses of the different implemented methods.
Submitted 8 August, 2018; v1 submitted 3 February, 2016;
originally announced February 2016.
-
Estimation and Inference of Heterogeneous Treatment Effects using Random Forests
Authors:
Stefan Wager,
Susan Athey
Abstract:
Many scientific and engineering challenges -- ranging from personalized medicine to customized marketing recommendations -- require an understanding of treatment effect heterogeneity. In this paper, we develop a non-parametric causal forest for estimating heterogeneous treatment effects that extends Breiman's widely used random forest algorithm. In the potential outcomes framework with unconfoundedness, we show that causal forests are pointwise consistent for the true treatment effect, and have an asymptotically Gaussian and centered sampling distribution. We also discuss a practical method for constructing asymptotic confidence intervals for the true treatment effect that are centered at the causal forest estimates. Our theoretical results rely on a generic Gaussian theory for a large family of random forest algorithms. To our knowledge, this is the first set of results that allows any type of random forest, including classification and regression forests, to be used for provably valid statistical inference. In experiments, we find causal forests to be substantially more powerful than classical methods based on nearest-neighbor matching, especially in the presence of irrelevant covariates.
Submitted 9 July, 2017; v1 submitted 14 October, 2015;
originally announced October 2015.
-
Teaching Statistics at Google Scale
Authors:
Nicholas Chamandy,
Omkar Muralidharan,
Stefan Wager
Abstract:
Modern data and applications pose very different challenges from those of the 1950s or even the 1980s. Students contemplating a career in statistics or data science need to have the tools to tackle problems involving massive, heavy-tailed data, often interacting with live, complex systems. However, despite the deepening connections between engineering and modern data science, we argue that training in classical statistical concepts plays a central role in preparing students to solve Google-scale problems. To this end, we present three industrial applications where significant modern data challenges were overcome by statistical thinking.
Submitted 16 August, 2015; v1 submitted 6 August, 2015;
originally announced August 2015.
-
High-Dimensional Asymptotics of Prediction: Ridge Regression and Classification
Authors:
Edgar Dobriban,
Stefan Wager
Abstract:
We provide a unified analysis of the predictive risk of ridge regression and regularized discriminant analysis in a dense random effects model. We work in a high-dimensional asymptotic regime where $p, n \to \infty$ and $p/n \to \gamma \in (0, \infty)$, and allow for arbitrary covariance among the features. For both methods, we provide an explicit and efficiently computable expression for the limiting predictive risk, which depends only on the spectrum of the feature-covariance matrix, the signal strength, and the aspect ratio $\gamma$. Especially in the case of regularized discriminant analysis, we find that predictive accuracy has a nuanced dependence on the eigenvalue distribution of the covariance matrix, suggesting that analyses based on the operator norm of the covariance matrix may not be sharp. Our results also uncover several qualitative insights about both methods: for example, with ridge regression, there is an exact inverse relation between the limiting predictive risk and the limiting estimation risk given a fixed signal strength. Our analysis builds on recent advances in random matrix theory.
Submitted 4 November, 2015; v1 submitted 10 July, 2015;
originally announced July 2015.
-
The Efficiency of Density Deconvolution
Authors:
Stefan Wager
Abstract:
The density deconvolution problem involves recovering a target density g from a sample that has been corrupted by noise. From the perspective of Le Cam's local asymptotic normality theory, we show that non-parametric density deconvolution with Gaussian noise behaves similarly to a low-dimensional parametric problem that can easily be solved by maximum likelihood. This framework allows us to give a simple account of the statistical efficiency of density deconvolution and to concisely describe the effect of Gaussian noise on our ability to estimate g, all while relying on classical maximum likelihood theory instead of the kernel estimators typically used to study density deconvolution.
Submitted 3 July, 2015;
originally announced July 2015.
-
Adaptive Concentration of Regression Trees, with Application to Random Forests
Authors:
Stefan Wager,
Guenther Walther
Abstract:
We study the convergence of the predictive surface of regression trees and forests. To support our analysis we introduce a notion of adaptive concentration for regression trees. This approach breaks tree training into a model selection phase in which we pick the tree splits, followed by a model fitting phase where we find the best regression model consistent with these splits. We then show that the fitted regression tree concentrates around the optimal predictor with the same splits: as d and n get large, the discrepancy is with high probability bounded on the order of sqrt(log(d) log(n)/k) uniformly over the whole regression surface, where d is the dimension of the feature space, n is the number of training examples, and k is the minimum leaf size for each tree. We also provide rate-matching lower bounds for this adaptive concentration statement. From a practical perspective, our result enables us to prove consistency results for adaptively grown forests in high dimensions, and to carry out valid post-selection inference in the sense of Berk et al. [2013] for subgroups defined by tree leaves.
Submitted 30 April, 2016; v1 submitted 22 March, 2015;
originally announced March 2015.
-
The Statistics of Streaming Sparse Regression
Authors:
Jacob Steinhardt,
Stefan Wager,
Percy Liang
Abstract:
We present a sparse analogue to stochastic gradient descent that is guaranteed to perform well under similar conditions to the lasso. In the linear regression setup with irrepresentable noise features, our algorithm recovers the support set of the optimal parameter vector with high probability, and achieves a statistically quasi-optimal rate of convergence of Op(k log(d)/T), where k is the sparsity of the solution, d is the number of features, and T is the number of training examples. Meanwhile, our algorithm does not require any more computational resources than stochastic gradient descent. In our experiments, we find that our method substantially out-performs existing streaming algorithms on both real and simulated data.
Submitted 12 December, 2014;
originally announced December 2014.
-
Bootstrap-Based Regularization for Low-Rank Matrix Estimation
Authors:
Julie Josse,
Stefan Wager
Abstract:
We develop a flexible framework for low-rank matrix estimation that allows us to transform noise models into regularization schemes via a simple bootstrap algorithm. Effectively, our procedure seeks an autoencoding basis for the observed matrix that is stable with respect to the specified noise model; we call the resulting procedure a stable autoencoder. In the simplest case, with an isotropic noise model, our method is equivalent to a classical singular value shrinkage estimator. For non-isotropic noise models, e.g., Poisson noise, the method does not reduce to singular value shrinkage, and instead yields new estimators that perform well in experiments. Moreover, by iterating our stable autoencoding scheme, we can automatically generate low-rank estimates without specifying the target rank as a tuning parameter.
Submitted 28 June, 2016; v1 submitted 30 October, 2014;
originally announced October 2014.
-
Confidence Areas for Fixed-Effects PCA
Authors:
Julie Josse,
Stefan Wager,
François Husson
Abstract:
PCA is often used to visualize data when the rows and the columns are both of interest. In such a setting there is a lack of inferential methods on the PCA output. We study the asymptotic variance of a fixed-effects model for PCA, and propose several approaches to assessing the variability of PCA estimates: a method based on a parametric bootstrap, a new cell-wise jackknife, as well as a computationally cheaper approximation to the jackknife. We visualize the confidence regions by Procrustes rotation. Using a simulation study, we compare the proposed methods and highlight the strengths and drawbacks of each method as we vary the number of rows, the number of columns, and the strength of the relationships between variables.
Submitted 28 July, 2014;
originally announced July 2014.
-
Altitude Training: Strong Bounds for Single-Layer Dropout
Authors:
Stefan Wager,
William Fithian,
Sida Wang,
Percy Liang
Abstract:
Dropout training, originally designed for deep neural networks, has been successful on high-dimensional single-layer natural language tasks. This paper proposes a theoretical explanation for this phenomenon: we show that, under a generative Poisson topic model with long documents, dropout training improves the exponent in the generalization bound for empirical risk minimization. Dropout achieves this gain much like a marathon runner who practices at altitude: once a classifier learns to perform reasonably well on training examples that have been artificially corrupted by dropout, it will do very well on the uncorrupted test set. We also show that, under similar conditions, dropout preserves the Bayes decision boundary and should therefore induce minimal bias in high dimensions.
Submitted 31 October, 2014; v1 submitted 11 July, 2014;
originally announced July 2014.
-
Asymptotic Theory for Random Forests
Authors:
Stefan Wager
Abstract:
Random forests have proven to be reliable predictive algorithms in many application areas. Not much is known, however, about the statistical properties of random forests. Several authors have established conditions under which their predictions are consistent, but these results do not provide practical estimates of random forest errors. In this paper, we analyze a random forest model based on subsampling, and show that random forest predictions are asymptotically normal provided that the subsample size s scales as s(n)/n = o(log(n)^{-d}), where n is the number of training examples and d is the number of features. Moreover, we show that the asymptotic variance can consistently be estimated using an infinitesimal jackknife for bagged ensembles recently proposed by Efron (2014). In other words, our results let us both characterize and estimate the error-distribution of random forest predictions, thus taking a step towards making random forests tools for statistical inference instead of just black-box predictive algorithms.
Submitted 3 May, 2016; v1 submitted 2 May, 2014;
originally announced May 2014.
-
Confidence Intervals for Random Forests: The Jackknife and the Infinitesimal Jackknife
Authors:
Stefan Wager,
Trevor Hastie,
Bradley Efron
Abstract:
We study the variability of predictions made by bagged learners and random forests, and show how to estimate standard errors for these methods. Our work builds on variance estimates for bagging proposed by Efron (1992, 2012) that are based on the jackknife and the infinitesimal jackknife (IJ). In practice, bagged predictors are computed using a finite number B of bootstrap replicates, and working with a large B can be computationally expensive. Direct applications of jackknife and IJ estimators to bagging require B on the order of n^{1.5} bootstrap replicates to converge, where n is the size of the training set. We propose improved versions that only require B on the order of n replicates. Moreover, we show that the IJ estimator requires 1.7 times fewer bootstrap replicates than the jackknife to achieve a given accuracy. Finally, we study the sampling distributions of the jackknife and IJ variance estimates themselves. We illustrate our findings with multiple experiments and simulation studies.
Submitted 28 March, 2014; v1 submitted 18 November, 2013;
originally announced November 2013.
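To make the infinitesimal jackknife idea in the abstract above concrete, here is a minimal sketch of the uncorrected IJ variance estimate for a bagged predictor, assuming the usual form in which the estimate is the sum over training points of the squared covariance between each point's bootstrap count and the replicate predictions. It does not implement the improved, bias-corrected versions the abstract refers to. The function name and the use of the sample mean as base learner are illustrative assumptions.

```python
import random
from statistics import fmean
from typing import Callable, List, Sequence, Tuple


def bagged_ij_variance(
    data: Sequence[float],
    learner: Callable[[List[float]], float],
    B: int,
    seed: int = 0,
) -> Tuple[float, float]:
    """Return (bagged prediction, uncorrected IJ variance estimate).

    V_IJ = sum_i cov_i^2, with
    cov_i = (1/B) * sum_b (N_bi - 1) * (t_b - t_bar),
    where N_bi counts how often point i appears in bootstrap sample b
    and t_b is the learner's prediction on that sample.
    """
    rng = random.Random(seed)
    n = len(data)
    counts: List[List[int]] = []
    preds: List[float] = []
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap resample
        c = [0] * n
        for i in idx:
            c[i] += 1
        counts.append(c)
        preds.append(learner([data[i] for i in idx]))

    t_bar = fmean(preds)
    v_ij = 0.0
    for i in range(n):
        cov_i = fmean(
            (counts[b][i] - 1) * (preds[b] - t_bar) for b in range(B)
        )
        v_ij += cov_i ** 2
    return t_bar, v_ij


# Toy usage: bag the sample mean of ten numbers (illustrative data only).
toy = [float(v) for v in range(1, 11)]
bagged, variance = bagged_ij_variance(toy, fmean, B=200, seed=0)
```

With a finite number of replicates this raw estimate carries Monte Carlo noise, which is the issue the improved estimators in the paper are meant to address.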
-
Feedback Detection for Live Predictors
Authors:
Stefan Wager,
Nick Chamandy,
Omkar Muralidharan,
Amir Najmi
Abstract:
A predictor that is deployed in a live production system may perturb the features it uses to make predictions. Such a feedback loop can occur, for example, when a model that predicts a certain type of behavior ends up causing the behavior it predicts, thus creating a self-fulfilling prophecy. In this paper we analyze predictor feedback detection as a causal inference problem, and introduce a local randomization scheme that can be used to detect non-linear feedback in real-world problems. We conduct a pilot study for our proposed methodology using a predictive system currently deployed as a part of a search engine.
Submitted 31 October, 2014; v1 submitted 10 October, 2013;
originally announced October 2013.
-
Weakly supervised clustering: Learning fine-grained signals from coarse labels
Authors:
Stefan Wager,
Alexander Blocker,
Niall Cardin
Abstract:
Consider a classification problem where we do not have access to labels for individual training examples, but only have average labels over subpopulations. We give practical examples of this setup and show how such a classification task can usefully be analyzed as a weakly supervised clustering problem. We propose three approaches to solving the weakly supervised clustering problem, including a latent variables model that performs well in our experiments. We illustrate our methods on an analysis of aggregated elections data and an industry data set that was the original motivation for this research.
Submitted 15 September, 2015; v1 submitted 4 October, 2013;
originally announced October 2013.
-
Sequential Selection Procedures and False Discovery Rate Control
Authors:
Max Grazier G'Sell,
Stefan Wager,
Alexandra Chouldechova,
Robert Tibshirani
Abstract:
We consider a multiple hypothesis testing setting where the hypotheses are ordered and one is only permitted to reject an initial contiguous block, H_1,\dots,H_k, of hypotheses. A rejection rule in this setting amounts to a procedure for choosing the stopping point k. This setting is inspired by the sequential nature of many model selection problems, where choosing a stopping point or a model is equivalent to rejecting all hypotheses up to that point and none thereafter. We propose two new testing procedures, and prove that they control the false discovery rate in the ordered testing setting. We also show how the methods can be applied to model selection using recent results on p-values in sequential model selection settings.
Submitted 23 March, 2015; v1 submitted 20 September, 2013;
originally announced September 2013.
-
Semiparametric Exponential Families for Heavy-Tailed Data
Authors:
William Fithian,
Stefan Wager
Abstract:
We propose a semiparametric method for fitting the tail of a heavy-tailed population given a relatively small sample from that population and a larger sample from a related background population. We model the tail of the small sample as an exponential tilt of the better-observed large-sample tail, using a robust sufficient statistic motivated by extreme value theory. In particular, our method induces an estimator of the small-population mean, and we give theoretical and empirical evidence that this estimator outperforms methods that do not use the background sample. We demonstrate substantial efficiency gains over competing methods in simulation and on data from a large controlled experiment conducted by Facebook.
Submitted 19 October, 2014; v1 submitted 30 July, 2013;
originally announced July 2013.
-
Dropout Training as Adaptive Regularization
Authors:
Stefan Wager,
Sida Wang,
Percy Liang
Abstract:
Dropout and other feature noising schemes control overfitting by artificially corrupting the training data. For generalized linear models, dropout performs a form of adaptive regularization. Using this viewpoint, we show that the dropout regularizer is first-order equivalent to an L2 regularizer applied after scaling the features by an estimate of the inverse diagonal Fisher information matrix. We also establish a connection to AdaGrad, an online learning algorithm, and find that a close relative of AdaGrad operates by repeatedly solving linear dropout-regularized problems. By casting dropout as regularization, we develop a natural semi-supervised algorithm that uses unlabeled data to create a better adaptive regularizer. We apply this idea to document classification tasks, and show that it consistently boosts the performance of dropout training, improving on state-of-the-art results on the IMDB reviews dataset.
Submitted 1 November, 2013; v1 submitted 4 July, 2013;
originally announced July 2013.
-
Subsampling Extremes: From Block Maxima to Smooth Tail Estimation
Authors:
Stefan Wager
Abstract:
We study a new estimator for the tail index of a distribution in the Fréchet domain of attraction that arises naturally by computing subsample maxima. This estimator is equivalent to taking a U-statistic over a Hill estimator with two order statistics. The estimator presents multiple advantages over the Hill estimator. In particular, it has asymptotically smooth sample paths as a function of the threshold k, making it considerably more stable than the Hill estimator. The estimator also admits a simple and intuitive threshold selection rule that does not require fitting a second-order model. Journal of Multivariate Analysis, 130, 2014.
Submitted 19 October, 2014; v1 submitted 2 April, 2012;
originally announced April 2012.