-
Decreasing the human coding burden in randomized trials with text-based outcomes via model-assisted impact analysis
Authors:
Reagan Mozer,
Luke Miratrix
Abstract:
For randomized trials that use text as an outcome, traditional approaches for assessing treatment impact require that each document first be manually coded for constructs of interest by trained human raters. This process, the current standard, is both time-consuming and limiting: even the largest human coding efforts are typically constrained to measure only a small set of dimensions across a subsample of available texts. In this work, we present an inferential framework that can be used to increase the power of an impact assessment, given a fixed human-coding budget, by taking advantage of any ``untapped'' observations -- those documents not manually scored due to time or resource constraints -- as a supplementary resource. Our approach, a methodological combination of causal inference, survey sampling methods, and machine learning, has four steps: (1) select and code a sample of documents; (2) build a machine learning model to predict the human-coded outcomes from a set of automatically extracted text features; (3) generate machine-predicted scores for all documents and use these scores to estimate treatment impacts; and (4) adjust the final impact estimates using the residual differences between human-coded and machine-predicted outcomes. As an extension to this approach, we also develop a strategy for identifying an optimal subset of documents to code in Step 1 in order to further enhance precision. Through an extensive simulation study based on data from a recent field trial in education, we show that our proposed approach can be used to reduce the scope of a human-coding effort while maintaining nominal power to detect a significant treatment impact.
Submitted 24 September, 2023;
originally announced September 2023.
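Steps (3) and (4) of the abstract above can be illustrated with a generic difference estimator from survey sampling. This is a minimal sketch, not the authors' estimator: it assumes the coded documents are a simple random sample within each arm, and the names `model_assisted_impact`, `predicted`, `treated`, and `coded` are hypothetical.

```python
from statistics import mean


def model_assisted_impact(predicted, treated, coded):
    """Difference-estimator sketch of a model-assisted impact estimate.

    predicted: machine-predicted score for every document (step 3)
    treated:   True/False treatment indicator for every document
    coded:     dict mapping document index -> human-coded score,
               available only for the coded subsample (step 1)

    Assumption (not from the abstract): the coded documents are a simple
    random sample within each arm, so the mean residual in the sample
    estimates the mean prediction error in that arm.
    """

    def arm_estimate(arm):
        idx = [i for i, t in enumerate(treated) if t == arm]
        # Mean machine-predicted score over ALL documents in this arm.
        pred_mean = mean(predicted[i] for i in idx)
        # Step 4: correct with the mean human-minus-machine residual
        # observed on the coded documents in this arm.
        residuals = [coded[i] - predicted[i] for i in idx if i in coded]
        return pred_mean + mean(residuals)

    return arm_estimate(True) - arm_estimate(False)


# Tiny illustrative input: four documents, two of them human-coded.
example = model_assisted_impact(
    predicted=[1.0, 2.0, 3.0, 4.0],
    treated=[True, True, False, False],
    coded={0: 2.0, 2: 3.0},
)
```

The residual correction is what keeps the estimate anchored to the human codes even when the predictive model is poor; a better model only shrinks the residual variance.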
-
Empirical Bayes Double Shrinkage for Combining Biased and Unbiased Causal Estimates
Authors:
Evan T. R. Rosenman,
Francesca Dominici,
Luke Miratrix
Abstract:
Motivated by the proliferation of observational datasets and the need to integrate non-randomized evidence with randomized controlled trials, causal inference researchers have recently proposed several new methodologies for combining biased and unbiased estimators. We contribute to this growing literature by developing a new class of estimators for the data-combination problem: double-shrinkage estimators. Double-shrinkers first compute a data-driven convex combination of the biased and unbiased estimators, and then apply a final, Stein-like shrinkage toward zero. Such estimators do not require hyperparameter tuning, and are targeted at multidimensional causal estimands, such as vectors of conditional average treatment effects (CATEs). We derive several workable versions of double-shrinkage estimators and propose a method for constructing valid Empirical Bayes confidence intervals. We also demonstrate the utility of our estimators using simulations on data from the Women's Health Initiative.
Submitted 13 September, 2023;
originally announced September 2023.
-
Improving the Estimation of Site-Specific Effects and their Distribution in Multisite Trials
Authors:
JoonHo Lee,
Jonathan Che,
Sophia Rabe-Hesketh,
Avi Feller,
Luke Miratrix
Abstract:
In multisite trials, researchers are often interested in several inferential goals: estimating treatment effects for each site, ranking these effects, and studying their distribution. This study seeks to identify optimal methods for estimating these targets. Through a comprehensive simulation study, we assess two strategies and their combined effects: semiparametric modeling of the prior distribution, and alternative posterior summary methods tailored to minimize specific loss functions. Our findings highlight that the success of different estimation strategies depends largely on the amount of within-site and between-site information available from the data. We discuss how our results can guide balancing the trade-offs associated with shrinkage in limited data environments.
Submitted 1 April, 2024; v1 submitted 13 August, 2023;
originally announced August 2023.
-
Leveraging text data for causal inference using electronic health records
Authors:
Reagan Mozer,
Aaron R. Kaufman,
Leo A. Celi,
Luke Miratrix
Abstract:
In studies that rely on data from electronic health records (EHRs), unstructured text data such as clinical progress notes offer a rich source of information about patient characteristics and care that may be missing from structured data. Despite the prevalence of text in clinical research, these data are often ignored for the purposes of quantitative analysis due to their complexity. This paper presents a unified framework for leveraging text data to support causal inference with electronic health data at multiple stages of analysis. In particular, we consider how natural language processing and statistical text analysis can be combined with standard inferential techniques to address common challenges due to missing data, confounding bias, and treatment effect heterogeneity. Through an application to a recent EHR study investigating the effects of a non-randomized medical intervention on patient outcomes, we show how incorporating text data in a traditional matching analysis can help strengthen the validity of an estimated treatment effect and identify patient subgroups that may benefit most from treatment. We believe these methods have the potential to expand the scope of secondary analysis of clinical data to domains where structured EHR data is limited, such as in developing countries. To this end, we provide code and open-source replication materials to encourage adoption and broader exploration of these techniques in clinical research.
Submitted 20 May, 2024; v1 submitted 9 June, 2023;
originally announced July 2023.
-
Improving instrumental variable estimators with post-stratification
Authors:
Nicole E. Pashley,
Luke Keele,
Luke W. Miratrix
Abstract:
Experiments studying get-out-the-vote (GOTV) efforts estimate the causal effect of various mobilization efforts on voter turnout. However, there is often substantial noncompliance in these studies. A usual approach is to use an instrumental variable (IV) analysis to estimate impacts for compliers, here being those actually contacted by the investigators. Unfortunately, popular IV estimators can be unstable in studies with a small fraction of compliers. We explore post-stratifying the data (e.g., taking a weighted average of IV estimates within each stratum) using variables that predict complier status (and, potentially, the outcome) to mitigate this. We present the benefits of post-stratification in terms of bias, variance, and improved standard error estimates, and provide a finite-sample asymptotic variance formula. We also compare the performance of different IV approaches and discuss the advantages of our design-based post-stratification approach over incorporating compliance-predictive covariates into the two-stage least squares estimator. In the end, we show that covariates predictive of compliance can increase precision, but only if one is willing to make a bias-variance trade-off by down-weighting or dropping strata with few compliers. By contrast, standard approaches such as two-stage least squares fail to use such information. We finally examine the benefits of our approach in two GOTV applications.
Submitted 28 June, 2024; v1 submitted 17 March, 2023;
originally announced March 2023.
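The "weighted average of IV estimates within each stratum" mentioned in the abstract above can be sketched as follows. The abstract does not state the weights, so this sketch assumes each stratum is weighted by its estimated number of compliers (stratum size times estimated compliance rate); the function and argument names are hypothetical, and the down-weighting or dropping of strata with few compliers discussed in the abstract is not implemented.

```python
from collections import defaultdict
from statistics import mean


def post_stratified_iv(strata, z, d, y):
    """Post-stratified Wald (IV) estimate of the complier average effect.

    strata: stratum label per unit
    z:      1 if assigned to treatment, 0 otherwise
    d:      1 if treatment actually received, 0 otherwise
    y:      outcome per unit

    Assumption (not from the abstract): stratum IV estimates are weighted
    by n_s * (estimated compliance rate in stratum s), which makes the
    weighted average equal to sum(n_s * ITT_y) / sum(n_s * ITT_d).
    """
    groups = defaultdict(list)
    for i, s in enumerate(strata):
        groups[s].append(i)

    numerator = 0.0
    denominator = 0.0
    for idx in groups.values():
        assigned = [i for i in idx if z[i] == 1]
        control = [i for i in idx if z[i] == 0]
        # Intention-to-treat effects on the outcome and on treatment receipt.
        itt_y = mean(y[i] for i in assigned) - mean(y[i] for i in control)
        itt_d = mean(d[i] for i in assigned) - mean(d[i] for i in control)
        numerator += len(idx) * itt_y
        denominator += len(idx) * itt_d
    return numerator / denominator


# Two strata of four units each; both have a stratum-level IV estimate of 2.
estimate = post_stratified_iv(
    strata=["A"] * 4 + ["B"] * 4,
    z=[1, 1, 0, 0, 1, 1, 0, 0],
    d=[1, 0, 0, 0, 1, 1, 0, 0],
    y=[3, 1, 1, 1, 4, 4, 2, 2],
)
```

Every stratum needs both assigned and control units for the within-stratum means to be defined.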
-
Benefits and costs of matching prior to a Difference in Differences analysis when parallel trends does not hold
Authors:
Dae Woong Ham,
Luke Miratrix
Abstract:
The Difference in Difference (DiD) estimator is a popular estimator built on the "parallel trends" assumption, which is an assertion that the treatment group, absent treatment, would change "similarly" to the control group over time. To bolster such a claim, one might generate a comparison group, via matching, that is similar to the treated group with respect to pre-treatment outcomes and/or pre-treatment covariates. Unfortunately, as has been previously pointed out, this intuitively appealing approach also has a cost in terms of bias. To assess the trade-offs of matching in our application, we first characterize the bias of matching prior to a DiD analysis under a linear structural model that allows for time-invariant observed and unobserved confounders with time-varying effects on the outcome. Given our framework, we verify that matching on baseline covariates generally reduces bias. We further show how additionally matching on pre-treatment outcomes has both cost and benefit. First, matching on pre-treatment outcomes partially balances unobserved confounders, which mitigates some bias. This reduction is proportional to the outcome's reliability, a measure of how coupled the outcomes are with the latent covariates. Offsetting these gains, matching also injects bias into the final estimate by undermining the second difference in the DiD via a regression-to-the-mean effect. Consequently, we provide heuristic guidelines for determining to what degree the bias reduction of matching is likely to outweigh the bias cost. We illustrate our guidelines by reanalyzing a principal turnover study that used matching prior to a DiD analysis and find that matching on both the pre-treatment outcomes and observed covariates makes the estimated treatment effect more credible.
Submitted 7 February, 2024; v1 submitted 17 May, 2022;
originally announced May 2022.
-
Designing Experiments Toward Shrinkage Estimation
Authors:
Evan T. R. Rosenman,
Luke Miratrix
Abstract:
We consider how increasingly available observational data can be used to improve the design of randomized controlled trials (RCTs). We seek to design a prospective RCT, with the intent of using an Empirical Bayes estimator to shrink the causal estimates from our trial toward causal estimates obtained from an observational study. We ask: how might we design the experiment to better complement the observational study in this setting?
We propose using an estimator that shrinks each component of the RCT causal estimator toward its observational counterpart by a factor proportional to its variance. First, we show that the risk of this estimator can be computed efficiently via numerical integration. We then propose algorithms for determining the best allocation of units to strata (the best "design"). We consider three options: Neyman allocation; a "naive" design assuming no unmeasured confounding in the observational study; and a "defensive" design accounting for the imperfect parameter estimates we would obtain from the observational study with unmeasured confounding.
We also incorporate results from sensitivity analysis to establish guardrails on the designs, so that our experiment could be reasonably analyzed with and without shrinkage. We demonstrate the superiority of these experimental designs with a simulation study involving causal inference on a rare, binary outcome.
Submitted 13 April, 2022;
originally announced April 2022.
-
Power Under Multiplicity Project (PUMP): Estimating Power, Minimum Detectable Effect Size, and Sample Size When Adjusting for Multiple Outcomes in Multi-level Experiments
Authors:
Kristen Hunter,
Luke Miratrix,
Kristin Porter
Abstract:
For randomized controlled trials (RCTs) with a single intervention being measured on multiple outcomes, researchers often apply a multiple testing procedure (such as Bonferroni or Benjamini-Hochberg) to adjust $p$-values. Such an adjustment reduces the likelihood of spurious findings, but also changes the statistical power, sometimes substantially, which reduces the probability of detecting effects when they do exist. However, this consideration is frequently ignored in typical power analyses, as existing tools do not easily accommodate the use of multiple testing procedures. We introduce the PUMP R package as a tool for analysts to estimate statistical power, minimum detectable effect size, and sample size requirements for multi-level RCTs with multiple outcomes. Multiple outcomes are accounted for in two ways. First, power estimates from PUMP properly account for the adjustment in $p$-values from applying a multiple testing procedure. Second, as researchers change their focus from one outcome to multiple outcomes, different definitions of statistical power emerge. PUMP allows researchers to consider a variety of definitions of power, as some may be more appropriate for the goals of their study. The package estimates power for frequentist multi-level mixed effects models, and supports a variety of commonly-used RCT designs and models and multiple testing procedures. In addition to the main functionality of estimating power, minimum detectable effect size, and sample size requirements, the package allows the user to easily explore sensitivity of these quantities to changes in underlying assumptions.
Submitted 15 May, 2023; v1 submitted 30 December, 2021;
originally announced December 2021.
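The two multiple testing procedures named in the abstract above, Bonferroni and Benjamini-Hochberg, adjust $p$-values as sketched below. This illustrates the standard textbook adjustments only; it is not the interface of the PUMP R package, and the function names are hypothetical.

```python
def bonferroni(p_values):
    """Bonferroni adjustment: multiply each p-value by the number of tests."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, scaling by m / rank and
    # enforcing monotonicity with a running minimum.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.20]
bonf = bonferroni(raw)
bh = benjamini_hochberg(raw)
```

With these four raw values, three fall below 0.05 before adjustment, but only one does after either adjustment, which is the loss of power the abstract describes.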
-
Leveraging Population Outcomes to Improve the Generalization of Experimental Results
Authors:
Melody Huang,
Naoki Egami,
Erin Hartman,
Luke Miratrix
Abstract:
Generalizing causal estimates in randomized experiments to a broader target population is essential for guiding decisions by policymakers and practitioners in the social and biomedical sciences. While recent papers developed various weighting estimators for the population average treatment effect (PATE), many of these methods result in large variance because the experimental sample often differs substantially from the target population, and estimated sampling weights are extreme. To improve efficiency in practice, we propose post-residualized weighting in which we use the outcome measured in the observational population data to build a flexible predictive model (e.g., machine learning methods) and residualize the outcome in the experimental data before using conventional weighting methods. We show that the proposed PATE estimator is consistent under the same assumptions required for existing weighting methods, importantly without assuming the correct specification of the predictive model. We demonstrate the efficiency gains from this approach through simulations and our application based on a set of job training experiments.
Submitted 1 November, 2021;
originally announced November 2021.
-
Precise Unbiased Estimation in Randomized Experiments using Auxiliary Observational Data
Authors:
Johann A. Gagnon-Bartsch,
Adam C. Sales,
Edward Wu,
Anthony F. Botelho,
John A. Erickson,
Luke W. Miratrix,
Neil T. Heffernan
Abstract:
Randomized controlled trials (RCTs) are increasingly prevalent in education research, and are often regarded as a gold standard of causal inference. Two main virtues of randomized experiments are that they (1) do not suffer from confounding, thereby allowing for an unbiased estimate of an intervention's causal impact, and (2) allow for design-based inference, meaning that the physical act of randomization largely justifies the statistical assumptions made. However, RCT sample sizes are often small, leading to low precision; in many cases RCT estimates may be too imprecise to guide policy or inform science. Observational studies, by contrast, have strengths and weaknesses complementary to those of RCTs. Observational studies typically offer much larger sample sizes, but may suffer confounding. In many contexts, experimental and observational data exist side by side, allowing the possibility of integrating "big observational data" with "small but high-quality experimental data" to get the best of both. Such approaches hold particular promise in the field of education, where RCT sample sizes are often small due to cost constraints, but automatic collection of observational data, such as in computerized educational technology applications, or in state longitudinal data systems (SLDS) with administrative data on hundreds of thousands of students, has made rich, high-dimensional observational data widely available. We outline an approach that allows one to employ machine learning algorithms to learn from the observational data, and use the resulting models to improve precision in randomized experiments. Importantly, there is no requirement that the machine learning models are "correct" in any sense, and the final experimental results are guaranteed to be exactly unbiased. Thus, there is no danger of confounding biases in the observational data leaking into the experiment.
Submitted 19 May, 2023; v1 submitted 7 May, 2021;
originally announced May 2021.
-
Is it who you are or where you are? Accounting for compositional differences in cross-site treatment variation
Authors:
Benjamin Lu,
Eli Ben-Michael,
Avi Feller,
Luke Miratrix
Abstract:
Multisite trials, in which treatment is randomized separately in multiple sites, offer a unique opportunity to disentangle treatment effect variation due to "compositional" differences in the distributions of unit-level features from variation due to "contextual" differences in site-level features. In particular, if we can re-weight (or "transport") each site to have a common distribution of unit-level covariates, the remaining effect variation captures contextual differences across sites. In this paper, we develop a framework for transporting effects in multisite trials using approximate balancing weights, where the weights are chosen to directly optimize unit-level covariate balance between each site and the target distribution. We first develop our approach for the general setting of transporting the effect of a single-site trial. We then extend our method to multisite trials, assess its performance via simulation, and use it to analyze a series of multisite trials of welfare-to-work programs. Our method is available in the balancer R package.
Submitted 26 March, 2021;
originally announced March 2021.
-
Randomization Inference beyond the Sharp Null: Bounded Null Hypotheses and Quantiles of Individual Treatment Effects
Authors:
Devin Caughey,
Allan Dafoe,
Xinran Li,
Luke Miratrix
Abstract:
Randomization inference (RI) is typically interpreted as testing Fisher's "sharp" null hypothesis that all unit-level effects are exactly zero. This hypothesis is often criticized as restrictive and implausible, making its rejection scientifically uninteresting. We show, however, that many randomization tests are also valid for a "bounded" null hypothesis under which the unit-level effects are all non-positive (or all non-negative) but are otherwise heterogeneous. In addition to being more plausible a priori, bounded nulls are closely related to substantively important concepts such as monotonicity and Pareto efficiency. Reinterpreting RI in this way also dramatically expands the range of inferences possible in this framework. We show that exact confidence intervals for the maximum (or minimum) unit-level effect can be obtained by inverting tests for a sequence of bounded nulls. We also generalize RI to cover inference for quantiles of the individual effect distribution as well as for the proportion of individual effects larger (or smaller) than a given threshold. The proposed confidence intervals for all effect quantiles are simultaneously valid, in the sense that no correction for multiple analyses is required, and are thus a "free lunch" added to conventional RI. In sum, our reinterpretation and generalization provide a broader justification for randomization tests and a basis for exact nonparametric inference for effect quantiles. We illustrate our methods with simulations and applications, finding that Stephenson rank statistics can provide more informative results than the more common Wilcoxon rank or difference-in-means statistics. We also provide an R package RIQITE implementing the proposed approach.
Submitted 28 August, 2023; v1 submitted 22 January, 2021;
originally announced January 2021.
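For readers unfamiliar with the baseline procedure the abstract above reinterprets, a conventional one-sided randomization test can be sketched as follows. This is a minimal sketch that assumes a completely randomized design and the difference-in-means statistic, with exact enumeration suitable only for small samples; it does not implement the paper's quantile inference, Stephenson rank statistics, or the RIQITE package, and the function names are hypothetical.

```python
from itertools import combinations
from statistics import mean


def diff_in_means(y, treated_idx):
    """Mean outcome of the treated units minus mean outcome of the rest."""
    treated = set(treated_idx)
    y1 = [v for i, v in enumerate(y) if i in treated]
    y0 = [v for i, v in enumerate(y) if i not in treated]
    return mean(y1) - mean(y0)


def randomization_p_value(y, z):
    """One-sided exact p-value under complete randomization.

    Enumerates every assignment with the same number of treated units and
    reports the share whose statistic is at least as large as the observed
    one. Under the sharp null, outcomes do not change with assignment, so
    the observed y can be reused for every hypothetical assignment.
    """
    observed_idx = [i for i, t in enumerate(z) if t == 1]
    observed = diff_in_means(y, observed_idx)
    n_treated = len(observed_idx)
    draws = [
        diff_in_means(y, idx)
        for idx in combinations(range(len(y)), n_treated)
    ]
    # Small tolerance guards against floating-point ties.
    return sum(d >= observed - 1e-12 for d in draws) / len(draws)


p = randomization_p_value(y=[5, 6, 7, 1, 2, 3], z=[1, 1, 1, 0, 0, 0])
```

With six units and three treated there are 20 possible assignments, so the smallest attainable p-value is 1/20.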
-
Block what you can, except when you shouldn't
Authors:
Nicole E. Pashley,
Luke W. Miratrix
Abstract:
Several branches of the potential outcome causal inference literature have discussed the merits of blocking versus complete randomization. Some have concluded it can never hurt the precision of estimates, and some have concluded it can hurt. In this paper, we reconcile these apparently conflicting views, give a more thorough discussion of what guarantees no harm, and discuss how other aspects of a blocked design can cost, all in terms of precision. We discuss how the different findings are due to different sampling models and assumptions of how the blocks were formed. We also connect these ideas to common misconceptions, for instance showing that analyzing a blocked experiment as if it were completely randomized, a seemingly conservative method, can actually backfire in some cases. Overall, we find that blocking can have a price, but that this price is usually small and the potential for gain can be large. It is hard to go too far wrong with blocking.
Submitted 27 May, 2021; v1 submitted 26 October, 2020;
originally announced October 2020.
-
Conditional As-If Analyses in Randomized Experiments
Authors:
Nicole E. Pashley,
Guillaume W. Basse,
Luke W. Miratrix
Abstract:
The injunction to `analyze the way you randomize' is well-known to statisticians since Fisher advocated for randomization as the basis of inference. Yet even those convinced by the merits of randomization-based inference seldom follow this injunction to the letter. Bernoulli randomized experiments are often analyzed as completely randomized experiments, and completely randomized experiments are analyzed as if they had been stratified; more generally, it is not uncommon to analyze an experiment as if it had been randomized differently. This paper examines the theoretical foundation behind this practice within a randomization-based framework. Specifically, we ask when is it legitimate to analyze an experiment randomized according to one design as if it had been randomized according to some other design. We show that a sufficient condition for this type of analysis to be valid is that the design used for analysis be derived from the original design by an appropriate form of conditioning. We use our theory to justify certain existing methods, question others, and finally suggest new methodological insights such as conditioning on approximate covariate balance.
Submitted 19 August, 2021; v1 submitted 3 August, 2020;
originally announced August 2020.
-
Hospital Quality Risk Standardization via Approximate Balancing Weights
Authors:
Luke Keele,
Eli Ben-Michael,
Avi Feller,
Rachel Kelz,
Luke Miratrix
Abstract:
Comparing outcomes across hospitals, often to identify underperforming hospitals, is a critical task in health services research. However, naive comparisons of average outcomes, such as surgery complication rates, can be misleading because hospital case mixes differ -- a hospital's overall complication rate may be lower due to more effective treatments or simply because the hospital serves a healthier population overall. In this paper, we develop a method of ``direct standardization'' where we re-weight each hospital patient population to be representative of the overall population and then compare the weighted averages across hospitals. Adapting methods from survey sampling and causal inference, we find weights that directly control for imbalance between the hospital patient mix and the target population, even across many patient attributes. Critically, these balancing weights can also be tuned to preserve sample size for more precise estimates. We also derive principled measures of statistical precision, and use outcome modeling and Bayesian shrinkage to increase precision and account for variation in hospital size. We demonstrate these methods using claims data from Pennsylvania, Florida, and New York, estimating standardized hospital complication rates for general surgery patients. We conclude with a discussion of how to detect low performing hospitals.
Submitted 15 February, 2021; v1 submitted 17 July, 2020;
originally announced July 2020.
-
Using Simulation to Analyze Interrupted Time Series Designs
Authors:
Luke Miratrix
Abstract:
We are sometimes forced to use the Interrupted Time Series (ITS) design as an identification strategy for potential policy change, such as when we only have a single treated unit and no comparable controls. For example, with recent county- and state-wide criminal justice reform efforts, where judicial bodies have changed bail setting practices for everyone in their jurisdiction in order to reduce rates of pre-trial detention while maintaining court order and public safety, we have no natural comparison group other than the past. In these contexts, it is imperative to model pre-policy trends with a light touch, allowing for structures such as autoregressive departures from any pre-existing trend, in order to accurately and realistically assess the statistical uncertainty of our projections (beyond the stringent assumptions necessary for the subsequent causal inferences). To tackle this problem we provide a methodological approach rooted in commonly understood and used modeling approaches that better captures uncertainty. We quantify uncertainty with simulation, generating a distribution of plausible counterfactual trajectories to compare to the observed; this approach naturally allows for incorporating seasonality and other time varying covariates, and provides confidence intervals along with point estimates for the potential impacts of policy change. We find simulation provides a natural framework to capture and show uncertainty in the ITS designs. It also allows for easy extensions such as nonparametric smoothing in order to handle multiple post-policy time points or more structural models to account for seasonality.
Submitted 13 February, 2020;
originally announced February 2020.
-
Design-Based Ratio Estimators and Central Limit Theorems for Clustered, Blocked RCTs
Authors:
Peter Z. Schochet,
Nicole E. Pashley,
Luke W. Miratrix,
Tim Kautz
Abstract:
This article develops design-based ratio estimators for clustered, blocked randomized controlled trials (RCTs), with an application to a federally funded, school-based RCT testing the effects of behavioral health interventions. We consider finite population weighted least squares estimators for average treatment effects (ATEs), allowing for general weighting schemes and covariates. We consider models with block-by-treatment status interactions as well as restricted models with block indicators only. We prove new finite population central limit theorems for each block specification. We also discuss simple variance estimators that share features with commonly used cluster-robust standard error estimators. Simulations show that the design-based ATE estimator yields nominal rejection rates with standard errors near true ones, even with few clusters.
Submitted 25 February, 2021; v1 submitted 4 February, 2020;
originally announced February 2020.
-
Lurking Inferential Monsters? Quantifying bias in non-experimental evaluations of school programs
Authors:
Ben Weidmann,
Luke Miratrix
Abstract:
This study examines whether unobserved factors substantially bias education evaluations that rely on the Conditional Independence Assumption. We add 14 new within-study comparisons to the literature, all from primary schools in England. Across these 14 studies, we generate 42 estimates of selection bias using a simple matching approach. A meta-analysis of the estimates suggests that the distribution of underlying bias is centered around zero. The mean absolute value of estimated bias is 0.03σ, and none of the 42 estimates are larger than 0.11σ. Results are similar for math, reading and writing outcomes. Overall, we find no evidence of substantial selection bias due to unobserved characteristics. These findings may not generalise easily to other settings or to more radical educational interventions, but they do suggest that non-experimental approaches could play a greater role than they currently do in generating reliable causal evidence for school education.
Submitted 15 October, 2019;
originally announced October 2019.
-
A Bayesian Nonparametric Approach to Geographic Regression Discontinuity Designs: Do School Districts Affect NYC House Prices?
Authors:
Maxime Rischard,
Zach Branson,
Luke Miratrix,
Luke Bornn
Abstract:
Most research on regression discontinuity designs (RDDs) has focused on univariate cases, where only those units with a "forcing" variable on one side of a threshold value receive a treatment. Geographical regression discontinuity designs (GeoRDDs) extend the RDD to multivariate settings with spatial forcing variables. We propose a framework for analysing GeoRDDs, which we implement using Gaussian process regression. This yields a Bayesian posterior distribution of the treatment effect at every point along the border. We address nuances of having a functional estimand defined on a border with potentially intricate topology, particularly when defining and estimating causal estimands of the local average treatment effect (LATE). The Bayesian estimate of the LATE can also be used as a test statistic in a hypothesis test with good frequentist properties, which we validate using simulations and placebo tests. We demonstrate our methodology with a dataset of property sales in New York City, to assess whether there is a discontinuity in housing prices at the border between two school districts. We find a statistically significant difference in price across the border between the districts with $p$=0.002, and estimate a 20% higher price on average for a house on the more desirable side.
Submitted 11 July, 2018;
originally announced July 2018.
-
Identifying and Estimating Principal Causal Effects in Multi-site Trials
Authors:
Lo-Hua Yuan,
Avi Feller,
Luke W. Miratrix
Abstract:
Randomized trials are often conducted with separate randomizations across multiple sites such as schools, voting districts, or hospitals. These sites can differ in important ways, including the site's implementation, local conditions, and the composition of individuals. An important question in practice is whether---and under what assumptions---researchers can leverage this cross-site variation to learn more about the intervention. We address these questions in the principal stratification framework, which describes causal effects for subgroups defined by post-treatment quantities. We show that researchers can estimate certain principal causal effects via the multi-site design if they are willing to impose the strong assumption that the site-specific effects are uncorrelated with the site-specific distribution of stratum membership. We motivate this approach with a multi-site trial of the Early College High School Initiative, a unique secondary education program with the goal of increasing high school graduation rates and college enrollment. Our analyses corroborate previous studies suggesting that the initiative had positive effects for students who would have otherwise attended a low-quality high school, although power is limited.
Submitted 15 March, 2018;
originally announced March 2018.
-
Randomization Tests that Condition on Non-Categorical Covariate Balance
Authors:
Zach Branson,
Luke Miratrix
Abstract:
A benefit of randomized experiments is that covariate distributions of treatment and control groups are balanced on average, resulting in simple unbiased estimators for treatment effects. However, it is possible that a particular randomization yields covariate imbalances that researchers want to address in the analysis stage through adjustment or other methods. Here we present a randomization test that conditions on covariate balance by only considering treatment assignments that are similar to the observed one in terms of covariate balance. Previous conditional randomization tests have only allowed for categorical covariates, while our randomization test allows for any type of covariate. Through extensive simulation studies, we find that our conditional randomization test is more powerful than unconditional randomization tests and other conditional tests. Furthermore, we find that our conditional randomization test is valid (1) unconditionally across levels of covariate balance, and (2) conditional on particular levels of covariate balance. Meanwhile, unconditional randomization tests are valid for (1) but not (2). Finally, we find that our conditional randomization test is similar to a randomization test that uses a model-adjusted test statistic.
Submitted 4 October, 2018; v1 submitted 3 February, 2018;
originally announced February 2018.
-
Matching with Text Data: An Experimental Evaluation of Methods for Matching Documents and of Measuring Match Quality
Authors:
Reagan Mozer,
Luke Miratrix,
Aaron Russell Kaufman,
L. Jason Anastasopoulos
Abstract:
Matching for causal inference is a well-studied problem, but standard methods fail when the units to match are text documents: the high-dimensional and rich nature of the data renders exact matching infeasible, causes propensity scores to produce incomparable matches, and makes assessing match quality difficult. In this paper, we characterize a framework for matching text documents that decomposes existing methods into: (1) the choice of text representation, and (2) the choice of distance metric. We investigate how different choices within this framework affect both the quantity and quality of matches identified through a systematic multifactor evaluation experiment using human subjects. Altogether we evaluate over 100 unique text matching methods along with 5 comparison methods taken from the literature. Our experimental results identify methods that generate matches with higher subjective match quality than current state-of-the-art techniques. We enhance the precision of these results by developing a predictive model to estimate the match quality of pairs of text documents as a function of our various distance scores. This model, which we find successfully mimics human judgment, also allows for approximate and unsupervised evaluation of new procedures. We then employ the identified best method to illustrate the utility of text matching in two applications. First, we engage with a substantive debate in the study of media bias by using text matching to control for topic selection when comparing news articles from thirteen news sources. We then show how conditioning on text data leads to more precise causal inferences in an observational study examining the effects of a medical intervention.
Submitted 13 March, 2019; v1 submitted 2 January, 2018;
originally announced January 2018.
-
Insights on Variance Estimation for Blocked and Matched Pairs Designs
Authors:
Nicole E. Pashley,
Luke W. Miratrix
Abstract:
Evaluating blocked randomized experiments from a potential outcomes perspective has two primary branches of work. The first focuses on larger blocks, with multiple treatment and control units in each block. The second focuses on matched pairs, with a single treatment and control unit in each block. These literatures not only provide different estimators for the standard errors of the estimated average impact, but they are also built on different sets of assumptions. Neither literature handles cases with blocks of varying size that contain singleton treatment or control units, a case which can occur in a variety of contexts, such as with different forms of matching or post-stratification. In this paper, we reconcile the literatures by carefully examining the performance of variance estimators under several different frameworks. We then use these insights to derive novel variance estimators for experiments containing blocks of different sizes.
Submitted 29 June, 2020; v1 submitted 27 October, 2017;
originally announced October 2017.
-
Beyond the Sharp Null: Randomization Inference, Bounded Null Hypotheses, and Confidence Intervals for Maximum Effects
Authors:
Devin Caughey,
Allan Dafoe,
Luke Miratrix
Abstract:
Fisherian randomization inference is often dismissed as testing an uninteresting and implausible hypothesis: the sharp null of no effects whatsoever. We show that this view is overly narrow. Many randomization tests are also valid under a more general "bounded" null hypothesis under which all effects are weakly negative (or positive), thus accommodating heterogeneous effects. By inverting such tests we can form one-sided confidence intervals for the maximum (or minimum) effect. These properties hold for all effect-increasing test statistics, which include both common statistics such as the mean difference and uncommon ones such as Stephenson rank statistics. The latter's sensitivity to extreme effects permits detection of positive effects even when the average effect is negative. We argue that bounded nulls are often of substantive or theoretical interest, and illustrate with two applications: testing monotonicity in an IV analysis and inferring effect sizes in a small randomized experiment.
Submitted 21 September, 2017;
originally announced September 2017.
-
Shape-constrained partial identification of a population mean under unknown probabilities of sample selection
Authors:
Luke W. Miratrix,
Stefan Wager,
Jose R. Zubizarreta
Abstract:
A prevailing challenge in the biomedical and social sciences is to estimate a population mean from a sample obtained with unknown selection probabilities. Using a well-known ratio estimator, Aronow and Lee (2013) proposed a method for partial identification of the mean by allowing the unknown selection probabilities to vary arbitrarily between two fixed extreme values. In this paper, we show how to leverage auxiliary shape constraints on the population outcome distribution, such as symmetry or log-concavity, to obtain tighter bounds on the population mean. We use this method to estimate the performance of Aymara students---an ethnic minority in the north of Chile---in a national educational standardized test. We implement this method in the new statistical software package scbounds for R.
Submitted 22 June, 2017;
originally announced June 2017.
-
Model-free causal inference of binary experimental data
Authors:
Peng Ding,
Luke W. Miratrix
Abstract:
For binary experimental data, we discuss randomization-based inferential procedures that do not need to invoke any modeling assumptions. We also introduce methods for likelihood and Bayesian inference based solely on the physical randomization without any hypothetical super population assumptions about the potential outcomes. These estimators have some properties superior to moment-based ones such as only giving estimates in regions of feasible support. Due to the lack of identification of the causal model, we also propose a sensitivity analysis approach which allows for the characterization of the impact of the association between the potential outcomes on statistical inference.
Submitted 23 May, 2017;
originally announced May 2017.
-
A Nonparametric Bayesian Methodology for Regression Discontinuity Designs
Authors:
Zach Branson,
Maxime Rischard,
Luke Bornn,
Luke Miratrix
Abstract:
One of the most popular methodologies for estimating the average treatment effect at the threshold in a regression discontinuity design is local linear regression (LLR), which places larger weight on units closer to the threshold. We propose a Gaussian process regression methodology that acts as a Bayesian analog to LLR for regression discontinuity designs. Our methodology provides a flexible fit for treatment and control responses by placing a general prior on the mean response functions. Furthermore, unlike LLR, our methodology can incorporate uncertainty in how units are weighted when estimating the treatment effect. We prove our method is consistent in estimating the average treatment effect at the threshold. Furthermore, we find via simulation that our method exhibits promising coverage, interval length, and mean squared error properties compared to standard LLR and state-of-the-art LLR methodologies. Finally, we explore the performance of our method on a real-world example by studying the impact of being a first-round draft pick on the performance and playing time of basketball players in the National Basketball Association.
Submitted 30 September, 2018; v1 submitted 16 April, 2017;
originally announced April 2017.
-
Worth Weighting? How to Think About and Use Weights in Survey Experiments
Authors:
Luke W. Miratrix,
Jasjeet S. Sekhon,
Alexander G. Theodoridis,
Luis F. Campos
Abstract:
The popularity of online surveys has increased the prominence of using weights that capture units' probabilities of inclusion for claims of representativeness. Yet, much uncertainty remains regarding how these weights should be employed in the analysis of survey experiments: Should they be used or ignored? If they are used, which estimators are preferred? We offer practical advice, rooted in the Neyman-Rubin model, for researchers producing and working with survey experimental data. We examine simple, efficient estimators for analyzing these data, and give formulae for their biases and variances. We provide simulations that examine these estimators as well as real examples from experiments administered online through YouGov. We find that for examining the existence of population treatment effects using high-quality, broadly representative samples recruited by top online survey firms, sample quantities, which do not rely on weights, are often sufficient. We found that Sample Average Treatment Effect (SATE) estimates did not appear to differ substantially from their weighted counterparts, and they avoided the substantial loss of statistical power that accompanies weighting. When precise estimates of Population Average Treatment Effects (PATE) are essential, we analytically show post-stratifying on survey weights and/or covariates highly correlated with the outcome to be a conservative choice. While we show these substantial gains in simulations, we find limited evidence of them in practice.
Submitted 15 August, 2017; v1 submitted 20 March, 2017;
originally announced March 2017.
-
Bridging Finite and Super Population Causal Inference
Authors:
Peng Ding,
Xinran Li,
Luke W. Miratrix
Abstract:
There are two general views in causal analysis of experimental data: the super population view that the units are an independent sample from some hypothetical infinite populations, and the finite population view that the potential outcomes of the experimental units are fixed and the randomness comes solely from the physical randomization of the treatment assignment. These two views differ conceptually and mathematically, resulting in different sampling variances of the usual difference-in-means estimator of the average causal effect. Practically, however, these two views result in identical variance estimators. By recalling a variance decomposition and exploiting a completeness-type argument, we establish a connection between these two views in completely randomized experiments. This alternative formulation could serve as a template for bridging finite and super population causal inference in other scenarios.
Submitted 27 February, 2017;
originally announced February 2017.
-
Prior matters: simple and general methods for evaluating and improving topic quality in topic modeling
Authors:
Angela Fan,
Finale Doshi-Velez,
Luke Miratrix
Abstract:
Latent Dirichlet Allocation (LDA) models trained without stopword removal often produce topics with high posterior probabilities on uninformative words, obscuring the underlying corpus content. Even when canonical stopwords are manually removed, uninformative words common in that corpus will still dominate the most probable words in a topic. In this work, we first show how the standard topic quality measures of coherence and pointwise mutual information act counter-intuitively in the presence of common but irrelevant words, making it difficult to even quantitatively identify situations in which topics may be dominated by stopwords. We propose an additional topic quality metric that targets the stopword problem, and show that it, unlike the standard measures, correctly correlates with human judgements of quality. We also propose a simple-to-implement strategy for generating topics that are evaluated to be of much higher quality by both human assessment and our new metric. This approach, a collection of informative priors easily introduced into most LDA-style inference methods, automatically promotes terms with domain relevance and demotes domain-specific stop words. We demonstrate this approach's effectiveness in three very different domains: Department of Labor accident reports, online health forum posts, and NIPS abstracts. Overall we find that current practices thought to solve this problem do not do so adequately, and that our proposal offers a substantial improvement for those interested in interpreting their topics as objects in their own right.
Submitted 14 October, 2017; v1 submitted 11 January, 2017;
originally announced January 2017.
-
Bounding, an accessible method for estimating principal causal effects, examined and explained
Authors:
Luke Miratrix,
Jane Furey,
Avi Feller,
Todd Grindal,
Lindsay C. Page
Abstract:
Estimating treatment effects for subgroups defined by post-treatment behavior (i.e., estimating causal effects in a principal stratification framework) can be technically challenging and heavily reliant on strong assumptions. We investigate an alternative path: using bounds to identify ranges of possible effects that are consistent with the data. This simple approach relies on fewer assumptions and yet can result in policy-relevant findings. As we show, covariates can be used to substantially tighten bounds in a straightforward manner. Via simulation, we demonstrate which types of covariates are maximally beneficial. We conclude with an analysis of a multi-site experimental study of Early College High Schools. When examining the program's impact on students completing the ninth grade "on-track" for college, we find little impact for ECHS students who would otherwise attend a high quality high school, but substantial effects for those who would not. This suggests potential benefit in expanding these programs in areas primarily served by lower quality schools.
Submitted 16 August, 2017; v1 submitted 11 January, 2017;
originally announced January 2017.
-
Principal Score Methods: Assumptions and Extensions
Authors:
Avi Feller,
Fabrizia Mealli,
Luke Miratrix
Abstract:
Researchers addressing post-treatment complications in randomized trials often turn to principal stratification to define relevant assumptions and quantities of interest. One approach for estimating causal effects in this framework is to use methods based on the "principal score," typically assuming that stratum membership is as-good-as-randomly assigned given a set of covariates. In this paper, we clarify the key assumption in this context, known as Principal Ignorability, and argue that versions of this assumption are quite strong in practice. We describe different estimation approaches and demonstrate that weighting-based methods are generally preferable to subgroup-based approaches that discretize the principal score. We then extend these ideas to the case of two-sided noncompliance and propose a natural framework for combining Principal Ignorability with exclusion restrictions and other assumptions. Finally, we apply these ideas to the Head Start Impact Study, a large-scale randomized evaluation of the Head Start program. Overall, we argue that, while principal score methods are useful tools, applied researchers should fully understand the relevant assumptions when using them in practice.
Submitted 8 June, 2016;
originally announced June 2016.
-
More Powerful Multiple Testing in Randomized Experiments with Non-Compliance
Authors:
Joseph J. Lee,
Laura Forastiere,
Luke Miratrix,
Natesh S. Pillai
Abstract:
Two common concerns raised in analyses of randomized experiments are (i) appropriately handling issues of non-compliance, and (ii) appropriately adjusting for multiple tests (e.g., on multiple outcomes or subgroups). Although simple intention-to-treat (ITT) and Bonferroni methods are valid in terms of type I error, they can each lead to a substantial loss of power; when employing both simultaneously, the total loss may be severe. Alternatives exist to address each concern. Here we propose an analysis method for experiments involving both features that merges posterior predictive $p$-values for complier causal effects with randomization-based multiple comparisons adjustments; the results are valid familywise tests that are doubly advantageous: more powerful than both those based on standard ITT statistics and those using traditional multiple comparison adjustments. The operating characteristics and advantages of our method are demonstrated through a series of simulated experiments and an analysis of the United States Job Training Partnership Act (JTPA) Study, where our methods lead to different conclusions regarding the significance of estimated JTPA effects.
Submitted 23 May, 2016;
originally announced May 2016.
-
Decomposing Treatment Effect Variation
Authors:
Peng Ding,
Avi Feller,
Luke Miratrix
Abstract:
Understanding and characterizing treatment effect variation in randomized experiments has become essential for going beyond the "black box" of the average treatment effect. Nonetheless, traditional statistical approaches often ignore or assume away such variation. In the context of randomized experiments, this paper proposes a framework for decomposing overall treatment effect variation into a systematic component explained by observed covariates and a remaining idiosyncratic component. Our framework is fully randomization-based, with estimates of treatment effect variation that are entirely justified by the randomization itself. Our framework can also account for noncompliance, which is an important practical complication. We make several contributions. First, we show that randomization-based estimates of systematic variation are very similar in form to estimates from fully-interacted linear regression and two stage least squares. Second, we use these estimators to develop an omnibus test for systematic treatment effect variation, both with and without noncompliance. Third, we propose an $R^2$-like measure of treatment effect variation explained by covariates and, when applicable, noncompliance. Finally, we assess these methods via simulation studies and apply them to the Head Start Impact Study, a large-scale randomized experiment.
Submitted 28 July, 2017; v1 submitted 20 May, 2016;
originally announced May 2016.
-
Weak separation in mixture models and implications for principal stratification
Authors:
Avi Feller,
Evan Greif,
Nhat Ho,
Luke Miratrix,
Natesh Pillai
Abstract:
Principal stratification is a widely used framework for addressing post-randomization complications. After using principal stratification to define causal effects of interest, researchers are increasingly turning to finite mixture models to estimate these quantities. Unfortunately, standard estimators of mixture parameters, like the MLE, are known to exhibit pathological behavior. We study this behavior in a simple but fundamental example, a two-component Gaussian mixture model in which only the component means and variances are unknown, and focus on the setting in which the components are weakly separated. In this case, we show that the asymptotic convergence rate of the MLE is quite poor, such as $O(n^{-1/6})$ or even $O(n^{-1/8})$. We then demonstrate via theoretical arguments as well as extensive simulations that, in finite samples, the MLE behaves like a threshold estimator, in the sense that the MLE can give strong evidence that the means are equal when the truth is otherwise. We also explore the behavior of the MLE when the MLE is non-zero, showing that it is difficult to estimate both the sign and magnitude of the means in this case. We provide diagnostics for all of these pathologies and apply these ideas to re-analyzing two randomized evaluations of job training programs, JOBS II and Job Corps. Our results suggest that the corresponding maximum likelihood estimates should be interpreted with caution in these cases.
Submitted 17 August, 2019; v1 submitted 21 February, 2016;
originally announced February 2016.
-
Conducting sparse feature selection on arbitrarily long phrases in text corpora with a focus on interpretability
Authors:
Luke Miratrix,
Robin Ackerman
Abstract:
We propose a general framework for topic-specific summarization of large text corpora, and illustrate how it can be used for analysis in two quite different contexts: an OSHA database of fatality and catastrophe reports (to facilitate surveillance for patterns in circumstances leading to injury or death) and legal decisions on workers' compensation claims (to explore relevant case law). Our summarization framework, built on sparse classification methods, is a compromise between simple word frequency based methods currently in wide use, and more heavyweight, model-intensive methods such as Latent Dirichlet Allocation (LDA). For a particular topic of interest (e.g., mental health disability, or chemical reactions), we regress a labeling of documents onto the high-dimensional counts of all the other words and phrases in the documents. The resulting small set of phrases found to be predictive is then harvested as the summary. Using a branch-and-bound approach, this method can be extended to allow for phrases of arbitrary length, which allows for potentially rich summarization. We discuss how focus on the purpose of the summaries can inform choices of regularization parameters and model constraints. We evaluate this tool by comparing computational time and summary statistics of the resulting word lists to three other methods in the literature. We also present a new R package, textreg. Overall, we argue that sparse methods have much to offer text analysis and are a branch of research that should be considered further in this context.
Submitted 22 July, 2016; v1 submitted 20 November, 2015;
originally announced November 2015.
-
Posterior Predictive P-values with Fisher Randomization Tests in Noncompliance Settings: Test Statistics vs Discrepancy Variables
Authors:
Laura Forastiere,
Fabrizia Mealli,
Luke Miratrix
Abstract:
In randomized experiments with noncompliance, tests may focus on compliers rather than on the overall sample. Rubin (1998) put forth such a method, and argued that testing for the complier average causal effect and averaging permutation-based p-values over the posterior distribution of the compliance status could increase power, as compared to general intent-to-treat tests. The general scheme is to repeatedly do a two-step process of imputing missing compliance statuses and conducting a permutation test with the completed data. In this paper, we explore this idea further, comparing the use of discrepancy measures, which depend on unknown but imputed parameters, to classical test statistics and exploring different approaches for imputing the unknown compliance statuses. We also examine consequences of model misspecification in the imputation step, and discuss to what extent this additional modeling undercuts the permutation test's model independence. We find that, especially for discrepancy measures, modeling choices can impact both power and validity. In particular, imputing missing compliance statuses assuming the null can radically reduce power, but not doing so can jeopardize validity. Fortunately, covariates predictive of compliance status can mitigate these results. Finally, we compare this overall approach to Bayesian model-based tests, that is, tests that are directly derived from posterior credible intervals, under both correct and incorrect model specification. We find that adding the permutation step in an otherwise Bayesian approach improves robustness to model specification without substantial loss of power.
Submitted 20 February, 2016; v1 submitted 2 November, 2015;
originally announced November 2015.
-
A conditional randomization test to account for covariate imbalance in randomized experiments
Authors:
Jonathan Hennessy,
Tirthankar Dasgupta,
Luke Miratrix,
Cassandra Pattanayak,
Pradipta Sarkar
Abstract:
We consider the conditional randomization test as a way to account for covariate imbalance in randomized experiments. The test accounts for covariate imbalance by comparing the observed test statistic to the null distribution of the test statistic conditional on the observed covariate imbalance. We prove that the conditional randomization test has the correct significance level and introduce original notation to describe covariate balance more formally. Through simulation, we verify that conditional randomization tests behave like more traditional forms of covariate adjustment but have the added benefit of having the correct conditional significance level. Finally, we apply the approach to a randomized product marketing experiment where covariate information was collected after randomization.
Submitted 20 April, 2017; v1 submitted 22 October, 2015;
originally announced October 2015.
-
Randomization Inference for Treatment Effect Variation
Authors:
Peng Ding,
Avi Feller,
Luke Miratrix
Abstract:
Applied researchers are increasingly interested in whether and how treatment effects vary in randomized evaluations, especially variation not explained by observed covariates. We propose a model-free approach for testing for the presence of such unexplained variation. To use this randomization-based approach, we must address the fact that the average treatment effect, generally the object of interest in randomized experiments, actually acts as a nuisance parameter in this setting. We explore potential solutions and advocate for a method that guarantees valid tests in finite samples despite this nuisance. We also show how this method readily extends to testing for heterogeneity beyond a given model, which can be useful for assessing the sufficiency of a given scientific theory. We finally apply our method to the National Head Start Impact Study, a large-scale randomized evaluation of a Federal preschool program, finding that there is indeed significant unexplained treatment effect variation.
Submitted 16 December, 2014;
originally announced December 2014.
-
To Adjust or Not to Adjust? Sensitivity Analysis of M-Bias and Butterfly-Bias
Authors:
Peng Ding,
Luke Miratrix
Abstract:
"M-Bias," as it is called in the epidemiologic literature, is the bias introduced by conditioning on a pretreatment covariate due to a particular "M-Structure" between two latent factors, an observed treatment, an outcome, and a "collider." This potential source of bias, which can occur even when the treatment and the outcome are not confounded, has been a source of considerable controversy. We here present formulae for identifying under which circumstances biases are inflated or reduced. In particular, we show that the magnitude of M-Bias in linear structural equation models tends to be relatively small compared to confounding bias, suggesting that it is generally not a serious concern in many applied settings. These theoretical results are consistent with recent empirical findings from simulation studies. We also generalize the M-Bias setting (1) to allow for the correlation between the latent factors to be nonzero, and (2) to allow for the collider to be a confounder between the treatment and the outcome. These results demonstrate that mild deviations from the M-Structure tend to increase confounding bias more rapidly than M-Bias, suggesting that choosing to condition on any given covariate is generally the superior choice. As an application, we re-examine a controversial example between Professors Donald Rubin and Judea Pearl.
Submitted 1 August, 2014;
originally announced August 2014.
-
Concise comparative summaries (CCS) of large text corpora with a human experiment
Authors:
Jinzhu Jia,
Luke Miratrix,
Bin Yu,
Brian Gawalt,
Laurent El Ghaoui,
Luke Barnesmoore,
Sophie Clavier
Abstract:
In this paper we propose a general framework for topic-specific summarization of large text corpora and illustrate how it can be used for the analysis of news databases. Our framework, concise comparative summarization (CCS), is built on sparse classification methods. CCS is a lightweight and flexible tool that offers a compromise between simple word frequency based methods currently in wide use and more heavyweight, model-intensive methods such as latent Dirichlet allocation (LDA). We argue that sparse methods have much to offer for text analysis and hope CCS opens the door for a new branch of research in this important field. For a particular topic of interest (e.g., China or energy), CCS automatically labels documents as being either on- or off-topic (usually via keyword search), and then uses sparse classification methods to predict these labels with the high-dimensional counts of all the other words and phrases in the documents. The resulting small set of phrases found to be predictive is then harvested as the summary. To validate our tool, we designed and conducted a human survey, using news articles from the New York Times international section, to compare the different summarizers with human understanding. We demonstrate our approach with two case studies, a media analysis of the framing of "Egypt" in the New York Times throughout the Arab Spring and an informal comparison of the New York Times' and Wall Street Journal's coverage of "energy." Overall, we find that the Lasso with $L^2$ normalization is an effective and useful way to summarize large corpora, regardless of document size.
Submitted 29 April, 2014;
originally announced April 2014.
-
Implementing Risk-Limiting Post-Election Audits in California
Authors:
Joseph Lorenzo Hall,
Luke W. Miratrix,
Philip B. Stark,
Melvin Briones,
Elaine Ginnold,
Freddie Oakley,
Martin Peaden,
Gail Pellerin,
Tom Stanionis,
Tricia Webber
Abstract:
Risk-limiting post-election audits limit the chance of certifying an electoral outcome if the outcome is not what a full hand count would show. Building on previous work, we report on pilot risk-limiting audits in four elections during 2008 in three California counties: one during the February 2008 Primary Election in Marin County and three during the November 2008 General Elections in Marin, Santa Cruz and Yolo Counties. We explain what makes an audit risk-limiting and how existing and proposed laws fall short. We discuss the differences among our four pilot audits. We identify challenges to practical, efficient risk-limiting audits and conclude that current approaches are too complex to be used routinely on a large scale. One important logistical bottleneck is the difficulty of exporting data from commercial election management systems in a format amenable to audit calculations. Finally, we propose a bare-bones risk-limiting audit that is less efficient than these pilot audits, but avoids many practical problems.
Submitted 10 July, 2009; v1 submitted 28 May, 2009;
originally announced May 2009.