The Impossibility of Fair LLMs

Authors: Jacy Anthis, Kristian Lum, Michael Ekstrand, Avi Feller, Alexander D'Amour, Chenhao Tan

Abstract: The need for fair AI is increasingly clear in the era of general-purpose systems such as ChatGPT, Gemini, and other large language models (LLMs). However, the increasing complexity of human-AI interaction and its social impacts have raised questions of how fairness standards could be applied. Here, we review the technical frameworks that machine learning researchers have used to evaluate fairness, such as group fairness and fair representations, and find that their application to LLMs faces inherent limitations. We show that each framework either does not logically extend to LLMs or presents a notion of fairness that is intractable for LLMs, primarily due to the multitudes of populations affected, sensitive attributes, and use cases. To address these challenges, we develop guidelines for the more realistic goal of achieving fairness in particular use cases: the criticality of context, the responsibility of LLM developers, and the need for stakeholder participation in an iterative process of design and evaluation. Moreover, it may eventually be possible and even necessary to use the general-purpose capabilities of AI systems to address fairness challenges as a form of scalable AI-assisted alignment.

Submitted 28 May, 2024; originally announced June 2024.

Comments: Presented at the 1st Human-Centered Evaluation and Auditing of Language Models (HEAL) workshop at CHI 2024

arXiv:2404.11506 [pdf, other]

Statistical methods to estimate the impact of gun policy on gun violence

Authors: Eli Ben-Michael, Mitchell L. Doucette, Avi Feller, Alexander D. McCourt, Elizabeth A. Stuart

Abstract: Gun violence is a critical public health and safety concern in the United States. There is considerable variability in policy proposals meant to curb gun violence, ranging from increasing gun availability to deter potential assailants (e.g., concealed carry laws or arming school teachers) to restricting access to firearms (e.g., universal background checks or banning assault weapons). Many studies use state-level variation in the enactment of these policies in order to quantify their effect on gun violence. In this paper, we discuss the policy trial emulation framework for evaluating the impact of these policies, and show how to apply this framework to estimating impacts via difference-in-differences and synthetic controls when there is staggered adoption of policies across jurisdictions, estimating the impacts of right-to-carry laws on violent crime as a case study.

Submitted 17 April, 2024; originally announced April 2024.

arXiv:2402.00168 [pdf, other]

Continuous Treatment Effects with Surrogate Outcomes

Authors: Zhenghao Zeng, David Arbour, Avi Feller, Raghavendra Addanki, Ryan Rossi, Ritwik Sinha, Edward H. Kennedy

Abstract: In many real-world causal inference applications, the primary outcomes (labels) are often partially missing, especially if they are expensive or difficult to collect. If the missingness depends on covariates (i.e., missingness is not completely at random), analyses based on fully observed samples alone may be biased. Incorporating surrogates, which are fully observed post-treatment variables related to the primary outcome, can improve estimation in this case. In this paper, we study the role of surrogates in estimating continuous treatment effects and propose a doubly robust method to efficiently incorporate surrogates in the analysis, which uses both labeled and unlabeled data and does not suffer from the above selection bias problem. Importantly, we establish the asymptotic normality of the proposed estimator and show possible improvements on the variance compared with methods that solely use labeled data. Extensive simulations show our methods enjoy appealing empirical performance.

Submitted 21 May, 2024; v1 submitted 31 January, 2024; originally announced February 2024.

Comments: 30 pages, 7 figures

arXiv:2401.12084 [pdf, other]

Temporal Aggregation for the Synthetic Control Method

Authors: Liyang Sun, Eli Ben-Michael, Avi Feller

Abstract: The synthetic control method (SCM) is a popular approach for estimating the impact of a treatment on a single unit with panel data. Two challenges arise with higher frequency data (e.g., monthly versus yearly): (1) achieving excellent pre-treatment fit is typically more challenging; and (2) overfitting to noise is more likely. Aggregating data over time can mitigate these problems but can also destroy important signal. In this paper, we bound the bias for SCM with disaggregated and aggregated outcomes and give conditions under which aggregating tightens the bounds. We then propose finding weights that balance both disaggregated and aggregated series.

Submitted 15 April, 2024; v1 submitted 22 January, 2024; originally announced January 2024.

Comments: 9 pages, 3 figures, Prepared for 2024 AEA Papers and Proceedings "Treatment Effects: Theory and Implementation"

arXiv:2311.16260 [pdf, other]

Using Multiple Outcomes to Improve the Synthetic Control Method

Authors: Liyang Sun, Eli Ben-Michael, Avi Feller

Abstract: When there are multiple outcome series of interest, Synthetic Control analyses typically proceed by estimating separate weights for each outcome. In this paper, we instead propose estimating a common set of weights across outcomes, by balancing either a vector of all outcomes or an index or average of them. Under a low-rank factor model, we show that these approaches lead to lower bias bounds than separate weights, and that averaging leads to further gains when the number of outcomes grows. We illustrate this via simulation and in a re-analysis of the impact of the Flint water crisis on educational outcomes.

Submitted 27 November, 2023; originally announced November 2023.

Comments: 36 pages, 6 figures

arXiv:2308.06913 [pdf, other]

Improving the Estimation of Site-Specific Effects and their Distribution in Multisite Trials

Authors: JoonHo Lee, Jonathan Che, Sophia Rabe-Hesketh, Avi Feller, Luke Miratrix

Abstract: In multisite trials, researchers are often interested in several inferential goals: estimating treatment effects for each site, ranking these effects, and studying their distribution. This study seeks to identify optimal methods for estimating these targets. Through a comprehensive simulation study, we assess two strategies and their combined effects: semiparametric modeling of the prior distribution, and alternative posterior summary methods tailored to minimize specific loss functions. Our findings highlight that the success of different estimation strategies depends largely on the amount of within-site and between-site information available from the data. We discuss how our results can guide balancing the trade-offs associated with shrinkage in limited data environments.

Submitted 1 April, 2024; v1 submitted 13 August, 2023; originally announced August 2023.

arXiv:2305.15851 [pdf, other]

On sampling determinantal and Pfaffian point processes on a quantum computer

Authors: Rémi Bardenet, Michaël Fanuel, Alexandre Feller

Abstract: DPPs were introduced by Macchi as a model in quantum optics in the 1970s. Since then, they have been widely used as models and subsampling tools in statistics and computer science. Most applications require sampling from a DPP, and given their quantum origin, it is natural to wonder whether sampling a DPP on a quantum computer is easier than on a classical one. We focus here on DPPs over a finite state space, which are distributions over the subsets of $\{1,\dots,N\}$ parametrized by an $N\times N$ Hermitian kernel matrix. Vanilla sampling consists of two steps, of respective costs $\mathcal{O}(N^3)$ and $\mathcal{O}(Nr^2)$ operations on a classical computer, where $r$ is the rank of the kernel matrix. A large first part of the current paper consists in explaining why the state-of-the-art in quantum simulation of fermionic systems already yields quantum DPP sampling algorithms. We then modify existing quantum circuits, and discuss their insertion in a full DPP sampling pipeline that starts from practical kernel specifications. The bottom line is that, with $P$ (classical) parallel processors, we can divide the preprocessing cost by $P$ and build a quantum circuit with $\mathcal{O}(Nr)$ gates that samples a given DPP, with depth varying from $\mathcal{O}(N)$ to $\mathcal{O}(r\log N)$ depending on qubit-communication constraints on the target machine. We also connect existing work on the simulation of superconductors to Pfaffian point processes, which generalize DPPs and would be a natural addition to the machine learner's toolbox. In particular, we describe "projective" Pfaffian point processes, the cardinality of which has constant parity, almost surely. Finally, the circuits are empirically validated on a classical simulator and on 5-qubit IBM machines.

Submitted 22 November, 2023; v1 submitted 25 May, 2023; originally announced May 2023.

Comments: 53 pages, 9 figures. Additional results about parity of cardinality of PfPP samples. Minor corrections in Section 5 and slight generalization of Lemma 5.4. Extra example and derivations in appendix

arXiv:2304.14545 [pdf, other]

Augmented balancing weights as linear regression

Authors: David Bruns-Smith, Oliver Dukes, Avi Feller, Elizabeth L. Ogburn

Abstract: We provide a novel characterization of augmented balancing weights, also known as automatic debiased machine learning (AutoDML). These popular doubly robust or de-biased machine learning estimators combine outcome modeling with balancing weights - weights that achieve covariate balance directly in lieu of estimating and inverting the propensity score. When the outcome and weighting models are both linear in some (possibly infinite) basis, we show that the augmented estimator is equivalent to a single linear model with coefficients that combine the coefficients from the original outcome model and coefficients from an unpenalized ordinary least squares (OLS) fit on the same data. We see that, under certain choices of regularization parameters, the augmented estimator often collapses to the OLS estimator alone; this occurs for example in a re-analysis of the Lalonde 1986 dataset. We then extend these results to specific choices of outcome and weighting models. We first show that the augmented estimator that uses (kernel) ridge regression for both outcome and weighting models is equivalent to a single, undersmoothed (kernel) ridge regression. This holds numerically in finite samples and lays the groundwork for a novel analysis of undersmoothing and asymptotic rates of convergence. When the weighting model is instead lasso-penalized regression, we give closed-form expressions for special cases and demonstrate a "double selection" property. Our framework opens the black box on this increasingly popular class of estimators, bridges the gap between existing results on the semiparametric efficiency of undersmoothed and doubly robust estimators, and provides new insights into the performance of augmented balancing weights.

Submitted 5 June, 2024; v1 submitted 27 April, 2023; originally announced April 2023.

arXiv:2209.04321 [pdf, other]

Estimating Racial Disparities in Emergency General Surgery

Authors: Eli Ben-Michael, Avi Feller, Rachel Kelz, Luke Keele

Abstract: Research documents that Black patients experience worse general surgery outcomes than white patients in the United States. In this paper, we focus on an important but less-examined category: the surgical treatment of emergency general surgery (EGS) conditions, which refers to medical emergencies where the injury is "endogenous," such as a burst appendix. Our goal is to assess racial disparities for common outcomes after EGS treatment using an administrative database of hospital claims in New York, Florida, and Pennsylvania, and to understand the extent to which differences are attributable to patient-level risk factors versus hospital-level factors. To do so, we use a class of linear weighting estimators that re-weight white patients to have a similar distribution of baseline characteristics as Black patients. This framework nests many common approaches, including matching and linear regression, but offers important advantages over these methods in terms of controlling imbalance between groups, minimizing extrapolation, and reducing computation time. Applying this approach to the claims data, we find that disparities estimates that adjust for the admitting hospital are substantially smaller than estimates that adjust for patient baseline characteristics only, suggesting that hospital-specific factors are important drivers of racial disparities in EGS outcomes.

Submitted 9 November, 2023; v1 submitted 9 September, 2022; originally announced September 2022.

arXiv:2203.09557 [pdf, other]

Outcome Assumptions and Duality Theory for Balancing Weights

Authors: David Bruns-Smith, Avi Feller

Abstract: We study balancing weight estimators, which reweight outcomes from a source population to estimate missing outcomes in a target population. These estimators minimize the worst-case error by making an assumption about the outcome model. In this paper, we show that this outcome assumption has two immediate implications. First, we can replace the minimax optimization problem for balancing weights with a simple convex loss over the assumed outcome function class. Second, we can replace the commonly-made overlap assumption with a more appropriate quantitative measure, the minimum worst-case bias. Finally, we show conditions under which the weights remain robust when our assumptions on the outcomes are wrong.

Submitted 17 March, 2022; originally announced March 2022.

Comments: To appear in AISTATS 2022

arXiv:2110.14831 [pdf, ps, other]

The Balancing Act in Causal Inference

Authors: Eli Ben-Michael, Avi Feller, David A. Hirshberg, José R. Zubizarreta

Abstract: The idea of covariate balance is at the core of causal inference. Inverse propensity weights play a central role because they are the unique set of weights that balance the covariate distributions of different treatment groups. We discuss two broad approaches to estimating these weights: the more traditional one, which fits a propensity score model and then uses the reciprocal of the estimated propensity score to construct weights, and the balancing approach, which estimates the inverse propensity weights essentially by the method of moments, finding weights that achieve balance in the sample. We review ideas from the causal inference, sample surveys, and semiparametric estimation literatures, with particular attention to the role of balance as a sufficient condition for robust inference. We focus on the inverse propensity weighting and augmented inverse propensity weighting estimators for the average treatment effect given strong ignorability and consider generalizations for a broader class of problems including policy evaluation and the estimation of individualized treatment effects.

Submitted 27 October, 2021; originally announced October 2021.

Comments: 42 pages, 0 figures

MSC Class: 62Gxx

arXiv:2110.07006 [pdf, other]

Estimating the effects of a California gun control program with Multitask Gaussian Processes

Authors: Eli Ben-Michael, David Arbour, Avi Feller, Alex Franks, Steven Raphael

Abstract: Gun violence is a critical public safety concern in the United States. In 2006 California implemented a unique firearm monitoring program, the Armed and Prohibited Persons System (APPS), to address gun violence in the state. The APPS program first identifies those firearm owners who become prohibited from owning one due to federal or state law, then confiscates their firearms. Our goal is to assess the effect of APPS on California murder rates using annual, state-level crime data across the US for the years before and after the introduction of the program. To do so, we adapt a non-parametric Bayesian approach, multitask Gaussian Processes (MTGPs), to the panel data setting. MTGPs allow for flexible and parsimonious panel data models that nest many existing approaches and allow for direct control over both dependence across time and dependence across units, as well as natural uncertainty quantification. We extend this approach to incorporate non-Normal outcomes, auxiliary covariates, and multiple outcome series, which are all important in our application. We also show that this approach has attractive Frequentist properties, including a representation as a weighting estimator with separate weights over units and time periods. Applying this approach, we find that the increased monitoring and enforcement from the APPS program substantially decreased homicides in California. We also find that the effect on murder is driven entirely by declines in gun-related murder with no measurable effect on non-gun murder. Estimated costs per murder avoided are substantially lower than conventional estimates of the value of a statistical life, suggesting a very high benefit-cost ratio for this enforcement effort.

Submitted 8 June, 2022; v1 submitted 13 October, 2021; originally announced October 2021.

arXiv:2103.14765 [pdf, ps, other]

Is it who you are or where you are? Accounting for compositional differences in cross-site treatment variation

Authors: Benjamin Lu, Eli Ben-Michael, Avi Feller, Luke Miratrix

Abstract: Multisite trials, in which treatment is randomized separately in multiple sites, offer a unique opportunity to disentangle treatment effect variation due to "compositional" differences in the distributions of unit-level features from variation due to "contextual" differences in site-level features. In particular, if we can re-weight (or "transport") each site to have a common distribution of unit-level covariates, the remaining effect variation captures contextual differences across sites. In this paper, we develop a framework for transporting effects in multisite trials using approximate balancing weights, where the weights are chosen to directly optimize unit-level covariate balance between each site and the target distribution. We first develop our approach for the general setting of transporting the effect of a single-site trial. We then extend our method to multisite trials, assess its performance via simulation, and use it to analyze a series of multisite trials of welfare-to-work programs. Our method is available in the balancer R package.

Submitted 26 March, 2021; originally announced March 2021.

Comments: 22 pages, 9 figures

arXiv:2102.13218 [pdf, other]

doi 10.1093/jrsssa/qnad032

Interpretable Sensitivity Analysis for Balancing Weights

Authors: Dan Soriano, Eli Ben-Michael, Peter J. Bickel, Avi Feller, Samuel D. Pimentel

Abstract: Assessing sensitivity to unmeasured confounding is an important step in observational studies, which typically estimate effects under the assumption that all confounders are measured. In this paper, we develop a sensitivity analysis framework for balancing weights estimators, an increasingly popular approach that solves an optimization problem to obtain weights that directly minimize covariate imbalance. In particular, we adapt a sensitivity analysis framework using the percentile bootstrap for a broad class of balancing weights estimators. We prove that the percentile bootstrap procedure can, with only minor modifications, yield valid confidence intervals for causal effects under restrictions on the level of unmeasured confounding. We also propose an amplification to allow for interpretable sensitivity parameters in the balancing weights framework. We illustrate our method through extensive real data examples.

Submitted 31 August, 2023; v1 submitted 25 February, 2021; originally announced February 2021.

arXiv:2102.09052 [pdf, other]

Multilevel calibration weighting for survey data

Authors: Eli Ben-Michael, Avi Feller, Erin Hartman

Abstract: In the November 2016 U.S. presidential election, many state level public opinion polls, particularly in the Upper Midwest, incorrectly predicted the winning candidate. One leading explanation for this polling miss is that the precipitous decline in traditional polling response rates led to greater reliance on statistical methods to adjust for the corresponding bias -- and that these methods failed to adjust for important interactions between key variables like education, race, and geographic region. Finding calibration weights that account for important interactions remains challenging with traditional survey methods: raking typically balances the margins alone, while post-stratification, which exactly balances all interactions, is only feasible for a small number of variables. In this paper, we propose multilevel calibration weighting, which enforces tight balance constraints for marginal balance and looser constraints for higher-order interactions. This incorporates some of the benefits of post-stratification while retaining the guarantees of raking. We then correct for the bias due to the relaxed constraints via a flexible outcome model; we call this approach Double Regression with Post-stratification (DRP). We characterize the asymptotic properties of these estimators and show that the proposed calibration approach has a dual representation as a multilevel model for survey response. We then use these tools to re-assess a large-scale survey of voter intention in the 2016 U.S. presidential election, finding meaningful gains from the proposed methods. The approach is available in the multical R package.

Submitted 12 November, 2021; v1 submitted 17 February, 2021; originally announced February 2021.

arXiv:2011.05826 [pdf]

A trial emulation approach for policy evaluations with group-level longitudinal data

Authors: Eli Ben-Michael, Avi Feller, Elizabeth A. Stuart

Abstract: To limit the spread of the novel coronavirus, governments across the world implemented extraordinary physical distancing policies, such as stay-at-home orders, and numerous studies aim to estimate their effects. Many statistical and econometric methods, such as difference-in-differences, leverage repeated measurements and variation in timing to estimate policy effects, including in the COVID-19 context. While these methods are less common in epidemiology, epidemiologic researchers are well accustomed to handling similar complexities in studies of individual-level interventions. "Target trial emulation" emphasizes the need to carefully design a non-experimental study in terms of inclusion and exclusion criteria, covariates, exposure definition, and outcome measurement -- and the timing of those variables. We argue that policy evaluations using group-level longitudinal ("panel") data need to take a similar careful approach to study design, which we refer to as "policy trial emulation." This is especially important when intervention timing varies across jurisdictions; the main idea is to construct target trials separately for each "treatment cohort" (states that implement the policy at the same time) and then aggregate. We present a stylized analysis of the impact of state-level stay-at-home orders on total coronavirus cases. We argue that estimates from panel methods -- with the right data and careful modeling and diagnostics -- can help add to our understanding of many policies, though doing so is often challenging.

Submitted 11 November, 2020; originally announced November 2020.

Comments: Forthcoming at Epidemiology

arXiv:2009.01940 [pdf]

COVID-19 Policy Impact Evaluation: A guide to common design issues

Authors: Noah A Haber, Emma Clarke-Deelder, Joshua A Salomon, Avi Feller, Elizabeth A Stuart

Abstract: Policy responses to COVID-19, particularly those related to non-pharmaceutical interventions, are unprecedented in scale and scope. Epidemiologists are more involved in policy decisions and evidence generation than ever before. However, policy impact evaluations always require a complex combination of circumstance, study design, data, statistics, and analysis. Beyond the issues that are faced for any policy, evaluation of COVID-19 policies is complicated by additional challenges related to infectious disease dynamics and lags, lack of direct observation of key outcomes, and a multiplicity of interventions occurring on an accelerated time scale. The methods needed for policy-level impact evaluation are not often used or taught in epidemiology, and differ in important ways that may not be obvious. The volume, speed, and methodological complications of policy evaluations can make it difficult for decision-makers and researchers to synthesize and evaluate strength of evidence in COVID-19 health policy papers. In this paper, we (1) introduce the basic suite of policy impact evaluation designs for observational data, including cross-sectional analyses, pre/post, interrupted time-series, and difference-in-differences analysis, (2) demonstrate key ways in which the requirements and assumptions underlying these designs are often violated in the context of COVID-19, and (3) provide decision-makers and reviewers a conceptual and graphical guide to identifying these key violations. The overall goal of this paper is to help epidemiologists, policy-makers, journal editors, journalists, researchers, and other research consumers understand and weigh the strengths and limitations of evidence that is essential to decision-making.

Submitted 16 April, 2021; v1 submitted 3 September, 2020; originally announced September 2020.

arXiv:2008.04394 [pdf, other]

Varying impacts of letters of recommendation on college admissions: Approximate balancing weights for subgroup effects in observational studies

Authors: Eli Ben-Michael, Avi Feller, Jesse Rothstein

Abstract: In a pilot program during the 2016-17 admissions cycle, the University of California, Berkeley invited many applicants for freshman admission to submit letters of recommendation. We use this pilot as the basis for an observational study of the impact of submitting letters of recommendation on subsequent admission, with the goal of estimating how impacts vary across pre-defined subgroups. Understanding this variation is challenging in observational studies, however, because estimated impacts reflect both actual treatment effect variation and differences in covariate balance across groups. To address this, we develop balancing weights that directly optimize for ``local balance'' within subgroups while maintaining global covariate balance between treated and control units. We then show that this approach has a dual representation as a form of inverse propensity score weighting with a hierarchical propensity score model. In the UC Berkeley pilot study, our proposed approach yields excellent local and global balance, unlike more traditional weighting methods, which fail to balance covariates within subgroups. We find that the impact of letters of recommendation increases with the predicted probability of admission, with mixed evidence of differences for under-represented minority applicants.

Submitted 22 February, 2021; v1 submitted 10 August, 2020; originally announced August 2020.

arXiv:2007.09056 [pdf, other]

Hospital Quality Risk Standardization via Approximate Balancing Weights

Authors: Luke Keele, Eli Ben-Michael, Avi Feller, Rachel Kelz, Luke Miratrix

Abstract: Comparing outcomes across hospitals, often to identify underperforming hospitals, is a critical task in health services research. However, naive comparisons of average outcomes, such as surgery complication rates, can be misleading because hospital case mixes differ -- a hospital's overall complication rate may be lower due to more effective treatments or simply because the hospital serves a healthier population overall. In this paper, we develop a method of ``direct standardization'' where we re-weight each hospital patient population to be representative of the overall population and then compare the weighted averages across hospitals. Adapting methods from survey sampling and causal inference, we find weights that directly control for imbalance between the hospital patient mix and the target population, even across many patient attributes. Critically, these balancing weights can also be tuned to preserve sample size for more precise estimates. We also derive principled measures of statistical precision, and use outcome modeling and Bayesian shrinkage to increase precision and account for variation in hospital size. We demonstrate these methods using claims data from Pennsylvania, Florida, and New York, estimating standardized hospital complication rates for general surgery patients. We conclude with a discussion of how to detect low performing hospitals.

Submitted 15 February, 2021; v1 submitted 17 July, 2020; originally announced July 2020.

arXiv:1912.03290 [pdf, other]

Synthetic Controls with Staggered Adoption

Authors: Eli Ben-Michael, Avi Feller, Jesse Rothstein

Abstract: Staggered adoption of policies by different units at different times creates promising opportunities for observational causal inference. Estimation remains challenging, however, and common regression methods can give misleading results. A promising alternative is the synthetic control method (SCM), which finds a weighted average of control units that closely balances the treated unit's pre-treatment outcomes. In this paper, we generalize SCM, originally designed to study a single treated unit, to the staggered adoption setting. We first bound the error for the average effect and show that it depends on both the imbalance for each treated unit separately and the imbalance for the average of the treated units. We then propose "partially pooled" SCM weights to minimize a weighted combination of these measures; approaches that focus only on balancing one of the two components can lead to bias. We extend this approach to incorporate unit-level intercept shifts and auxiliary covariates. We assess the performance of the proposed method via extensive simulations and apply our results to the question of whether teacher collective bargaining leads to higher school spending, finding minimal impacts. We implement the proposed method in the augsynth R package.

Submitted 15 January, 2021; v1 submitted 6 December, 2019; originally announced December 2019.

arXiv:1910.10862 [pdf, other]

A Graph-Theoretic Approach to Randomization Tests of Causal Effects Under General Interference

Authors: David Puelz, Guillaume Basse, Avi Feller, Panos Toulis

Abstract: Interference exists when a unit's outcome depends on another unit's treatment assignment. For example, intensive policing on one street could have a spillover effect on neighboring streets. Classical randomization tests typically break down in this setting because many null hypotheses of interest are no longer sharp under interference. A promising alternative is to instead construct a conditional randomization test on a subset of units and assignments for which a given null hypothesis is sharp. Finding these subsets is challenging, however, and existing methods are limited to special cases or have limited power. In this paper, we propose valid and easy-to-implement randomization tests for a general class of null hypotheses under arbitrary interference between units. Our key idea is to represent the hypothesis of interest as a bipartite graph between units and assignments, and to find an appropriate biclique of this graph. Importantly, the null hypothesis is sharp within this biclique, enabling conditional randomization-based tests. We also connect the size of the biclique to statistical power. Moreover, we can apply off-the-shelf graph clustering methods to find such bicliques efficiently and at scale. We illustrate our approach in settings with clustered interference and show advantages over methods designed specifically for that setting. We then apply our method to a large-scale policing experiment in Medellin, Colombia, where interference has a spatial structure.

Submitted 25 May, 2021; v1 submitted 23 October, 2019; originally announced October 2019.

arXiv:1907.07592 [pdf, other]

Assessing Treatment Effect Variation in Observational Studies: Results from a Data Challenge

Authors: Carlos Carvalho, Avi Feller, Jared Murray, Spencer Woody, David Yeager

Abstract: A growing number of methods aim to assess the challenging question of treatment effect variation in observational studies. This special section of "Observational Studies" reports the results of a workshop conducted at the 2018 Atlantic Causal Inference Conference designed to understand the similarities and differences across these methods. We invited eight groups of researchers to analyze a synthetic observational data set that was generated using a recent large-scale randomized trial in education. Overall, participants employed a diverse set of methods, ranging from matching and flexible outcome modeling to semiparametric estimation and ensemble approaches. While there was broad consensus on the topline estimate, there were also large differences in estimated treatment effect moderation. This highlights the fact that estimating varying treatment effects in observational studies is often more challenging than estimating the average treatment effect alone. We suggest several directions for future work arising from this workshop.

Submitted 13 September, 2019; v1 submitted 17 July, 2019; originally announced July 2019.

Comments: 15 pages, 4 figures, 2018 Atlantic Causal Inference Conference

arXiv:1904.02308 [pdf, other]

Randomization tests for peer effects in group formation experiments

Authors: Guillaume Basse, Peng Ding, Avi Feller, Panos Toulis

Abstract: Measuring the effect of peers on individuals' outcomes is a challenging problem, in part because individuals often select peers who are similar in both observable and unobservable ways. Group formation experiments avoid this problem by randomly assigning individuals to groups and observing their responses; for example, do first-year students have better grades when they are randomly assigned roommates who have stronger academic backgrounds? In this paper, we propose randomization-based permutation tests for group formation experiments, extending classical Fisher Randomization Tests to this setting. The proposed tests are justified by the randomization itself, require relatively few assumptions, and are exact in finite-samples. This approach can also complement existing strategies, such as linear-in-means models, by using a regression coefficient as the test statistic. We apply the proposed tests to two recent group formation experiments.

Submitted 7 March, 2023; v1 submitted 3 April, 2019; originally announced April 2019.

arXiv:1811.04170 [pdf, other]

The Augmented Synthetic Control Method

Authors: Eli Ben-Michael, Avi Feller, Jesse Rothstein

Abstract: The synthetic control method (SCM) is a popular approach for estimating the impact of a treatment on a single unit in panel data settings. The "synthetic control" is a weighted average of control units that balances the treated unit's pre-treatment outcomes as closely as possible. A critical feature of the original proposal is to use SCM only when the fit on pre-treatment outcomes is excellent. We propose Augmented SCM as an extension of SCM to settings where such pre-treatment fit is infeasible. Analogous to bias correction for inexact matching, Augmented SCM uses an outcome model to estimate the bias due to imperfect pre-treatment fit and then de-biases the original SCM estimate. Our main proposal, which uses ridge regression as the outcome model, directly controls pre-treatment fit while minimizing extrapolation from the convex hull. This estimator can also be expressed as a solution to a modified synthetic controls problem that allows negative weights on some donor units. We bound the estimation error of this approach under different data generating processes, including a linear factor model, and show how regularization helps to avoid over-fitting to noise. We demonstrate gains from Augmented SCM with extensive simulation studies and apply this framework to estimate the impact of the 2012 Kansas tax cuts on economic growth. We implement the proposed method in the new augsynth R package.

Submitted 23 July, 2020; v1 submitted 9 November, 2018; originally announced November 2018.

arXiv:1809.00399 [pdf, other]

Flexible sensitivity analysis for observational studies without observable implications

Authors: Alexander Franks, Alexander D'Amour, Avi Feller

Abstract: A fundamental challenge in observational causal inference is that assumptions about unconfoundedness are not testable from data. Assessing sensitivity to such assumptions is therefore important in practice. Unfortunately, some existing sensitivity analysis approaches inadvertently impose restrictions that are at odds with modern causal inference methods, which emphasize flexible models for observed data. To address this issue, we propose a framework that allows (1) flexible models for the observed data and (2) clean separation of the identified and unidentified parts of the sensitivity model. Our framework extends an approach from the missing data literature, known as Tukey's factorization, to the causal inference setting. Under this factorization, we can represent the distributions of unobserved potential outcomes in terms of unidentified selection functions that posit an unidentified relationship between the treatment assignment indicator and the observed potential outcomes. The sensitivity parameters in this framework are easily interpreted, and we provide heuristics for calibrating these parameters against observable quantities. We demonstrate the flexibility of this approach in two examples, where we estimate both average treatment effects and quantile treatment effects using Bayesian nonparametric models for the observed data.

Submitted 13 January, 2019; v1 submitted 2 September, 2018; originally announced September 2018.

arXiv:1805.01868 [pdf, other]

Algorithmic Decision Making in the Presence of Unmeasured Confounding

Authors: Jongbin Jung, Ravi Shroff, Avi Feller, Sharad Goel

Abstract: On a variety of complex decision-making tasks, from doctors prescribing treatment to judges setting bail, machine learning algorithms have been shown to outperform expert human judgments. One complication, however, is that it is often difficult to anticipate the effects of algorithmic policies prior to deployment, making the decision to adopt them risky. In particular, one generally cannot use historical data to directly observe what would have happened had the actions recommended by the algorithm been taken. One standard strategy is to model potential outcomes for alternative decisions assuming that there are no unmeasured confounders (i.e., to assume ignorability). But if this ignorability assumption is violated, the predicted and actual effects of an algorithmic policy can diverge sharply. In this paper we present a flexible, Bayesian approach to gauge the sensitivity of predicted policy outcomes to unmeasured confounders. We show that this policy evaluation problem is a generalization of estimating heterogeneous treatment effects in observational studies, and so our methods can immediately be applied to that setting. Finally, we show, both theoretically and empirically, that under certain conditions it is possible to construct near-optimal algorithmic policies even when ignorability is violated. We demonstrate the efficacy of our methods on a large dataset of judicial actions, in which one must decide whether defendants awaiting trial should be required to pay bail or can be released without payment.

Submitted 4 May, 2018; originally announced May 2018.

arXiv:1803.06048 [pdf, other]

Identifying and Estimating Principal Causal Effects in Multi-site Trials

Authors: Lo-Hua Yuan, Avi Feller, Luke W. Miratrix

Abstract: Randomized trials are often conducted with separate randomizations across multiple sites such as schools, voting districts, or hospitals. These sites can differ in important ways, including the site's implementation, local conditions, and the composition of individuals. An important question in practice is whether---and under what assumptions---researchers can leverage this cross-site variation to learn more about the intervention. We address these questions in the principal stratification framework, which describes causal effects for subgroups defined by post-treatment quantities. We show that researchers can estimate certain principal causal effects via the multi-site design if they are willing to impose the strong assumption that the site-specific effects are uncorrelated with the site-specific distribution of stratum membership. We motivate this approach with a multi-site trial of the Early College High School Initiative, a unique secondary education program with the goal of increasing high school graduation rates and college enrollment. Our analyses corroborate previous studies suggesting that the initiative had positive effects for students who would have otherwise attended a low-quality high school, although power is limited.

Submitted 15 March, 2018; originally announced March 2018.

arXiv:1709.08036 [pdf, other]

Conditional randomization tests of causal effects with interference between units

Authors: Guillaume Basse, Avi Feller, Panos Toulis

Abstract: Many causal questions involve interactions between units, also known as interference, for example between individuals in households, students in schools, or firms in markets. In this paper, we formalize the concept of a conditioning mechanism, which provides a framework for constructing valid and powerful randomization tests under general forms of interference. We describe our framework in the context of two-stage randomized designs and apply our approach to a randomized evaluation of an intervention targeting student absenteeism in the School District of Philadelphia. We show improvements over existing methods in terms of computational and statistical power.

Submitted 24 September, 2018; v1 submitted 23 September, 2017; originally announced September 2017.

Comments: Accepted for publication in Biometrika

arXiv:1701.08230 [pdf, other]

doi 10.1145/3097983.309809

Algorithmic decision making and the cost of fairness

Authors: Sam Corbett-Davies, Emma Pierson, Avi Feller, Sharad Goel, Aziz Huq

Abstract: Algorithms are now regularly used to decide whether defendants awaiting trial are too dangerous to be released back into the community. In some cases, black defendants are substantially more likely than white defendants to be incorrectly classified as high risk. To mitigate such disparities, several techniques recently have been proposed to achieve algorithmic fairness. Here we reformulate algorithmic fairness as constrained optimization: the objective is to maximize public safety while satisfying formal fairness constraints designed to reduce racial disparities. We show that for several past definitions of fairness, the optimal algorithms that result require detaining defendants above race-specific risk thresholds. We further show that the optimal unconstrained algorithm requires applying a single, uniform threshold to all defendants. The unconstrained algorithm thus maximizes public safety while also satisfying one important understanding of equality: that all individuals are held to the same standard, irrespective of race. Because the optimal constrained and unconstrained algorithms generally differ, there is tension between improving public safety and satisfying prevailing notions of algorithmic fairness. By examining data from Broward County, Florida, we show that this trade-off can be large in practice. We focus on algorithms for pretrial release decisions, but the principles we discuss apply to other domains, and also to human decision makers carrying out structured decision rules.

Submitted 9 June, 2017; v1 submitted 27 January, 2017; originally announced January 2017.

Comments: To appear in Proceedings of KDD'17

arXiv:1701.03139 [pdf, other]

Bounding, an accessible method for estimating principal causal effects, examined and explained

Authors: Luke Miratrix, Jane Furey, Avi Feller, Todd Grindal, Lindsay C. Page

Abstract: Estimating treatment effects for subgroups defined by post-treatment behavior (i.e., estimating causal effects in a principal stratification framework) can be technically challenging and heavily reliant on strong assumptions. We investigate an alternative path: using bounds to identify ranges of possible effects that are consistent with the data. This simple approach relies on fewer assumptions and yet can result in policy-relevant findings. As we show, covariates can be used to substantially tighten bounds in a straightforward manner. Via simulation, we demonstrate which types of covariates are maximally beneficial. We conclude with an analysis of a multi-site experimental study of Early College High Schools. When examining the program's impact on students completing the ninth grade "on-track" for college, we find little impact for ECHS students who would otherwise attend a high quality high school, but substantial effects for those who would not. This suggests potential benefit in expanding these programs in areas primarily served by lower quality schools.

Submitted 16 August, 2017; v1 submitted 11 January, 2017; originally announced January 2017.

arXiv:1608.06805 [pdf, other]

Analyzing two-stage experiments in the presence of interference

Authors: Guillaume Basse, Avi Feller

Abstract: Two-stage randomization is a powerful design for estimating treatment effects in the presence of interference; that is, when one individual's treatment assignment affects another individual's outcomes. Our motivating example is a two-stage randomized trial evaluating an intervention to reduce student absenteeism in the School District of Philadelphia. In that experiment, households with multiple students were first assigned to treatment or control; then, in treated households, one student was randomly assigned to treatment. Using this example, we highlight key considerations for analyzing two-stage experiments in practice. Our first contribution is to address additional complexities that arise when household sizes vary; in this case, researchers must decide between assigning equal weight to households or equal weight to individuals. We propose unbiased estimators for a broad class of individual- and household-weighted estimands, with corresponding theoretical and estimated variances. Our second contribution is to connect two common approaches for analyzing two-stage designs: linear regression and randomization inference. We show that, with suitably chosen standard errors, these two approaches yield identical point and variance estimates, which is somewhat surprising given the complex randomization scheme. Finally, we explore options for incorporating covariates to improve precision. We confirm our analytic results via simulation studies and apply these methods to the attendance study, finding substantively meaningful spillover effects.

Submitted 30 April, 2017; v1 submitted 24 August, 2016; originally announced August 2016.

Comments: Accepted for publication in the Journal of the American Statistical Association

arXiv:1606.02682 [pdf, other]

Principal Score Methods: Assumptions and Extensions

Authors: Avi Feller, Fabrizia Mealli, Luke Miratrix

Abstract: Researchers addressing post-treatment complications in randomized trials often turn to principal stratification to define relevant assumptions and quantities of interest. One approach for estimating causal effects in this framework is to use methods based on the "principal score," typically assuming that stratum membership is as-good-as-randomly assigned given a set of covariates. In this paper, we clarify the key assumption in this context, known as Principal Ignorability, and argue that versions of this assumption are quite strong in practice. We describe different estimation approaches and demonstrate that weighting-based methods are generally preferable to subgroup-based approaches that discretize the principal score. We then extend these ideas to the case of two-sided noncompliance and propose a natural framework for combining Principal Ignorability with exclusion restrictions and other assumptions. Finally, we apply these ideas to the Head Start Impact Study, a large-scale randomized evaluation of the Head Start program. Overall, we argue that, while principal score methods are useful tools, applied researchers should fully understand the relevant assumptions when using them in practice.

Submitted 8 June, 2016; originally announced June 2016.

arXiv:1605.06566 [pdf, other]

Decomposing Treatment Effect Variation

Authors: Peng Ding, Avi Feller, Luke Miratrix

Abstract: Understanding and characterizing treatment effect variation in randomized experiments has become essential for going beyond the "black box" of the average treatment effect. Nonetheless, traditional statistical approaches often ignore or assume away such variation. In the context of randomized experiments, this paper proposes a framework for decomposing overall treatment effect variation into a systematic component explained by observed covariates and a remaining idiosyncratic component. Our framework is fully randomization-based, with estimates of treatment effect variation that are entirely justified by the randomization itself. Our framework can also account for noncompliance, which is an important practical complication. We make several contributions. First, we show that randomization-based estimates of systematic variation are very similar in form to estimates from fully-interacted linear regression and two stage least squares. Second, we use these estimators to develop an omnibus test for systematic treatment effect variation, both with and without noncompliance. Third, we propose an $R^2$-like measure of treatment effect variation explained by covariates and, when applicable, noncompliance. Finally, we assess these methods via simulation studies and apply them to the Head Start Impact Study, a large-scale randomized experiment.

Submitted 28 July, 2017; v1 submitted 20 May, 2016; originally announced May 2016.

arXiv:1602.06595 [pdf, other]

Weak separation in mixture models and implications for principal stratification

Authors: Avi Feller, Evan Greif, Nhat Ho, Luke Miratrix, Natesh Pillai

Abstract: Principal stratification is a widely used framework for addressing post-randomization complications. After using principal stratification to define causal effects of interest, researchers are increasingly turning to finite mixture models to estimate these quantities. Unfortunately, standard estimators of mixture parameters, like the MLE, are known to exhibit pathological behavior. We study this behavior in a simple but fundamental example, a two-component Gaussian mixture model in which only the component means and variances are unknown, and focus on the setting in which the components are weakly separated. In this case, we show that the asymptotic convergence rate of the MLE is quite poor, such as $O(n^{-1/6})$ or even $O(n^{-1/8})$. We then demonstrate via theoretical arguments as well as extensive simulations that, in finite samples, the MLE behaves like a threshold estimator, in the sense that the MLE can give strong evidence that the means are equal when the truth is otherwise. We also explore the behavior of the MLE when the MLE is non-zero, showing that it is difficult to estimate both the sign and magnitude of the means in this case. We provide diagnostics for all of these pathologies and apply these ideas to re-analyzing two randomized evaluations of job training programs, JOBS II and Job Corps. Our results suggest that the corresponding maximum likelihood estimates should be interpreted with caution in these cases.

Submitted 17 August, 2019; v1 submitted 21 February, 2016; originally announced February 2016.

arXiv:1507.02739 [pdf, other]

Design of the Millennium Villages Project Sampling Plan: a simulation study for a multi-module survey

Authors: Shira Mitchell, Rebecca Ross, Susanna Makela, Elizabeth A. Stuart, Avi Feller, Alan M. Zaslavsky, Andrew Gelman

Abstract: The Millennium Villages Project (MVP) is a ten-year integrated rural development project implemented in ten sub-Saharan African sites. At its conclusion we will conduct an evaluation of its causal effect on a variety of development outcomes, measured via household surveys in treatment and comparison areas. Outcomes are measured by six survey modules, with sample sizes for each demographic group determined by budget, logistics, and the group's vulnerability. We design a sampling plan that aims to reduce effort for survey enumerators and maximize precision for all outcomes. We propose two-stage sampling designs, sampling households at the first stage, followed by a second stage sample that differs across demographic groups. Two-stage designs are usually constructed by simple random sampling (SRS) of households and proportional within-household sampling, or probability proportional to size sampling (PPS) of households with fixed sampling within each. No measure of household size is proportional for all demographic groups, putting PPS schemes at a disadvantage. The SRS schemes have the disadvantage that multiple individuals sampled per household decreases efficiency due to intra-household correlation. We conduct a simulation study (using both design- and model-based survey inference) to understand these tradeoffs and recommend a sampling plan for the Millennium Villages Project. Similar design issues arise in other studies with surveys that target different demographic groups.

Submitted 9 July, 2015; originally announced July 2015.

arXiv:1412.5000 [pdf, other]

Randomization Inference for Treatment Effect Variation

Authors: Peng Ding, Avi Feller, Luke Miratrix

Abstract: Applied researchers are increasingly interested in whether and how treatment effects vary in randomized evaluations, especially variation not explained by observed covariates. We propose a model-free approach for testing for the presence of such unexplained variation. To use this randomization-based approach, we must address the fact that the average treatment effect, generally the object of interest in randomized experiments, actually acts as a nuisance parameter in this setting. We explore potential solutions and advocate for a method that guarantees valid tests in finite samples despite this nuisance. We also show how this method readily extends to testing for heterogeneity beyond a given model, which can be useful for assessing the sufficiency of a given scientific theory. We finally apply our method to the National Head Start Impact Study, a large-scale randomized evaluation of a Federal preschool program, finding that there is indeed significant unexplained treatment effect variation.

Submitted 16 December, 2014; originally announced December 2014.

Showing 1–36 of 36 results for author: Feller, A