-
Intervention effects based on potential benefit
Authors:
Alexander W. Levis,
Eli Ben-Michael,
Edward H. Kennedy
Abstract:
Optimal treatment rules are mappings from individual patient characteristics to tailored treatment assignments that maximize mean outcomes. In this work, we introduce a conditional potential benefit (CPB) metric that measures the expected improvement under an optimally chosen treatment compared to the status quo, within covariate strata. The potential benefit combines (i) the magnitude of the treatment effect, and (ii) the propensity for subjects to naturally select a suboptimal treatment. As a consequence, heterogeneity in the CPB can provide key insights into the mechanism by which a treatment acts and/or highlight potential barriers to treatment access or adverse effects. Moreover, we demonstrate that CPB is the natural prioritization score for individualized treatment policies when intervention capacity is constrained. That is, in the resource-limited setting where treatment options are freely accessible, but the ability to intervene on a portion of the target population is constrained (e.g., if the population is large, and follow-up and encouragement of treatment uptake is labor-intensive), targeting subjects with highest CPB maximizes the mean outcome. Focusing on this resource-limited setting, we derive formulas for optimal constrained treatment rules, and for any given budget, quantify the loss compared to the optimal unconstrained rule. We describe sufficient identification assumptions, and propose nonparametric, robust, and efficient estimators of the proposed quantities emerging from our framework.
Submitted 14 May, 2024;
originally announced May 2024.
-
Statistical methods to estimate the impact of gun policy on gun violence
Authors:
Eli Ben-Michael,
Mitchell L. Doucette,
Avi Feller,
Alexander D. McCourt,
Elizabeth A. Stuart
Abstract:
Gun violence is a critical public health and safety concern in the United States. There is considerable variability in policy proposals meant to curb gun violence, ranging from increasing gun availability to deter potential assailants (e.g., concealed carry laws or arming school teachers) to restricting access to firearms (e.g., universal background checks or banning assault weapons). Many studies use state-level variation in the enactment of these policies in order to quantify their effect on gun violence. In this paper, we discuss the policy trial emulation framework for evaluating the impact of these policies, and show how to apply this framework to estimating impacts via difference-in-differences and synthetic controls when there is staggered adoption of policies across jurisdictions, estimating the impacts of right-to-carry laws on violent crime as a case study.
Submitted 17 April, 2024;
originally announced April 2024.
-
Does AI help humans make better decisions? A methodological framework for experimental evaluation
Authors:
Eli Ben-Michael,
D. James Greiner,
Melody Huang,
Kosuke Imai,
Zhichao Jiang,
Sooahn Shin
Abstract:
The use of Artificial Intelligence (AI) based on data-driven algorithms has become ubiquitous in today's society. Yet, in many cases and especially when stakes are high, humans still make final decisions. The critical question, therefore, is whether AI helps humans make better decisions as compared to a human alone or AI alone. We introduce a new methodological framework that can be used to answer this question experimentally with no additional assumptions. We measure a decision maker's ability to make correct decisions using standard classification metrics based on the baseline potential outcome. We consider a single-blinded experimental design, in which the provision of AI-generated recommendations is randomized across cases with a human making final decisions. Under this experimental design, we show how to compare the performance of three alternative decision-making systems--human-alone, human-with-AI, and AI-alone. We apply the proposed methodology to the data from our own randomized controlled trial of a pretrial risk assessment instrument. We find that AI recommendations do not improve the classification accuracy of a judge's decision to impose cash bail. Our analysis also shows that AI-alone decisions generally perform worse than human decisions with or without AI assistance. Finally, AI recommendations tend to impose cash bail on non-white arrestees more often than necessary when compared to white arrestees.
Submitted 17 March, 2024;
originally announced March 2024.
-
Optimizing Language Models for Human Preferences is a Causal Inference Problem
Authors:
Victoria Lin,
Eli Ben-Michael,
Louis-Philippe Morency
Abstract:
As large language models (LLMs) see greater use in academic and commercial settings, there is increasing interest in methods that allow language models to generate texts aligned with human preferences. In this paper, we present an initial exploration of language model optimization for human preferences from direct outcome datasets, where each sample consists of a text and an associated numerical outcome measuring the reader's response. We first propose that language model optimization should be viewed as a causal problem to ensure that the model correctly learns the relationship between the text and the outcome. We formalize this causal language optimization problem, and we develop a method--causal preference optimization (CPO)--that solves an unbiased surrogate objective for the problem. We further extend CPO with doubly robust CPO (DR-CPO), which reduces the variance of the surrogate objective while retaining provably strong guarantees on bias. Finally, we empirically demonstrate the effectiveness of (DR-)CPO in optimizing state-of-the-art LLMs for human preferences on direct outcome data, and we validate the robustness of DR-CPO under difficult confounding conditions.
Submitted 5 June, 2024; v1 submitted 22 February, 2024;
originally announced February 2024.
-
Temporal Aggregation for the Synthetic Control Method
Authors:
Liyang Sun,
Eli Ben-Michael,
Avi Feller
Abstract:
The synthetic control method (SCM) is a popular approach for estimating the impact of a treatment on a single unit with panel data. Two challenges arise with higher frequency data (e.g., monthly versus yearly): (1) achieving excellent pre-treatment fit is typically more challenging; and (2) overfitting to noise is more likely. Aggregating data over time can mitigate these problems but can also destroy important signal. In this paper, we bound the bias for SCM with disaggregated and aggregated outcomes and give conditions under which aggregating tightens the bounds. We then propose finding weights that balance both disaggregated and aggregated series.
Submitted 15 April, 2024; v1 submitted 22 January, 2024;
originally announced January 2024.
-
Using Multiple Outcomes to Improve the Synthetic Control Method
Authors:
Liyang Sun,
Eli Ben-Michael,
Avi Feller
Abstract:
When there are multiple outcome series of interest, Synthetic Control analyses typically proceed by estimating separate weights for each outcome. In this paper, we instead propose estimating a common set of weights across outcomes, by balancing either a vector of all outcomes or an index or average of them. Under a low-rank factor model, we show that these approaches lead to lower bias bounds than separate weights, and that averaging leads to further gains when the number of outcomes grows. We illustrate this via simulation and in a re-analysis of the impact of the Flint water crisis on educational outcomes.
Submitted 27 November, 2023;
originally announced November 2023.
-
Text-Transport: Toward Learning Causal Effects of Natural Language
Authors:
Victoria Lin,
Louis-Philippe Morency,
Eli Ben-Michael
Abstract:
As language technologies gain prominence in real-world settings, it is important to understand how changes to language affect reader perceptions. This can be formalized as the causal effect of varying a linguistic attribute (e.g., sentiment) on a reader's response to the text. In this paper, we introduce Text-Transport, a method for estimation of causal effects from natural language under any text distribution. Current approaches for valid causal effect estimation require strong assumptions about the data, meaning the data from which one can estimate valid causal effects often is not representative of the actual target domain of interest. To address this issue, we leverage the notion of distribution shift to describe an estimator that transports causal effects between domains, bypassing the need for strong assumptions in the target domain. We derive statistical guarantees on the uncertainty of this estimator, and we report empirical results and analyses that support the validity of Text-Transport across data settings. Finally, we use Text-Transport to study a realistic setting--hate speech on social media--in which causal effects do shift significantly between text domains, demonstrating the necessity of transport when conducting causal inference on natural language.
Submitted 31 October, 2023;
originally announced October 2023.
-
Bayesian Safe Policy Learning with Chance Constrained Optimization: Application to Military Security Assessment during the Vietnam War
Authors:
Zeyang Jia,
Eli Ben-Michael,
Kosuke Imai
Abstract:
Algorithmic decisions and recommendations are used in many high-stakes decision-making settings such as criminal justice, medicine, and public policy. We investigate whether it would have been possible to improve a security assessment algorithm employed during the Vietnam War, using outcomes measured immediately after its introduction in late 1969. This empirical application raises several methodological challenges that frequently arise in high-stakes algorithmic decision-making. First, before implementing a new algorithm, it is essential to characterize and control the risk of yielding worse outcomes than the existing algorithm. Second, the existing algorithm is deterministic, and learning a new algorithm requires transparent extrapolation. Third, the existing algorithm involves discrete decision tables that are difficult to optimize over.
To address these challenges, we introduce the Average Conditional Risk (ACRisk), which first quantifies the risk that a new algorithmic policy leads to worse outcomes for subgroups of individual units and then averages this over the distribution of subgroups. We also propose a Bayesian policy learning framework that maximizes the posterior expected value while controlling the posterior expected ACRisk. This framework separates the estimation of heterogeneous treatment effects from policy optimization, enabling flexible estimation of effects and optimization over complex policy classes. We characterize the resulting chance-constrained optimization problem as a constrained linear programming problem. Our analysis shows that compared to the actual algorithm used during the Vietnam War, the learned algorithm assesses most regions as more secure and emphasizes economic and political factors over military factors.
Submitted 27 May, 2024; v1 submitted 17 July, 2023;
originally announced July 2023.
-
Approximate Balancing Weights for Clustered Observational Study Designs
Authors:
Luke Keele,
Eli Ben-Michael,
Lindsay Page
Abstract:
In a clustered observational study, a treatment is assigned to groups and all units within the group are exposed to the treatment. We develop a new method for statistical adjustment in clustered observational studies using approximate balancing weights, a generalization of inverse propensity score weights that solve a convex optimization problem to find a set of weights that directly minimize a measure of covariate imbalance, subject to an additional penalty on the variance of the weights. We tailor the approximate balancing weights optimization problem to both adjustment sets by deriving an upper bound on the mean square error for each case and finding weights that minimize this upper bound, linking the level of covariate balance to a bound on the bias. We implement the procedure by specializing the bound to a random cluster-level effects model, leading to a variance penalty that incorporates the signal-to-noise ratio and penalizes the weight on individuals and the total weight on groups differently according to the intra-class correlation.
Submitted 3 March, 2023; v1 submitted 12 January, 2023;
originally announced January 2023.
-
Using Balancing Weights to Target the Treatment Effect on the Treated when Overlap is Poor
Authors:
Eli Ben-Michael,
Luke Keele
Abstract:
Inverse probability weights are commonly used in epidemiology to estimate causal effects in observational studies. Researchers can typically focus on either the average treatment effect or the average treatment effect on the treated with inverse probability weighting estimators. However, when overlap between the treated and control groups is poor, this can produce extreme weights that can result in biased estimates and large variances. One alternative to inverse probability weights is overlap weights, which target the population with the most overlap on observed characteristics. While estimates based on overlap weights produce less bias in such contexts, the causal estimand can be difficult to interpret. Another alternative to inverse probability weights is balancing weights, which directly target imbalances during the estimation process. Here, we explore whether balancing weights allow analysts to target the average treatment effect on the treated in cases where inverse probability weights are biased due to poor overlap. We conduct three simulation studies and an empirical application. We find that in many cases, balancing weights allow the analyst to still target the average treatment effect on the treated even when overlap is poor. We show that while overlap weights remain a key tool for estimating causal effects, more familiar estimands can be targeted by using balancing weights instead of inverse probability weights.
Submitted 4 October, 2022;
originally announced October 2022.
-
Estimating Racial Disparities in Emergency General Surgery
Authors:
Eli Ben-Michael,
Avi Feller,
Rachel Kelz,
Luke Keele
Abstract:
Research documents that Black patients experience worse general surgery outcomes than white patients in the United States. In this paper, we focus on an important but less-examined category: the surgical treatment of emergency general surgery (EGS) conditions, which refers to medical emergencies where the injury is "endogenous," such as a burst appendix. Our goal is to assess racial disparities for common outcomes after EGS treatment using an administrative database of hospital claims in New York, Florida, and Pennsylvania, and to understand the extent to which differences are attributable to patient-level risk factors versus hospital-level factors. To do so, we use a class of linear weighting estimators that re-weight white patients to have a similar distribution of baseline characteristics as Black patients. This framework nests many common approaches, including matching and linear regression, but offers important advantages over these methods in terms of controlling imbalance between groups, minimizing extrapolation, and reducing computation time. Applying this approach to the claims data, we find that disparities estimates that adjust for the admitting hospital are substantially smaller than estimates that adjust for patient baseline characteristics only, suggesting that hospital-specific factors are important drivers of racial disparities in EGS outcomes.
Submitted 9 November, 2023; v1 submitted 9 September, 2022;
originally announced September 2022.
-
Safe Policy Learning under Regression Discontinuity Designs with Multiple Cutoffs
Authors:
Yi Zhang,
Eli Ben-Michael,
Kosuke Imai
Abstract:
The regression discontinuity (RD) design is widely used for program evaluation with observational data. The primary focus of the existing literature has been the estimation of the local average treatment effect at the existing treatment cutoff. In contrast, we consider policy learning under the RD design. Because the treatment assignment mechanism is deterministic, learning better treatment cutoffs requires extrapolation. We develop a robust optimization approach to finding optimal treatment cutoffs that improve upon the existing ones. We first decompose the expected utility into point-identifiable and unidentifiable components. We then propose an efficient doubly-robust estimator for the identifiable parts. To account for the unidentifiable components, we leverage the existence of multiple cutoffs that are common under the RD design. Specifically, we assume that the heterogeneity in the conditional expectations of potential outcomes across different groups varies smoothly along the running variable. Under this assumption, we minimize the worst case utility loss relative to the status quo policy. The resulting new treatment cutoffs have a safety guarantee that they will not yield a worse overall outcome than the existing cutoffs. Finally, we establish the asymptotic regret bounds for the learned policy using semi-parametric efficiency theory. We apply the proposed methodology to empirical and simulated data sets.
Submitted 8 July, 2023; v1 submitted 28 August, 2022;
originally announced August 2022.
-
Policy Learning with Asymmetric Counterfactual Utilities
Authors:
Eli Ben-Michael,
Kosuke Imai,
Zhichao Jiang
Abstract:
Data-driven decision making plays an important role even in high stakes settings like medicine and public policy. Learning optimal policies from observed data requires a careful formulation of the utility function whose expected value is maximized across a population. Although researchers typically use utilities that depend on observed outcomes alone, in many settings the decision maker's utility function is more properly characterized by the joint set of potential outcomes under all actions. For example, the Hippocratic principle to "do no harm" implies that the cost of causing death to a patient who would otherwise survive without treatment is greater than the cost of forgoing life-saving treatment. We consider optimal policy learning with asymmetric counterfactual utility functions of this form that consider the joint set of potential outcomes. We show that asymmetric counterfactual utilities lead to an unidentifiable expected utility function, and so we first partially identify it. Drawing on statistical decision theory, we then derive minimax decision rules by minimizing the maximum expected utility loss relative to different alternative policies. We show that one can learn minimax loss decision rules from observed data by solving intermediate classification problems, and establish that the finite sample excess expected utility loss of this procedure is bounded by the regret of these intermediate classifiers. We apply this conceptual framework and methodology to the decision about whether or not to use right heart catheterization for patients with possible pulmonary hypertension.
Submitted 28 November, 2023; v1 submitted 21 June, 2022;
originally announced June 2022.
-
The Balancing Act in Causal Inference
Authors:
Eli Ben-Michael,
Avi Feller,
David A. Hirshberg,
José R. Zubizarreta
Abstract:
The idea of covariate balance is at the core of causal inference. Inverse propensity weights play a central role because they are the unique set of weights that balance the covariate distributions of different treatment groups. We discuss two broad approaches to estimating these weights: the more traditional one, which fits a propensity score model and then uses the reciprocal of the estimated propensity score to construct weights, and the balancing approach, which estimates the inverse propensity weights essentially by the method of moments, finding weights that achieve balance in the sample. We review ideas from the causal inference, sample surveys, and semiparametric estimation literatures, with particular attention to the role of balance as a sufficient condition for robust inference. We focus on the inverse propensity weighting and augmented inverse propensity weighting estimators for the average treatment effect given strong ignorability and consider generalizations for a broader class of problems including policy evaluation and the estimation of individualized treatment effects.
Submitted 27 October, 2021;
originally announced October 2021.
-
Estimating the effects of a California gun control program with Multitask Gaussian Processes
Authors:
Eli Ben-Michael,
David Arbour,
Avi Feller,
Alex Franks,
Steven Raphael
Abstract:
Gun violence is a critical public safety concern in the United States. In 2006 California implemented a unique firearm monitoring program, the Armed and Prohibited Persons System (APPS), to address gun violence in the state. The APPS program first identifies those firearm owners who become prohibited from owning one due to federal or state law, then confiscates their firearms. Our goal is to assess the effect of APPS on California murder rates using annual, state-level crime data across the US for the years before and after the introduction of the program. To do so, we adapt a non-parametric Bayesian approach, multitask Gaussian Processes (MTGPs), to the panel data setting. MTGPs allow for flexible and parsimonious panel data models that nest many existing approaches and allow for direct control over both dependence across time and dependence across units, as well as natural uncertainty quantification. We extend this approach to incorporate non-Normal outcomes, auxiliary covariates, and multiple outcome series, which are all important in our application. We also show that this approach has attractive Frequentist properties, including a representation as a weighting estimator with separate weights over units and time periods. Applying this approach, we find that the increased monitoring and enforcement from the APPS program substantially decreased homicides in California. We also find that the effect on murder is driven entirely by declines in gun-related murder with no measurable effect on non-gun murder. Estimated costs per murder avoided are substantially lower than conventional estimates of the value of a statistical life, suggesting a very high benefit-cost ratio for this enforcement effort.
Submitted 8 June, 2022; v1 submitted 13 October, 2021;
originally announced October 2021.
-
Safe Policy Learning through Extrapolation: Application to Pre-trial Risk Assessment
Authors:
Eli Ben-Michael,
D. James Greiner,
Kosuke Imai,
Zhichao Jiang
Abstract:
Algorithmic recommendations and decisions have become ubiquitous in today's society. Many of these and other data-driven policies, especially in the realm of public policy, are based on known, deterministic rules to ensure their transparency and interpretability. For example, algorithmic pre-trial risk assessments, which serve as our motivating application, provide relatively simple, deterministic classification scores and recommendations to help judges make release decisions. How can we use the data based on existing deterministic policies to learn new and better policies? Unfortunately, prior methods for policy learning are not applicable because they require existing policies to be stochastic rather than deterministic. We develop a robust optimization approach that partially identifies the expected utility of a policy, and then finds an optimal policy by minimizing the worst-case regret. The resulting policy is conservative but has a statistical safety guarantee, allowing the policy-maker to limit the probability of producing a worse outcome than the existing policy. We extend this approach to common and important settings where humans make decisions with the aid of algorithmic recommendations. Lastly, we apply the proposed methodology to a unique field experiment on pre-trial risk assessment instruments. We derive new classification and recommendation rules that retain the transparency and interpretability of the existing instrument while potentially leading to better overall outcomes at a lower cost.
Submitted 15 February, 2022; v1 submitted 21 September, 2021;
originally announced September 2021.
-
Is it who you are or where you are? Accounting for compositional differences in cross-site treatment variation
Authors:
Benjamin Lu,
Eli Ben-Michael,
Avi Feller,
Luke Miratrix
Abstract:
Multisite trials, in which treatment is randomized separately in multiple sites, offer a unique opportunity to disentangle treatment effect variation due to "compositional" differences in the distributions of unit-level features from variation due to "contextual" differences in site-level features. In particular, if we can re-weight (or "transport") each site to have a common distribution of unit-level covariates, the remaining effect variation captures contextual differences across sites. In this paper, we develop a framework for transporting effects in multisite trials using approximate balancing weights, where the weights are chosen to directly optimize unit-level covariate balance between each site and the target distribution. We first develop our approach for the general setting of transporting the effect of a single-site trial. We then extend our method to multisite trials, assess its performance via simulation, and use it to analyze a series of multisite trials of welfare-to-work programs. Our method is available in the balancer R package.
Submitted 26 March, 2021;
originally announced March 2021.
-
Interpretable Sensitivity Analysis for Balancing Weights
Authors:
Dan Soriano,
Eli Ben-Michael,
Peter J. Bickel,
Avi Feller,
Samuel D. Pimentel
Abstract:
Assessing sensitivity to unmeasured confounding is an important step in observational studies, which typically estimate effects under the assumption that all confounders are measured. In this paper, we develop a sensitivity analysis framework for balancing weights estimators, an increasingly popular approach that solves an optimization problem to obtain weights that directly minimize covariate imbalance. In particular, we adapt a sensitivity analysis framework using the percentile bootstrap for a broad class of balancing weights estimators. We prove that the percentile bootstrap procedure can, with only minor modifications, yield valid confidence intervals for causal effects under restrictions on the level of unmeasured confounding. We also propose an amplification to allow for interpretable sensitivity parameters in the balancing weights framework. We illustrate our method through extensive real data examples.
Submitted 31 August, 2023; v1 submitted 25 February, 2021;
originally announced February 2021.
-
Multilevel calibration weighting for survey data
Authors:
Eli Ben-Michael,
Avi Feller,
Erin Hartman
Abstract:
In the November 2016 U.S. presidential election, many state-level public opinion polls, particularly in the Upper Midwest, incorrectly predicted the winning candidate. One leading explanation for this polling miss is that the precipitous decline in traditional polling response rates led to greater reliance on statistical methods to adjust for the corresponding bias -- and that these methods failed to adjust for important interactions between key variables like education, race, and geographic region. Finding calibration weights that account for important interactions remains challenging with traditional survey methods: raking typically balances the margins alone, while post-stratification, which exactly balances all interactions, is only feasible for a small number of variables. In this paper, we propose multilevel calibration weighting, which enforces tight balance constraints for marginal balance and looser constraints for higher-order interactions. This incorporates some of the benefits of post-stratification while retaining the guarantees of raking. We then correct for the bias due to the relaxed constraints via a flexible outcome model; we call this approach Double Regression with Post-stratification (DRP). We characterize the asymptotic properties of these estimators and show that the proposed calibration approach has a dual representation as a multilevel model for survey response. We then use these tools to re-assess a large-scale survey of voter intention in the 2016 U.S. presidential election, finding meaningful gains from the proposed methods. The approach is available in the multical R package.
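The statement that raking "balances the margins alone" can be illustrated with a minimal sketch of classical raking (iterative proportional fitting) on a two-way table. This is not the multilevel calibration method of the paper and not the multical package API; the function name `rake`, the toy 2x2 table, and the target margins are all hypothetical, chosen only for illustration.

```python
def rake(table, row_targets, col_targets, iters=1000, tol=1e-10):
    """Classical raking on a 2-way table of sample weights (illustrative only).

    Alternately rescales rows and columns until both sets of margins
    match their targets. Interaction cells are NOT matched directly.
    """
    w = [row[:] for row in table]  # copy so the input is left untouched
    for _ in range(iters):
        # Step 1: rescale each row to hit its target row total.
        for i, row in enumerate(w):
            s = sum(row)
            w[i] = [x * row_targets[i] / s for x in row]
        # Step 2: rescale each column to hit its target column total.
        for j, target in enumerate(col_targets):
            s = sum(row[j] for row in w)
            for row in w:
                row[j] *= target / s
        # Columns are exact after step 2; stop once rows also match.
        err = max(abs(sum(row) - t) for row, t in zip(w, row_targets))
        if err < tol:
            break
    return w


# Hypothetical sample counts: rows = education level, columns = region.
sample = [[30.0, 20.0], [10.0, 40.0]]
raked = rake(sample, row_targets=[40.0, 60.0], col_targets=[50.0, 50.0])
```

In this sketch only the row and column totals are constrained, so the education-by-region cells of `raked` need not match the population cells; that is the gap post-stratification closes when the number of cells is small enough to be feasible.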
Submitted 12 November, 2021; v1 submitted 17 February, 2021;
originally announced February 2021.
-
A trial emulation approach for policy evaluations with group-level longitudinal data
Authors:
Eli Ben-Michael,
Avi Feller,
Elizabeth A. Stuart
Abstract:
To limit the spread of the novel coronavirus, governments across the world implemented extraordinary physical distancing policies, such as stay-at-home orders, and numerous studies aim to estimate their effects. Many statistical and econometric methods, such as difference-in-differences, leverage repeated measurements and variation in timing to estimate policy effects, including in the COVID-19 context. While these methods are less common in epidemiology, epidemiologic researchers are well accustomed to handling similar complexities in studies of individual-level interventions. "Target trial emulation" emphasizes the need to carefully design a non-experimental study in terms of inclusion and exclusion criteria, covariates, exposure definition, and outcome measurement -- and the timing of those variables. We argue that policy evaluations using group-level longitudinal ("panel") data need to take a similar careful approach to study design, which we refer to as "policy trial emulation." This is especially important when intervention timing varies across jurisdictions; the main idea is to construct target trials separately for each "treatment cohort" (states that implement the policy at the same time) and then aggregate. We present a stylized analysis of the impact of state-level stay-at-home orders on total coronavirus cases. We argue that estimates from panel methods -- with the right data and careful modeling and diagnostics -- can help add to our understanding of many policies, though doing so is often challenging.
Submitted 11 November, 2020;
originally announced November 2020.
-
Varying impacts of letters of recommendation on college admissions: Approximate balancing weights for subgroup effects in observational studies
Authors:
Eli Ben-Michael,
Avi Feller,
Jesse Rothstein
Abstract:
In a pilot program during the 2016-17 admissions cycle, the University of California, Berkeley invited many applicants for freshman admission to submit letters of recommendation. We use this pilot as the basis for an observational study of the impact of submitting letters of recommendation on subsequent admission, with the goal of estimating how impacts vary across pre-defined subgroups. Understanding this variation is challenging in observational studies, however, because estimated impacts reflect both actual treatment effect variation and differences in covariate balance across groups. To address this, we develop balancing weights that directly optimize for "local balance" within subgroups while maintaining global covariate balance between treated and control units. We then show that this approach has a dual representation as a form of inverse propensity score weighting with a hierarchical propensity score model. In the UC Berkeley pilot study, our proposed approach yields excellent local and global balance, unlike more traditional weighting methods, which fail to balance covariates within subgroups. We find that the impact of letters of recommendation increases with the predicted probability of admission, with mixed evidence of differences for under-represented minority applicants.
Submitted 22 February, 2021; v1 submitted 10 August, 2020;
originally announced August 2020.
-
Hospital Quality Risk Standardization via Approximate Balancing Weights
Authors:
Luke Keele,
Eli Ben-Michael,
Avi Feller,
Rachel Kelz,
Luke Miratrix
Abstract:
Comparing outcomes across hospitals, often to identify underperforming hospitals, is a critical task in health services research. However, naive comparisons of average outcomes, such as surgery complication rates, can be misleading because hospital case mixes differ -- a hospital's overall complication rate may be lower due to more effective treatments or simply because the hospital serves a healthier population overall. In this paper, we develop a method of "direct standardization" where we re-weight each hospital patient population to be representative of the overall population and then compare the weighted averages across hospitals. Adapting methods from survey sampling and causal inference, we find weights that directly control for imbalance between the hospital patient mix and the target population, even across many patient attributes. Critically, these balancing weights can also be tuned to preserve sample size for more precise estimates. We also derive principled measures of statistical precision, and use outcome modeling and Bayesian shrinkage to increase precision and account for variation in hospital size. We demonstrate these methods using claims data from Pennsylvania, Florida, and New York, estimating standardized hospital complication rates for general surgery patients. We conclude with a discussion of how to detect low-performing hospitals.
Submitted 15 February, 2021; v1 submitted 17 July, 2020;
originally announced July 2020.
-
Synthetic Controls with Staggered Adoption
Authors:
Eli Ben-Michael,
Avi Feller,
Jesse Rothstein
Abstract:
Staggered adoption of policies by different units at different times creates promising opportunities for observational causal inference. Estimation remains challenging, however, and common regression methods can give misleading results. A promising alternative is the synthetic control method (SCM), which finds a weighted average of control units that closely balances the treated unit's pre-treatment outcomes. In this paper, we generalize SCM, originally designed to study a single treated unit, to the staggered adoption setting. We first bound the error for the average effect and show that it depends on both the imbalance for each treated unit separately and the imbalance for the average of the treated units. We then propose "partially pooled" SCM weights to minimize a weighted combination of these measures; approaches that focus only on balancing one of the two components can lead to bias. We extend this approach to incorporate unit-level intercept shifts and auxiliary covariates. We assess the performance of the proposed method via extensive simulations and apply our results to the question of whether teacher collective bargaining leads to higher school spending, finding minimal impacts. We implement the proposed method in the augsynth R package.
Submitted 15 January, 2021; v1 submitted 6 December, 2019;
originally announced December 2019.
-
The Augmented Synthetic Control Method
Authors:
Eli Ben-Michael,
Avi Feller,
Jesse Rothstein
Abstract:
The synthetic control method (SCM) is a popular approach for estimating the impact of a treatment on a single unit in panel data settings. The "synthetic control" is a weighted average of control units that balances the treated unit's pre-treatment outcomes as closely as possible. A critical feature of the original proposal is to use SCM only when the fit on pre-treatment outcomes is excellent. We propose Augmented SCM as an extension of SCM to settings where such pre-treatment fit is infeasible. Analogous to bias correction for inexact matching, Augmented SCM uses an outcome model to estimate the bias due to imperfect pre-treatment fit and then de-biases the original SCM estimate. Our main proposal, which uses ridge regression as the outcome model, directly controls pre-treatment fit while minimizing extrapolation from the convex hull. This estimator can also be expressed as a solution to a modified synthetic controls problem that allows negative weights on some donor units. We bound the estimation error of this approach under different data generating processes, including a linear factor model, and show how regularization helps to avoid over-fitting to noise. We demonstrate gains from Augmented SCM with extensive simulation studies and apply this framework to estimate the impact of the 2012 Kansas tax cuts on economic growth. We implement the proposed method in the new augsynth R package.
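The description of the basic synthetic control as "a weighted average of control units that balances the treated unit's pre-treatment outcomes as closely as possible" can be sketched as a least-squares problem over non-negative weights that sum to one. The sketch below covers only that baseline step, not the ridge augmentation proposed in the paper and not the augsynth API. The solver (Frank-Wolfe with exact line search) is a choice made here for brevity, and the names `scm_weights` and `pre_fit_loss` and the toy data are hypothetical.

```python
def pre_fit_loss(w, treated, donors):
    """Squared pre-treatment imbalance between treated unit and synthetic control."""
    T = len(treated)
    synth = [sum(w[j] * donors[j][t] for j in range(len(donors))) for t in range(T)]
    return sum((synth[t] - treated[t]) ** 2 for t in range(T))


def scm_weights(treated, donors, iters=2000):
    """Simplex-constrained least squares via Frank-Wolfe (illustrative only).

    treated: pre-treatment outcomes of the treated unit (length T).
    donors:  list of J donor outcome series, each of length T.
    Returns J non-negative weights that sum to one.
    """
    J, T = len(donors), len(treated)
    w = [1.0 / J] * J  # start from equal weights
    for _ in range(iters):
        synth = [sum(w[j] * donors[j][t] for j in range(J)) for t in range(T)]
        resid = [synth[t] - treated[t] for t in range(T)]
        # Gradient of the squared imbalance with respect to each weight.
        grad = [2.0 * sum(resid[t] * donors[j][t] for t in range(T)) for j in range(J)]
        k = min(range(J), key=grad.__getitem__)  # best single donor to move toward
        d = [donors[k][t] - synth[t] for t in range(T)]  # direction in outcome space
        dd = sum(x * x for x in d)
        if dd == 0.0:
            break
        # Exact line search for a quadratic objective, clipped to [0, 1].
        g = -sum(r * x for r, x in zip(resid, d)) / dd
        g = max(0.0, min(1.0, g))
        if g == 0.0:
            break  # no descent direction left: current weights are optimal
        w = [(1.0 - g) * wj for wj in w]
        w[k] += g
    return w


# Hypothetical data: three donors, three pre-treatment periods.
treated = [2.0, 2.0, 2.0]
donors = [[1.0, 2.0, 3.0], [2.0, 2.0, 2.0], [5.0, 1.0, 0.0]]
weights = scm_weights(treated, donors)
```

When no convex combination of donors fits the treated unit well, `pre_fit_loss` stays large at the solution; that is the setting in which the abstract's outcome-model bias correction is meant to help.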
Submitted 23 July, 2020; v1 submitted 9 November, 2018;
originally announced November 2018.