-
Designing Algorithmic Recommendations to Achieve Human-AI Complementarity
Authors:
Bryce McLaughlin,
Jann Spiess
Abstract:
Algorithms frequently assist, rather than replace, human decision-makers. However, the design and analysis of algorithms often focus on predicting outcomes and do not explicitly model their effect on human decisions. This discrepancy between the design and role of algorithmic assistants becomes of particular concern in light of empirical evidence that suggests that algorithmic assistants again and…
▽ More
Algorithms frequently assist, rather than replace, human decision-makers. However, the design and analysis of algorithms often focus on predicting outcomes and do not explicitly model their effect on human decisions. This discrepancy between the design and role of algorithmic assistants becomes of particular concern in light of empirical evidence that suggests that algorithmic assistants again and again fail to improve human decisions. In this article, we formalize the design of recommendation algorithms that assist human decision-makers without making restrictive ex-ante assumptions about how recommendations affect decisions. We formulate an algorithmic-design problem that leverages the potential-outcomes framework from causal inference to model the effect of recommendations on a human decision-maker's binary treatment choice. Within this model, we introduce a monotonicity assumption that leads to an intuitive classification of human responses to the algorithm. Under this monotonicity assumption, we can express the human's response to algorithmic recommendations in terms of their compliance with the algorithm and the decision they would take if the algorithm sends no recommendation. We showcase the utility of our framework using an online experiment that simulates a hiring task. We argue that our approach explains the relative performance of different recommendation algorithms in the experiment, and can help design solutions that realize human-AI complementarity.
△ Less
Submitted 2 May, 2024;
originally announced May 2024.
-
Personalized Assignment to One of Many Treatment Arms via Regularized and Clustered Joint Assignment Forests
Authors:
Rahul Ladhania,
Jann Spiess,
Lyle Ungar,
Wenbo Wu
Abstract:
We consider learning personalized assignments to one of many treatment arms from a randomized controlled trial. Standard methods that estimate heterogeneous treatment effects separately for each arm may perform poorly in this case due to excess variance. We instead propose methods that pool information across treatment arms: First, we consider a regularized forest-based assignment algorithm based…
▽ More
We consider learning personalized assignments to one of many treatment arms from a randomized controlled trial. Standard methods that estimate heterogeneous treatment effects separately for each arm may perform poorly in this case due to excess variance. We instead propose methods that pool information across treatment arms: First, we consider a regularized forest-based assignment algorithm based on greedy recursive partitioning that shrinks effect estimates across arms. Second, we augment our algorithm by a clustering scheme that combines treatment arms with consistently similar outcomes. In a simulation study, we compare the performance of these approaches to predicting arm-wise outcomes separately, and document gains of directly optimizing the treatment assignment with regularization and clustering. In a theoretical model, we illustrate how a high number of treatment arms makes finding the best arm hard, while we can achieve sizable utility gains from personalization by regularized optimization.
△ Less
Submitted 1 November, 2023;
originally announced November 2023.
-
Machine Learning Who to Nudge: Causal vs Predictive Targeting in a Field Experiment on Student Financial Aid Renewal
Authors:
Susan Athey,
Niall Keleher,
Jann Spiess
Abstract:
In many settings, interventions may be more effective for some individuals than others, so that targeting interventions may be beneficial. We analyze the value of targeting in the context of a large-scale field experiment with over 53,000 college students, where the goal was to use "nudges" to encourage students to renew their financial-aid applications before a non-binding deadline. We begin with…
▽ More
In many settings, interventions may be more effective for some individuals than others, so that targeting interventions may be beneficial. We analyze the value of targeting in the context of a large-scale field experiment with over 53,000 college students, where the goal was to use "nudges" to encourage students to renew their financial-aid applications before a non-binding deadline. We begin with baseline approaches to targeting. First, we target based on a causal forest that estimates heterogeneous treatment effects and then assigns students to treatment according to those estimated to have the highest treatment effects. Next, we evaluate two alternative targeting policies, one targeting students with low predicted probability of renewing financial aid in the absence of the treatment, the other targeting those with high probability. The predicted baseline outcome is not the ideal criterion for targeting, nor is it a priori clear whether to prioritize low, high, or intermediate predicted probability. Nonetheless, targeting on low baseline outcomes is common in practice, for example because the relationship between individual characteristics and treatment effects is often difficult or impossible to estimate with historical data. We propose hybrid approaches that incorporate the strengths of both predictive approaches (accurate estimation) and causal approaches (correct criterion); we show that targeting intermediate baseline outcomes is most effective in our specific application, while targeting based on low baseline outcomes is detrimental. In one year of the experiment, nudging all students improved early filing by an average of 6.4 percentage points over a baseline average of 37% filing, and we estimate that targeting half of the students using our preferred policy attains around 75% of this benefit.
△ Less
Submitted 31 May, 2024; v1 submitted 12 October, 2023;
originally announced October 2023.
-
Double and Single Descent in Causal Inference with an Application to High-Dimensional Synthetic Control
Authors:
Jann Spiess,
Guido Imbens,
Amar Venugopal
Abstract:
Motivated by a recent literature on the double-descent phenomenon in machine learning, we consider highly over-parameterized models in causal inference, including synthetic control with many control units. In such models, there may be so many free parameters that the model fits the training data perfectly. We first investigate high-dimensional linear regression for imputing wage data and estimatin…
▽ More
Motivated by a recent literature on the double-descent phenomenon in machine learning, we consider highly over-parameterized models in causal inference, including synthetic control with many control units. In such models, there may be so many free parameters that the model fits the training data perfectly. We first investigate high-dimensional linear regression for imputing wage data and estimating average treatment effects, where we find that models with many more covariates than sample size can outperform simple ones. We then document the performance of high-dimensional synthetic control estimators with many control units. We find that adding control units can help improve imputation performance even beyond the point where the pre-treatment fit is perfect. We provide a unified theoretical perspective on the performance of these high-dimensional models. Specifically, we show that more complex models can be interpreted as model-averaging estimators over simpler ones, which we link to an improvement in average performance. This perspective yields concrete insights into the use of synthetic control when control units are many relative to the number of pre-treatment periods.
△ Less
Submitted 12 October, 2023; v1 submitted 1 May, 2023;
originally announced May 2023.
-
Optimal Pre-Analysis Plans: Statistical Decisions Subject to Implementability
Authors:
Maximilian Kasy,
Jann Spiess
Abstract:
What is the purpose of pre-analysis plans, and how should they be designed? We propose a principal-agent model where a decision-maker relies on selective but truthful reports by an analyst. The analyst has data access, and non-aligned objectives. In this model, the implementation of statistical decision rules (tests, estimators) requires an incentive-compatible mechanism. We first characterize whi…
▽ More
What is the purpose of pre-analysis plans, and how should they be designed? We propose a principal-agent model where a decision-maker relies on selective but truthful reports by an analyst. The analyst has data access, and non-aligned objectives. In this model, the implementation of statistical decision rules (tests, estimators) requires an incentive-compatible mechanism. We first characterize which decision rules can be implemented. We then characterize optimal statistical decision rules subject to implementability. We show that implementation requires pre-analysis plans. Focussing specifically on hypothesis tests, we show that optimal rejection rules pre-register a valid test for the case when all data is reported, and make worst-case assumptions about unreported data. Optimal tests can be found as a solution to a linear-programming problem.
△ Less
Submitted 9 October, 2023; v1 submitted 20 August, 2022;
originally announced August 2022.
-
Algorithmic Assistance with Recommendation-Dependent Preferences
Authors:
Bryce McLaughlin,
Jann Spiess
Abstract:
When an algorithm provides risk assessments, we typically think of them as helpful inputs to human decisions, such as when risk scores are presented to judges or doctors. However, a decision-maker may not only react to the information provided by the algorithm. The decision-maker may also view the algorithmic recommendation as a default action, making it costly for them to deviate, such as when a…
▽ More
When an algorithm provides risk assessments, we typically think of them as helpful inputs to human decisions, such as when risk scores are presented to judges or doctors. However, a decision-maker may not only react to the information provided by the algorithm. The decision-maker may also view the algorithmic recommendation as a default action, making it costly for them to deviate, such as when a judge is reluctant to overrule a high-risk assessment for a defendant or a doctor fears the consequences of deviating from recommended procedures. To address such unintended consequences of algorithmic assistance, we propose a principal-agent model of joint human-machine decision-making. Within this model, we consider the effect and design of algorithmic recommendations when they affect choices not just by shifting beliefs, but also by altering preferences. We motivate this assumption from institutional factors, such as a desire to avoid audits, as well as from well-established models in behavioral science that predict loss aversion relative to a reference point, which here is set by the algorithm. We show that recommendation-dependent preferences create inefficiencies where the decision-maker is overly responsive to the recommendation. As a potential remedy, we discuss algorithms that strategically withhold recommendations, and show how they can improve the quality of final decisions.
△ Less
Submitted 19 January, 2024; v1 submitted 16 August, 2022;
originally announced August 2022.
-
On the Fairness of Machine-Assisted Human Decisions
Authors:
Talia Gillis,
Bryce McLaughlin,
Jann Spiess
Abstract:
When machine-learning algorithms are used in high-stakes decisions, we want to ensure that their deployment leads to fair and equitable outcomes. This concern has motivated a fast-growing literature that focuses on diagnosing and addressing disparities in machine predictions. However, many machine predictions are deployed to assist in decisions where a human decision-maker retains the ultimate dec…
▽ More
When machine-learning algorithms are used in high-stakes decisions, we want to ensure that their deployment leads to fair and equitable outcomes. This concern has motivated a fast-growing literature that focuses on diagnosing and addressing disparities in machine predictions. However, many machine predictions are deployed to assist in decisions where a human decision-maker retains the ultimate decision authority. In this article, we therefore consider in a formal model and in a lab experiment how properties of machine predictions affect the resulting human decisions. In our formal model of statistical decision-making, we show that the inclusion of a biased human decision-maker can revert common relationships between the structure of the algorithm and the qualities of resulting decisions. Specifically, we document that excluding information about protected groups from the prediction may fail to reduce, and may even increase, ultimate disparities. In the lab experiment, we demonstrate how predictions informed by gender-specific information can reduce average gender disparities in decisions. While our concrete theoretical results rely on specific assumptions about the data, algorithm, and decision-maker, and the experiment focuses on a particular prediction task, our findings show more broadly that any study of critical properties of complex decision systems, such as the fairness of machine-assisted human decisions, should go beyond focusing on the underlying algorithmic predictions in isolation.
△ Less
Submitted 23 September, 2023; v1 submitted 28 October, 2021;
originally announced October 2021.
-
Unpacking the Black Box: Regulating Algorithmic Decisions
Authors:
Laura Blattner,
Scott Nelson,
Jann Spiess
Abstract:
What should regulators of complex algorithms regulate? We propose a model of oversight over 'black-box' algorithms used in high-stakes applications such as lending, medical testing, or hiring. In our model, a regulator is limited in how much she can learn about a black-box model deployed by an agent with misaligned preferences. The regulator faces two choices: first, whether to allow for the use o…
▽ More
What should regulators of complex algorithms regulate? We propose a model of oversight over 'black-box' algorithms used in high-stakes applications such as lending, medical testing, or hiring. In our model, a regulator is limited in how much she can learn about a black-box model deployed by an agent with misaligned preferences. The regulator faces two choices: first, whether to allow for the use of complex algorithms; and second, which key properties of algorithms to regulate. We show that limiting agents to algorithms that are simple enough to be fully transparent is inefficient as long as the misalignment is limited and complex algorithms have sufficiently better performance than simple ones. Allowing for complex algorithms can improve welfare, but the gains depend on how the regulator regulates them. Regulation that focuses on the overall average behavior of algorithms, for example based on standard explainer tools, will generally be inefficient. Targeted regulation that focuses on the source of incentive misalignment, e.g., excess false positives or racial disparities, can provide second-best solutions. We provide empirical support for our theoretical findings using an application in consumer lending, where we document that complex models regulated based on context-specific explanation tools outperform simple, fully transparent models. This gain from complex models represents a Pareto improvement across our empirical applications that is preferred both by the lender and from the perspective of the financial regulator.
△ Less
Submitted 31 May, 2024; v1 submitted 5 October, 2021;
originally announced October 2021.
-
Revisiting Event Study Designs: Robust and Efficient Estimation
Authors:
Kirill Borusyak,
Xavier Jaravel,
Jann Spiess
Abstract:
We develop a framework for difference-in-differences designs with staggered treatment adoption and heterogeneous causal effects. We show that conventional regression-based estimators fail to provide unbiased estimates of relevant estimands absent strong restrictions on treatment-effect homogeneity. We then derive the efficient estimator addressing this challenge, which takes an intuitive "imputati…
▽ More
We develop a framework for difference-in-differences designs with staggered treatment adoption and heterogeneous causal effects. We show that conventional regression-based estimators fail to provide unbiased estimates of relevant estimands absent strong restrictions on treatment-effect homogeneity. We then derive the efficient estimator addressing this challenge, which takes an intuitive "imputation" form when treatment-effect heterogeneity is unrestricted. We characterize the asymptotic behavior of the estimator, propose tools for inference, and develop tests for identifying assumptions. Our method applies with time-varying controls, in triple-difference designs, and with certain non-binary treatments. We show the practical relevance of our results in a simulation study and an application. Studying the consumption response to tax rebates in the United States, we find that the notional marginal propensity to consume is between 8 and 11 percent in the first quarter - about half as large as benchmark estimates used to calibrate macroeconomic models - and predominantly occurs in the first month after the rebate.
△ Less
Submitted 16 January, 2024; v1 submitted 27 August, 2021;
originally announced August 2021.
-
Improving Inference from Simple Instruments through Compliance Estimation
Authors:
Stephen Coussens,
Jann Spiess
Abstract:
Instrumental variables (IV) regression is widely used to estimate causal treatment effects in settings where receipt of treatment is not fully random, but there exists an instrument that generates exogenous variation in treatment exposure. While IV can recover consistent treatment effect estimates, they are often noisy. Building upon earlier work in biostatistics (Joffe and Brensinger, 2003) and r…
▽ More
Instrumental variables (IV) regression is widely used to estimate causal treatment effects in settings where receipt of treatment is not fully random, but there exists an instrument that generates exogenous variation in treatment exposure. While IV can recover consistent treatment effect estimates, they are often noisy. Building upon earlier work in biostatistics (Joffe and Brensinger, 2003) and relating to an evolving literature in econometrics (including Abadie et al., 2019; Huntington-Klein, 2020; Borusyak and Hull, 2020), we study how to improve the efficiency of IV estimates by exploiting the predictable variation in the strength of the instrument. In the case where both the treatment and instrument are binary and the instrument is independent of baseline covariates, we study weighting each observation according to its estimated compliance (that is, its conditional probability of being affected by the instrument), which we motivate from a (constrained) solution of the first-stage prediction problem implicit to IV. The resulting estimator can leverage machine learning to estimate compliance as a function of baseline covariates. We derive the large-sample properties of a specific implementation of a weighted IV estimator in the potential outcomes and local average treatment effect (LATE) frameworks, and provide tools for inference that remain valid even when the weights are estimated nonparametrically. With both theoretical results and a simulation study, we demonstrate that compliance weighting meaningfully reduces the variance of IV estimates when first-stage heterogeneity is present, and that this improvement often outweighs any difference between the compliance-weighted and unweighted IV estimands. These results suggest that in a variety of applied settings, the precision of IV estimates can be substantially improved by incorporating compliance estimation.
△ Less
Submitted 8 August, 2021;
originally announced August 2021.
-
Finding Subgroups with Significant Treatment Effects
Authors:
Jann Spiess,
Vasilis Syrgkanis,
Victor Yaneng Wang
Abstract:
Researchers often run resource-intensive randomized controlled trials (RCTs) to estimate the causal effects of interventions on outcomes of interest. Yet these outcomes are often noisy, and estimated overall effects can be small or imprecise. Nevertheless, we may still be able to produce reliable evidence of the efficacy of an intervention by finding subgroups with significant effects. In this pap…
▽ More
Researchers often run resource-intensive randomized controlled trials (RCTs) to estimate the causal effects of interventions on outcomes of interest. Yet these outcomes are often noisy, and estimated overall effects can be small or imprecise. Nevertheless, we may still be able to produce reliable evidence of the efficacy of an intervention by finding subgroups with significant effects. In this paper, we propose a machine-learning method that is specifically optimized for finding such subgroups in noisy data. Unlike available methods for personalized treatment assignment, our tool is fundamentally designed to take significance testing into account: it produces a subgroup that is chosen to maximize the probability of obtaining a statistically significant positive treatment effect. We provide a computationally efficient implementation using decision trees and demonstrate its gain over selecting subgroups based on positive (estimated) treatment effects. Compared to standard tree-based regression and classification tools, this approach tends to yield higher power in detecting subgroups affected by the treatment.
△ Less
Submitted 20 December, 2023; v1 submitted 11 March, 2021;
originally announced March 2021.
-
A Design-Based Perspective on Synthetic Control Methods
Authors:
Lea Bottmer,
Guido Imbens,
Jann Spiess,
Merrill Warnick
Abstract:
Since their introduction in Abadie and Gardeazabal (2003), Synthetic Control (SC) methods have quickly become one of the leading methods for estimating causal effects in observational studies in settings with panel data. Formal discussions often motivate SC methods by the assumption that the potential outcomes were generated by a factor model. Here we study SC methods from a design-based perspecti…
▽ More
Since their introduction in Abadie and Gardeazabal (2003), Synthetic Control (SC) methods have quickly become one of the leading methods for estimating causal effects in observational studies in settings with panel data. Formal discussions often motivate SC methods by the assumption that the potential outcomes were generated by a factor model. Here we study SC methods from a design-based perspective, assuming a model for the selection of the treated unit(s) and period(s). We show that the standard SC estimator is generally biased under random assignment. We propose a Modified Unbiased Synthetic Control (MUSC) estimator that guarantees unbiasedness under random assignment and derive its exact, randomization-based, finite-sample variance. We also propose an unbiased estimator for this variance. We document in settings with real data that under random assignment, SC-type estimators can have root mean-squared errors that are substantially lower than that of other common estimators. We show that such an improvement is weakly guaranteed if the treated period is similar to the other periods, for example, if the treated period was randomly selected. While our results only directly apply in settings where treatment is assigned randomly, we believe that they can complement model-based approaches even for observational studies.
△ Less
Submitted 19 July, 2023; v1 submitted 22 January, 2021;
originally announced January 2021.
-
Bias Reduction in Instrumental Variable Estimation through First-Stage Shrinkage
Authors:
Jann Spiess
Abstract:
The two-stage least-squares (2SLS) estimator is known to be biased when its first-stage fit is poor. I show that better first-stage prediction can alleviate this bias. In a two-stage linear regression model with Normal noise, I consider shrinkage in the estimation of the first-stage instrumental variable coefficients. For at least four instrumental variables and a single endogenous regressor, I es…
▽ More
The two-stage least-squares (2SLS) estimator is known to be biased when its first-stage fit is poor. I show that better first-stage prediction can alleviate this bias. In a two-stage linear regression model with Normal noise, I consider shrinkage in the estimation of the first-stage instrumental variable coefficients. For at least four instrumental variables and a single endogenous regressor, I establish that the standard 2SLS estimator is dominated with respect to bias. The dominating IV estimator applies James-Stein type shrinkage in a first-stage high-dimensional Normal-means problem followed by a control-function approach in the second stage. It preserves invariances of the structural instrumental variable equations.
△ Less
Submitted 31 October, 2017; v1 submitted 21 August, 2017;
originally announced August 2017.
-
Unbiased Shrinkage Estimation
Authors:
Jann Spiess
Abstract:
Shrinkage estimation usually reduces variance at the cost of bias. But when we care only about some parameters of a model, I show that we can reduce variance without incurring bias if we have additional information about the distribution of covariates. In a linear regression model with homoscedastic Normal noise, I consider shrinkage estimation of the nuisance parameters associated with control va…
▽ More
Shrinkage estimation usually reduces variance at the cost of bias. But when we care only about some parameters of a model, I show that we can reduce variance without incurring bias if we have additional information about the distribution of covariates. In a linear regression model with homoscedastic Normal noise, I consider shrinkage estimation of the nuisance parameters associated with control variables. For at least three control variables and exogenous treatment, I establish that the standard least-squares estimator is dominated with respect to squared-error loss in the treatment effect even among unbiased estimators and even when the target parameter is low-dimensional. I construct the dominating estimator by a variant of James-Stein shrinkage in a high-dimensional Normal-means problem. It can be interpreted as an invariant generalized Bayes estimator with an uninformative (improper) Jeffreys prior in the target parameter.
△ Less
Submitted 31 October, 2017; v1 submitted 21 August, 2017;
originally announced August 2017.
-
Machine-Learning Tests for Effects on Multiple Outcomes
Authors:
Jens Ludwig,
Sendhil Mullainathan,
Jann Spiess
Abstract:
In this paper we present tools for applied researchers that re-purpose off-the-shelf methods from the computer-science field of machine learning to create a "discovery engine" for data from randomized controlled trials (RCTs). The applied problem we seek to solve is that economists invest vast resources into carrying out RCTs, including the collection of a rich set of candidate outcome measures. B…
▽ More
In this paper we present tools for applied researchers that re-purpose off-the-shelf methods from the computer-science field of machine learning to create a "discovery engine" for data from randomized controlled trials (RCTs). The applied problem we seek to solve is that economists invest vast resources into carrying out RCTs, including the collection of a rich set of candidate outcome measures. But given concerns about inference in the presence of multiple testing, economists usually wind up exploring just a small subset of the hypotheses that the available data could be used to test. This prevents us from extracting as much information as possible from each RCT, which in turn impairs our ability to develop new theories or strengthen the design of policy interventions. Our proposed solution combines the basic intuition of reverse regression, where the dependent variable of interest now becomes treatment assignment itself, with methods from machine learning that use the data themselves to flexibly identify whether there is any function of the outcomes that predicts (or has signal about) treatment group status. This leads to correctly-sized tests with appropriate $p$-values, which also have the important virtue of being easy to implement in practice. One open challenge that remains with our work is how to meaningfully interpret the signal that these methods find.
△ Less
Submitted 9 May, 2019; v1 submitted 5 July, 2017;
originally announced July 2017.