-
Monte Carlo Integration in Simple and Complex Simulation Designs
Authors:
Ashley I. Naimi,
David Benkeser,
Jacqueline E. Rudolph
Abstract:
Simulation studies are used to evaluate and compare the properties of statistical methods in controlled experimental settings. In most cases, performing a simulation study requires knowledge of the true value of the parameter, or estimand, of interest. However, in many simulation designs, the true value of the estimand is difficult to compute analytically. Here, we illustrate the use of Monte Carl…
▽ More
Simulation studies are used to evaluate and compare the properties of statistical methods in controlled experimental settings. In most cases, performing a simulation study requires knowledge of the true value of the parameter, or estimand, of interest. However, in many simulation designs, the true value of the estimand is difficult to compute analytically. Here, we illustrate the use of Monte Carlo integration to compute true estimand values in simple and complex simulation designs. We provide general pseudocode that can be replicated in any software program of choice to demonstrate key principles in using Monte Carlo integration in two scenarios: a simple three variable simulation where interest lies in the marginally adjusted odds ratio; and a more complex causal mediation analysis where interest lies in the controlled direct effect in the presence of mediator-outcome confounders affected by the exposure. We discuss general strategies that can be used to minimize Monte Carlo error, and to serve as checks on the simulation program to avoid coding errors. R programming code is provided illustrating the application of our pseudocode in these settings.
△ Less
Submitted 3 July, 2024; v1 submitted 21 June, 2024;
originally announced June 2024.
-
Nonparametric Motion Control in Functional Connectivity Studies in Children with Autism Spectrum Disorder
Authors:
Jialu Ran,
Sarah Shultz,
Benjamin B. Risk,
David Benkeser
Abstract:
Autism Spectrum Disorder (ASD) is a neurodevelopmental condition associated with difficulties with social interactions, communication, and restricted or repetitive behaviors. To characterize ASD, investigators often use functional connectivity derived from resting-state functional magnetic resonance imaging of the brain. However, participants' head motion during the scanning session can induce mot…
▽ More
Autism Spectrum Disorder (ASD) is a neurodevelopmental condition associated with difficulties with social interactions, communication, and restricted or repetitive behaviors. To characterize ASD, investigators often use functional connectivity derived from resting-state functional magnetic resonance imaging of the brain. However, participants' head motion during the scanning session can induce motion artifacts. Many studies remove scans with excessive motion, which can lead to drastic reductions in sample size and introduce selection bias. To avoid such exclusions, we propose an estimand inspired by causal inference methods that quantifies the difference in average functional connectivity in autistic and non-ASD children while standardizing motion relative to the low motion distribution in scans that pass motion quality control. We introduce a nonparametric estimator for motion control, called MoCo, that uses all participants and flexibly models the impacts of motion and other relevant features using an ensemble of machine learning methods. We establish large-sample efficiency and multiple robustness of our proposed estimator. The framework is applied to estimate the difference in functional connectivity between 132 autistic and 245 non-ASD children, of which 34 and 126 pass motion quality control. MoCo appears to dramatically reduce motion artifacts relative to no participant removal, while more efficiently utilizing participant data and accounting for possible selection biases relative to the naïve approach with participant removal.
△ Less
Submitted 18 June, 2024;
originally announced June 2024.
-
Statistical learning for constrained functional parameters in infinite-dimensional models with applications in fair machine learning
Authors:
Razieh Nabi,
Nima S. Hejazi,
Mark J. van der Laan,
David Benkeser
Abstract:
Constrained learning has become increasingly important, especially in the realm of algorithmic fairness and machine learning. In these settings, predictive models are developed specifically to satisfy pre-defined notions of fairness. Here, we study the general problem of constrained statistical machine learning through a statistical functional lens. We consider learning a function-valued parameter…
▽ More
Constrained learning has become increasingly important, especially in the realm of algorithmic fairness and machine learning. In these settings, predictive models are developed specifically to satisfy pre-defined notions of fairness. Here, we study the general problem of constrained statistical machine learning through a statistical functional lens. We consider learning a function-valued parameter of interest under the constraint that one or several pre-specified real-valued functional parameters equal zero or are otherwise bounded. We characterize the constrained functional parameter as the minimizer of a penalized risk criterion using a Lagrange multiplier formulation. We show that closed-form solutions for the optimal constrained parameter are often available, providing insight into mechanisms that drive fairness in predictive models. Our results also suggest natural estimators of the constrained parameter that can be constructed by combining estimates of unconstrained parameters of the data generating distribution. Thus, our estimation procedure for constructing fair machine learning algorithms can be applied in conjunction with any statistical learning approach and off-the-shelf software. We demonstrate the generality of our method by explicitly considering a number of examples of statistical fairness constraints and implementing the approach using several popular learning approaches.
△ Less
Submitted 15 April, 2024;
originally announced April 2024.
-
Targeted Machine Learning for Average Causal Effect Estimation Using the Front-Door Functional
Authors:
Anna Guo,
David Benkeser,
Razieh Nabi
Abstract:
Evaluating the average causal effect (ACE) of a treatment on an outcome often involves overcoming the challenges posed by confounding factors in observational studies. A traditional approach uses the back-door criterion, seeking adjustment sets to block confounding paths between treatment and outcome. However, this method struggles with unmeasured confounders. As an alternative, the front-door cri…
▽ More
Evaluating the average causal effect (ACE) of a treatment on an outcome often involves overcoming the challenges posed by confounding factors in observational studies. A traditional approach uses the back-door criterion, seeking adjustment sets to block confounding paths between treatment and outcome. However, this method struggles with unmeasured confounders. As an alternative, the front-door criterion offers a solution, even in the presence of unmeasured confounders between treatment and outcome. This method relies on identifying mediators that are not directly affected by these confounders and that completely mediate the treatment's effect. Here, we introduce novel estimation strategies for the front-door criterion based on the targeted minimum loss-based estimation theory. Our estimators work across diverse scenarios, handling binary, continuous, and multivariate mediators. They leverage data-adaptive machine learning algorithms, minimizing assumptions and ensuring key statistical properties like asymptotic linearity, double-robustness, efficiency, and valid estimates within the target parameter space. We establish conditions under which the nuisance functional estimations ensure the root n-consistency of ACE estimators. Our numerical experiments show the favorable finite sample performance of the proposed estimators. We demonstrate the applicability of these estimators to analyze the effect of early stage academic performance on future yearly income using data from the Finnish Social Science Data Archive.
△ Less
Submitted 15 December, 2023;
originally announced December 2023.
-
A Huber loss-based super learner with applications to healthcare expenditures
Authors:
Ziyue Wu,
David Benkeser
Abstract:
Complex distributions of the healthcare expenditure pose challenges to statistical modeling via a single model. Super learning, an ensemble method that combines a range of candidate models, is a promising alternative for cost estimation and has shown benefits over a single model. However, standard approaches to super learning may have poor performance in settings where extreme values are present,…
▽ More
Complex distributions of the healthcare expenditure pose challenges to statistical modeling via a single model. Super learning, an ensemble method that combines a range of candidate models, is a promising alternative for cost estimation and has shown benefits over a single model. However, standard approaches to super learning may have poor performance in settings where extreme values are present, such as healthcare expenditure data. We propose a super learner based on the Huber loss, a "robust" loss function that combines squared error loss with absolute loss to down-weight the influence of outliers. We derive oracle inequalities that establish bounds on the finite-sample and asymptotic performance of the method. We show that the proposed method can be used both directly to optimize Huber risk, as well as in finite-sample settings where optimizing mean squared error is the ultimate goal. For this latter scenario, we provide two methods for performing a grid search for values of the robustification parameter indexing the Huber loss. Simulations and real data analysis demonstrate appreciable finite-sample gains in cost prediction and causal effect estimation using our proposed method.
△ Less
Submitted 13 May, 2022;
originally announced May 2022.
-
Efficient estimation of modified treatment policy effects based on the generalized propensity score
Authors:
Nima S. Hejazi,
David Benkeser,
Iván Díaz,
Mark J. van der Laan
Abstract:
Continuous treatments have posed a significant challenge for causal inference, both in the formulation and identification of scientifically meaningful effects and in their robust estimation. Traditionally, focus has been placed on techniques applicable to binary or categorical treatments with few levels, allowing for the application of propensity score-based methodology with relative ease. Efforts…
▽ More
Continuous treatments have posed a significant challenge for causal inference, both in the formulation and identification of scientifically meaningful effects and in their robust estimation. Traditionally, focus has been placed on techniques applicable to binary or categorical treatments with few levels, allowing for the application of propensity score-based methodology with relative ease. Efforts to accommodate continuous treatments introduced the generalized propensity score, yet estimators of this nuisance parameter commonly utilize parametric regression strategies that sharply limit the robustness and efficiency of inverse probability weighted estimators of causal effect parameters. We formulate and investigate a novel, flexible estimator of the generalized propensity score based on a nonparametric function estimator that provably converges at a suitably fast rate to the target functional so as to facilitate statistical inference. With this estimator, we demonstrate the construction of nonparametric inverse probability weighted estimators of a class of causal effect estimands tailored to continuous treatments. To ensure the asymptotic efficiency of our proposed estimators, we outline several non-restrictive selection procedures for utilizing a sieve estimation framework to undersmooth estimators of the generalized propensity score. We provide the first characterization of such inverse probability weighted estimators achieving the nonparametric efficiency bound in a setting with continuous treatments, demonstrating this in numerical experiments. We further evaluate the higher-order efficiency of our proposed estimators by deriving and numerically examining the second-order remainder of the corresponding efficient influence function in the nonparametric model. Open source software implementing our proposed estimation techniques, the haldensify R package, is briefly discussed.
△ Less
Submitted 28 June, 2022; v1 submitted 11 May, 2022;
originally announced May 2022.
-
Inference for natural mediation effects under case-cohort sampling with applications in identifying COVID-19 vaccine correlates of protection
Authors:
David Benkeser,
Iván Díaz,
Jialu Ran
Abstract:
Combating the SARS-CoV2 pandemic will require the fast development of effective preventive vaccines. Regulatory agencies may open accelerated approval pathways for vaccines if an immunological marker can be established as a mediator of a vaccine's protection. A rich source of information for identifying such correlates are large-scale efficacy trials of COVID-19 vaccines, where immune responses ar…
▽ More
Combating the SARS-CoV2 pandemic will require the fast development of effective preventive vaccines. Regulatory agencies may open accelerated approval pathways for vaccines if an immunological marker can be established as a mediator of a vaccine's protection. A rich source of information for identifying such correlates are large-scale efficacy trials of COVID-19 vaccines, where immune responses are measured subject to a case-cohort sampling design. We propose two approaches to estimation of mediation parameters in the context of case-cohort sampling designs. We establish the theoretical large-sample efficiency of our proposed estimators and evaluate them in a realistic simulation to understand whether they can be employed in the analysis of COVID-19 vaccine efficacy trials.
△ Less
Submitted 3 March, 2021;
originally announced March 2021.
-
Efficient nonparametric inference on the effects of stochastic interventions under two-phase sampling, with applications to vaccine efficacy trials
Authors:
Nima S. Hejazi,
Mark J. van der Laan,
Holly E. Janes,
Peter B. Gilbert,
David C. Benkeser
Abstract:
The advent and subsequent widespread availability of preventive vaccines has altered the course of public health over the past century. Despite this success, effective vaccines to prevent many high-burden diseases, including HIV, have been slow to develop. Vaccine development can be aided by the identification of immune response markers that serve as effective surrogates for clinically significant…
▽ More
The advent and subsequent widespread availability of preventive vaccines has altered the course of public health over the past century. Despite this success, effective vaccines to prevent many high-burden diseases, including HIV, have been slow to develop. Vaccine development can be aided by the identification of immune response markers that serve as effective surrogates for clinically significant infection or disease endpoints. However, measuring immune response marker activity is often costly, which has motivated the usage of two-phase sampling for immune response evaluation in clinical trials of preventive vaccines. In such trials, the measurement of immunological markers is performed on a subset of trial participants, where enrollment in this second phase is potentially contingent on the observed study outcome and other participant-level information. We propose nonparametric methodology for efficiently estimating a counterfactual parameter that quantifies the impact of a given immune response marker on the subsequent probability of infection. Along the way, we fill in theoretical gaps pertaining to the asymptotic behavior of nonparametric efficient estimators in the context of two-phase sampling, including a multiple robustness property enjoyed by our estimators. Techniques for constructing confidence intervals and hypothesis tests are presented, and an open source software implementation of the methodology, the txshift R package, is introduced. We illustrate the proposed techniques using data from a recent preventive HIV vaccine efficacy trial.
△ Less
Submitted 3 April, 2020; v1 submitted 30 March, 2020;
originally announced March 2020.
-
Nonparametric inference for interventional effects with multiple mediators
Authors:
David Benkeser
Abstract:
Understanding the pathways whereby an intervention has an effect on an outcome is a common scientific goal. A rich body of literature provides various decompositions of the total intervention effect into pathway specific effects. Interventional direct and indirect effects provide one such decomposition. Existing estimators of these effects are based on parametric models with confidence interval es…
▽ More
Understanding the pathways whereby an intervention has an effect on an outcome is a common scientific goal. A rich body of literature provides various decompositions of the total intervention effect into pathway specific effects. Interventional direct and indirect effects provide one such decomposition. Existing estimators of these effects are based on parametric models with confidence interval estimation facilitated via the nonparametric bootstrap. We provide theory that allows for more flexible, possibly machine learning-based, estimation techniques to be considered. In particular, we establish weak convergence results that facilitate the construction of closed-form confidence intervals and hypothesis tests. Finally, we demonstrate multiple robustness properties of the proposed estimators. Simulations show that inference based on large-sample theory has adequate small-sample performance. Our work thus provides a means of leveraging modern statistical learning techniques in estimation of interventional mediation effects.
△ Less
Submitted 16 January, 2020;
originally announced January 2020.
-
Efficient Estimation of Pathwise Differentiable Target Parameters with the Undersmoothed Highly Adaptive Lasso
Authors:
Mark J. van der Laan,
David Benkeser,
Weixin Cai
Abstract:
We consider estimation of a functional parameter of a realistically modeled data distribution based on observing independent and identically distributed observations. We define an $m$-th order Spline Highly Adaptive Lasso Minimum Loss Estimator (Spline HAL-MLE) of a functional parameter that is defined by minimizing the empirical risk function over an $m$-th order smoothness class of functions. We…
▽ More
We consider estimation of a functional parameter of a realistically modeled data distribution based on observing independent and identically distributed observations. We define an $m$-th order Spline Highly Adaptive Lasso Minimum Loss Estimator (Spline HAL-MLE) of a functional parameter that is defined by minimizing the empirical risk function over an $m$-th order smoothness class of functions. We show that this $m$-th order smoothness class consists of all functions that can be represented as an infinitesimal linear combination of tensor products of $\leq m$-th order spline-basis functions, and involves assuming $m$-derivatives in each coordinate. By selecting $m$ with cross-validation we obtain a Spline-HAL-MLE that is able to adapt to the underlying unknown smoothness of the true function, while guaranteeing a rate of convergence faster than $n^{-1/4}$, as long as the true function is cadlag (right-continuous with left-hand limits) and has finite sectional variation norm. The $m=0$-smoothness class consists of all cadlag functions with finite sectional variation norm and corresponds with the original HAL-MLE defined in van der Laan (2015).
In this article we establish that this Spline-HAL-MLE yields an asymptotically efficient estimator of any smooth feature of the functional parameter under an easily verifiable global undersmoothing condition. A sufficient condition for the latter condition is that the minimum of the empirical mean of the selected basis functions is smaller than a constant times $n^{-1/2}$, which is not parameter specific and enforces the selection of the $L_1$-norm in the lasso to be large enough to include sparsely supported basis. We demonstrate our general result for the $m=0$-HAL-MLE of the average treatment effect and of the integral of the square of the data density. We also present simulations for these two examples confirming the theory.
△ Less
Submitted 2 July, 2021; v1 submitted 14 August, 2019;
originally announced August 2019.
-
Design and analysis considerations for a sequentially randomized HIV prevention trial
Authors:
David Benkeser,
Keith Horvath,
Cathy Reback,
Joshua Rusow,
Michael Hudgens
Abstract:
TechStep is a randomized trial of a mobile health interventions targeted towards transgender adolescents. The interventions include a short message system, a mobile-optimized web application, and electronic counseling. The primary outcomes are self-reported sexual risk behaviors and uptake of HIV preventing medication. In order that we may evaluate the efficacy of several different combinations of…
▽ More
TechStep is a randomized trial of a mobile health interventions targeted towards transgender adolescents. The interventions include a short message system, a mobile-optimized web application, and electronic counseling. The primary outcomes are self-reported sexual risk behaviors and uptake of HIV preventing medication. In order that we may evaluate the efficacy of several different combinations of interventions, the trial has a sequentially randomized design. We use a causal framework to formalize the estimands of the primary and key secondary analyses of the TechStep trial data. Targeted minimum loss-based estimators of these quantities are described and studied in simulation.
△ Less
Submitted 21 June, 2019;
originally announced June 2019.
-
A nonparametric super-efficient estimator of the average treatment effect
Authors:
David Benkeser,
Weixin Cai,
Mark J van der Laan
Abstract:
Doubly robust estimators of causal effects are a popular means of estimating causal effects. Such estimators combine an estimate of the conditional mean of the outcome given treatment and confounders (the so-called outcome regression) with an estimate of the conditional probability of treatment given confounders (the propensity score) to generate an estimate of the effect of interest. In addition…
▽ More
Doubly robust estimators of causal effects are a popular means of estimating causal effects. Such estimators combine an estimate of the conditional mean of the outcome given treatment and confounders (the so-called outcome regression) with an estimate of the conditional probability of treatment given confounders (the propensity score) to generate an estimate of the effect of interest. In addition to enjoying the double-robustness property, these estimators have additional benefits. First, flexible regression tools, such as those developed in the field of machine learning, can be utilized to estimate the relevant regressions, while the estimators of the treatment effects retain desirable statistical properties. Furthermore, these estimators are often statistically efficient, achieving the lower bound on the variance of regular, asymptotically linear estimators. However, in spite of their asymptotic optimality, in problems where causal estimands are weakly identifiable, these estimators may behave erratically. We propose two new estimation techniques for use in these challenging settings. Our estimators build on two existing frameworks for efficient estimation: targeted minimum loss estimation and one-step estimation. However, rather than using an estimate of the propensity score in their construction, we instead opt for an alternative regression quantity when building our estimators: the conditional probability of treatment given the conditional mean outcome. We discuss the theoretical implications and demonstrate the estimators' performance in simulated and real data.
△ Less
Submitted 15 January, 2019;
originally announced January 2019.
-
Robust inference on the average treatment effect using the outcome highly adaptive lasso
Authors:
Cheng Ju,
David Benkeser,
Mark J. van der Laan
Abstract:
Many estimators of the average effect of a treatment on an outcome require estimation of the propensity score, the outcome regression, or both. It is often beneficial to utilize flexible techniques such as semiparametric regression or machine learning to estimate these quantities. However, optimal estimation of these regressions does not necessarily lead to optimal estimation of the average treatm…
▽ More
Many estimators of the average effect of a treatment on an outcome require estimation of the propensity score, the outcome regression, or both. It is often beneficial to utilize flexible techniques such as semiparametric regression or machine learning to estimate these quantities. However, optimal estimation of these regressions does not necessarily lead to optimal estimation of the average treatment effect, particularly in settings with strong instrumental variables. A recent proposal addressed these issues via the outcome-adaptive lasso, a penalized regression technique for estimating the propensity score that seeks to minimize the impact of instrumental variables on treatment effect estimators. However, a notable limitation of this approach is that its application is restricted to parametric models. We propose a more flexible alternative that we call the outcome highly adaptive lasso. We discuss large sample theory for this estimator and propose closed form confidence intervals based on the proposed estimator. We show via simulation that our method offers benefits over several popular approaches.
△ Less
Submitted 12 May, 2019; v1 submitted 18 June, 2018;
originally announced June 2018.
-
A machine learning-based approach for estimating and testing associations with multivariate outcomes
Authors:
David Benkeser,
Andrew Mertens,
Benjamin F. Arnold,
John M. Colford Jr.,
Alan Hubbard,
Aryeh Stein,
N. Lntshotshole Jumbe,
Mark van der Laan
Abstract:
We propose a method for summarizing the strength of association between a set of variables and a multivariate outcome. Classical summary measures are appropriate when linear relationships exist between covariates and outcomes, while our approach provides an alternative that is useful in situations where complex relationships may be present. We utilize ensemble machine learning to detect nonlinear…
▽ More
We propose a method for summarizing the strength of association between a set of variables and a multivariate outcome. Classical summary measures are appropriate when linear relationships exist between covariates and outcomes, while our approach provides an alternative that is useful in situations where complex relationships may be present. We utilize ensemble machine learning to detect nonlinear relationships and covariate interactions and propose a measure of association that captures these relationships. A hypothesis test about the proposed associative measure can be used to test the strong null hypothesis of no association between a set of variables and a multivariate outcome. Simulations demonstrate that this hypothesis test has greater power than existing methods against alternatives where covariates have nonlinear relationships with outcomes. We additionally propose measures of variable importance for groups of variables, which summarize each groups' association with the outcome. We demonstrate our methodology using data from a birth cohort study on childhood health and nutrition in the Philippines.
△ Less
Submitted 14 March, 2018; v1 submitted 13 March, 2018;
originally announced March 2018.