-
Group LASSO Variable Selection Method for Treatment Effect Generalization
Authors:
Chuyu Deng,
Brandon Koch,
David M. Vock,
Joseph S. Koopmeiners
Abstract:
Often in public health, we are interested in the treatment effect of an intervention on a population that is systematically different from the experimental population in which the intervention was originally evaluated. When treatment effect heterogeneity is present in a randomized controlled trial, generalizing the treatment effect from this experimental population to a target population of interest is a complex problem; it requires the characterization of both the treatment effect heterogeneity and the baseline covariate mismatch between the two populations. Despite the importance of this problem, the literature for variable selection in this context is limited. In this paper, we present a Group LASSO-based approach to variable selection in the context of treatment effect generalization, with an application to generalize the treatment effect of very low nicotine content cigarettes to the overall U.S. smoking population.
Submitted 7 February, 2023;
originally announced February 2023.
-
Estimating Longitudinal Causal Effects with Unobserved Noncompliance Using a Semi-Parametric G-computation Algorithm
Authors:
Ross L Peterson,
David M Vock,
Joseph S Koopmeiners
Abstract:
Participant noncompliance, in which participants do not follow their assigned treatment protocol, often obscures the causal relationship between treatment and treatment effect in randomized trials. In the longitudinal setting, the G-computation algorithm can adjust for confounding to estimate causal effects. Typically, G-computation assumes that both 1) compliance is observed; and 2) the densities of the confounders can be correctly specified. We aim to develop a G-computation estimator in the setting where both assumptions are violated. For 1), in place of unobserved compliance, we substitute probability weights derived from modeling a biomarker associated with compliance. For 2), we fit semiparametric models using predictive mean matching. Specifically, we parametrically specify only the conditional mean of the confounders, and then use predictive mean matching to randomly generate confounder data for G-computation. In both the simulation and application, we compare multiple causal estimators already established in the literature with those derived from our method. For the simulation, we generate data across different sample sizes and levels of confounding. For the application, we apply our method to a trial that sought to evaluate the effect of cigarettes with low nicotine on cigarette consumption (Center for the Evaluation of Nicotine in Cigarettes Project 2 - CENIC-P2).
Submitted 7 February, 2023;
originally announced February 2023.
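The predictive mean matching step mentioned in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `fit_line` and `pmm_draw`, the single-covariate linear mean model, and the donor-pool size `k` are all assumptions made for illustration. The idea shown is only the general one: specify a model for the conditional mean, then draw an observed value from a unit whose predicted mean is close to the target's.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b * x (hypothetical mean model)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    return a, b


def pmm_draw(x_new, xs, ys, k=3, rng=random):
    """Draw one value for a new unit by predictive mean matching.

    Only the conditional mean is modelled parametrically; the returned
    value is always an observed y taken from one of the k donors whose
    predicted mean is closest to the predicted mean at x_new.
    """
    a, b = fit_line(xs, ys)
    target = a + b * x_new
    # Rank observed units by distance between predicted means.
    order = sorted(range(len(xs)), key=lambda i: abs((a + b * xs[i]) - target))
    donors = order[:k]
    # Sample a donor uniformly and return its observed value.
    return ys[rng.choice(donors)]


# Toy data (invented for illustration only).
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.2, 1.9, 3.4, 3.8, 5.1, 6.3]
rng = random.Random(0)
draws = [pmm_draw(3.5, xs, ys, k=2, rng=rng) for _ in range(5)]
```

Because each draw is an observed value rather than a simulated one, the generated data keep the empirical shape of the confounder distribution without requiring a full density specification.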
-
Practical Guidance on Modeling Choices for the Virtual Twins Method
Authors:
Chuyu Deng,
David M. Vock,
Dana M. Carroll,
Jeffrey A. Boatman,
Dorothy K. Hatsukami,
Ning Leng,
Joseph S. Koopmeiners
Abstract:
Individuals can vary drastically in their response to the same treatment, and this heterogeneity has driven the push for more personalized medicine. Accurate and interpretable methods to identify subgroups that respond to the treatment differently from the population average are necessary to achieve this goal. The Virtual Twins (VT) method by Foster et al. is a highly cited and implemented method for subgroup identification because of its intuitive framework. However, since its initial publication, many researchers have continued to rely heavily on the authors' initial modeling suggestions without examining newer and more powerful alternatives. This leaves much of the potential of the method untapped. We comprehensively evaluate the performance of VT with different combinations of methods in each of its component steps, under a collection of linear and nonlinear problem settings. Our simulations show that the method choice for step 1 of VT is highly influential in the overall accuracy of the method, and Superlearner is a promising choice. We illustrate our findings by using VT to identify subgroups with heterogeneous treatment effects in a randomized, double-blind nicotine reduction trial.
Submitted 16 November, 2021;
originally announced November 2021.
-
Borrowing from Supplemental Sources to Estimate Causal Effects from a Primary Data Source
Authors:
Jeffrey A. Boatman,
David M. Vock,
Joseph S. Koopmeiners
Abstract:
The increasing multiplicity of data sources offers exciting possibilities in estimating the effects of a treatment, intervention, or exposure, particularly if observational and experimental sources could be used simultaneously. Borrowing between sources can potentially result in more efficient estimators, but it must be done in a principled manner to mitigate increased bias and Type I error. Furthermore, when the effect of treatment is confounded, as in observational sources or in clinical trials with noncompliance, causal effect estimators are needed to simultaneously adjust for confounding and to estimate effects across data sources. We consider the problem of estimating causal effects from a primary source and borrowing from any number of supplemental sources. We propose using regression-based estimators that borrow based on assuming exchangeability of the regression coefficients and parameters between data sources. Borrowing is accomplished with multisource exchangeability models and Bayesian model averaging. We show via simulation that a Bayesian linear model and Bayesian additive regression trees both have desirable properties and borrow under appropriate circumstances. We apply the estimators to recently completed trials of very low nicotine content cigarettes investigating their impact on smoking behavior.
Submitted 21 March, 2020;
originally announced March 2020.
-
Data mining for censored time-to-event data: A Bayesian network model for predicting cardiovascular risk from electronic health record data
Authors:
Sunayan Bandyopadhyay,
Julian Wolfson,
David M. Vock,
Gabriela Vazquez-Benitez,
Gediminas Adomavicius,
Mohamed Elidrisi,
Paul E. Johnson,
Patrick J. O'Connor
Abstract:
Models for predicting the risk of cardiovascular events based on individual patient characteristics are important tools for managing patient care. Most current and commonly used risk prediction models have been built from carefully selected epidemiological cohorts. However, the homogeneity and limited size of such cohorts restrict the predictive power and generalizability of these risk models to other populations. Electronic health data (EHD) from large health care systems provide access to data on large, heterogeneous, and contemporaneous patient populations. The unique features and challenges of EHD, including missing risk factor information, non-linear relationships between risk factors and cardiovascular event outcomes, and differing effects from different patient subgroups, demand novel machine learning approaches to risk model development. In this paper, we present a machine learning approach based on Bayesian networks trained on EHD to predict the probability of having a cardiovascular event within five years. In such data, event status may be unknown for some individuals as the event time is right-censored due to disenrollment and incomplete follow-up. Since many traditional data mining methods are not well-suited for such data, we describe how to modify both modelling and assessment techniques to account for censored observation times. We show that our approach can lead to better predictive performance than the Cox proportional hazards model (i.e., a regression-based approach commonly used for censored, time-to-event data) or a Bayesian network with ad hoc approaches to right-censoring. Our techniques are motivated by and illustrated on data from a large U.S. Midwestern health care system.
Submitted 8 April, 2014;
originally announced April 2014.