Search | arXiv e-print repository

When Do Natural Mediation Effects Differ from Their Randomized Interventional Analogues: Test and Theory

Abstract: In causal mediation analysis, the natural direct and indirect effects (natural effects) are nonparametrically unidentifiable in the presence of treatment-induced confounding, which motivated the development of randomized interventional analogues (RIAs) of the natural effects. The RIAs are easier to identify and widely used in practice. Applied researchers often interpret RIA estimates as if they were the natural effects, even though the RIAs could be poor proxies for the natural effects. This calls for practical and theoretical guidance on when the RIAs differ from or coincide with the natural effects, which this paper aims to address. We develop a novel empirical test for the divergence between the RIAs and the natural effects under the weak assumptions sufficient for identifying the RIAs and illustrate the test using the Moving to Opportunity Study. We also provide new theoretical insights on the relationship between the RIAs and the natural effects from a covariance perspective and a structural equation perspective. Additionally, we discuss previously undocumented connections between the natural effects, the RIAs, and estimands in instrumental variable analysis and Wilcoxon-Mann-Whitney tests.

Submitted 2 July, 2024; originally announced July 2024.

arXiv:2405.02612 [pdf, other]

Learning Linear Utility Functions From Pairwise Comparison Queries

Authors: Luise Ge, Brendan Juba, Yevgeniy Vorobeychik

Abstract: We study learnability of linear utility functions from pairwise comparison queries. In particular, we consider two learning objectives. The first objective is to predict out-of-sample responses to pairwise comparisons, whereas the second is to approximately recover the true parameters of the utility function. We show that in the passive learning setting, linear utilities are efficiently learnable with respect to the first objective, both when query responses are uncorrupted by noise, and under Tsybakov noise when the distributions are sufficiently "nice". In contrast, we show that utility parameters are not learnable for a large set of data distributions without strong modeling assumptions, even when query responses are noise-free. Next, we proceed to analyze the learning problem in an active learning setting. In this case, we show that even the second objective is efficiently learnable, and present algorithms for both the noise-free and noisy query response settings. Our results thus exhibit a qualitative learnability gap between passive and active learning from pairwise preference queries, demonstrating the value of the ability to select pairwise queries for utility learning.

Submitted 19 June, 2024; v1 submitted 4 May, 2024; originally announced May 2024.

Comments: Submitted to ECAI for review

arXiv:2403.14044 [pdf]

Statistical tests for comparing the associations of multiple exposures with a common outcome in Cox proportional hazard models

Authors: Rikuta Hamaya, Peilu Wang, Lin Ge, Edward L. Giovannucci, Molin Wang

Abstract: With advancement of medicine, alternative exposures or interventions are emerging with respect to a common outcome, and there are needs to formally test the difference in the associations of multiple exposures. We propose a duplication method-based multivariate Wald test in the Cox proportional hazard regression analyses to test the difference in the associations of multiple exposures with a same outcome. The proposed method applies to linear or categorical exposures. To illustrate our method, we applied our method to compare the associations between alignment to two different dietary patterns, either as continuous or quartile exposures, and incident chronic diseases, defined as a composite of CVD, cancer, and diabetes, in the Health Professional Follow-up Study. Relevant sample codes in R that implement the proposed approach are provided. The proposed duplication-method-based approach offers a flexible, formal statistical test of multiple exposures for the common outcome with minimal assumptions.

Submitted 20 March, 2024; originally announced March 2024.

arXiv:2312.17122 [pdf, other]

Large Language Model for Causal Decision Making

Authors: Haitao Jiang, Lin Ge, Yuhe Gao, Jianian Wang, Rui Song

Abstract: Large Language Models (LLMs) have shown their success in language understanding and reasoning on general topics. However, their capability to perform inference based on user-specified structured data and knowledge in corpus-rare concepts, such as causal decision-making is still limited. In this work, we explore the possibility of fine-tuning an open-sourced LLM into LLM4Causal, which can identify the causal task, execute a corresponding function, and interpret its numerical results based on users' queries and the provided dataset. Meanwhile, we propose a data generation process for more controllable GPT prompting and present two instruction-tuning datasets: (1) Causal-Retrieval-Bench for causal problem identification and input parameter extraction for causal function calling and (2) Causal-Interpret-Bench for in-context causal interpretation. By conducting end-to-end evaluations and two ablation studies, we showed that LLM4Causal can deliver end-to-end solutions for causal problems and provide easy-to-understand answers, which significantly outperforms the baselines.

Submitted 11 April, 2024; v1 submitted 28 December, 2023; originally announced December 2023.

arXiv:2307.00214 [pdf, ps, other]

Utilizing a Capture-Recapture Strategy to Accelerate Infectious Disease Surveillance

Authors: Lin Ge, Yuzi Zhang, Lance A. Waller, Robert H. Lyles

Abstract: Monitoring key elements of disease dynamics (e.g., prevalence, case counts) is of great importance in infectious disease prevention and control, as emphasized during the COVID-19 pandemic. To facilitate this effort, we propose a new capture-recapture (CRC) analysis strategy that takes misclassification into account from easily-administered, imperfect diagnostic test kits, such as the Rapid Antigen Test-kits or saliva tests. Our method is based on a recently proposed "anchor stream" design, whereby an existing voluntary surveillance data stream is augmented by a smaller and judiciously drawn random sample. It incorporates manufacturer-specified sensitivity and specificity parameters to account for imperfect diagnostic results in one or both data streams. For inference to accompany case count estimation, we improve upon traditional Wald-type confidence intervals by developing an adapted Bayesian credible interval for the CRC estimator that yields favorable frequentist coverage properties. When feasible, the proposed design and analytic strategy provides a more efficient solution than traditional CRC methods or random sampling-based biased-corrected estimation to monitor disease prevalence while accounting for misclassification. We demonstrate the benefits of this approach through simulation studies that underscore its potential utility in practice for economical disease monitoring among a registered closed population.

Submitted 30 June, 2023; originally announced July 2023.

arXiv:2306.10666 [pdf]

On some pitfalls of the log-linear modeling framework for capture-recapture studies in disease surveillance

Authors: Yuzi Zhang, Lin Ge, Lance A. Waller, Robert H. Lyles

Abstract: In epidemiological studies, the capture-recapture (CRC) method is a powerful tool that can be used to estimate the number of diseased cases or potentially disease prevalence based on data from overlapping surveillance systems. Estimators derived from log-linear models are widely applied by epidemiologists when analyzing CRC data. The popularity of the log-linear model framework is largely associated with its accessibility and the fact that interaction terms can allow for certain types of dependency among data streams. In this work, we shed new light on significant pitfalls associated with the log-linear model framework in the context of CRC using real data examples and simulation studies. First, we demonstrate that the log-linear model paradigm is highly exclusionary. That is, it can exclude, by design, many possible estimates that are potentially consistent with the observed data. Second, we clarify the ways in which regularly used model selection metrics (e.g., information criteria) are fundamentally deceiving in the effort to select a best model in this setting. By focusing attention on these important cautionary points and on the fundamental untestable dependency assumption made when fitting a log-linear model to CRC data, we hope to improve the quality of and transparency associated with subsequent surveillance-based CRC estimates of case counts.

Submitted 18 June, 2023; originally announced June 2023.

arXiv:2302.03558 [pdf]

doi 10.1080/00031305.2023.2250401

Enhanced Inference for Finite Population Sampling-Based Prevalence Estimation with Misclassification Errors

Authors: Lin Ge, Yuzi Zhang, Lance A. Waller, Robert H. Lyles

Abstract: Epidemiologic screening programs often make use of tests with small, but non-zero probabilities of misdiagnosis. In this article, we assume the target population is finite with a fixed number of true cases, and that we apply an imperfect test with known sensitivity and specificity to a sample of individuals from the population. In this setting, we propose an enhanced inferential approach for use in conjunction with sampling-based bias-corrected prevalence estimation. While ignoring the finite nature of the population can yield markedly conservative estimates, direct application of a standard finite population correction (FPC) conversely leads to underestimation of variance. We uncover a way to leverage the typical FPC indirectly toward valid statistical inference. In particular, we derive a readily estimable extra variance component induced by misclassification in this specific but arguably common diagnostic testing scenario. Our approach yields a standard error estimate that properly captures the sampling variability of the usual bias-corrected maximum likelihood estimator of disease prevalence. Finally, we develop an adapted Bayesian credible interval for the true prevalence that offers improved frequentist properties (i.e., coverage and width) relative to a Wald-type confidence interval. We report the simulation results to demonstrate the enhanced performance of the proposed inferential methods.

Submitted 13 August, 2023; v1 submitted 7 February, 2023; originally announced February 2023.
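
The "bias-corrected prevalence estimation" mentioned in the abstract above can be illustrated with a minimal sketch. It assumes the estimator in question takes the standard Rogan-Gladen form, (apparent prevalence + specificity - 1) / (sensitivity + specificity - 1); the function name and arguments are hypothetical, and the paper's finite-population variance component and credible interval are not reproduced here.

```python
def bias_corrected_prevalence(n_positive, n_tested, sensitivity, specificity):
    """Rogan-Gladen style correction of an apparent (test-positive) prevalence.

    Illustrative sketch only: assumes known sensitivity and specificity and
    a simple random sample; not the authors' full inferential procedure.
    """
    if n_tested <= 0:
        raise ValueError("n_tested must be positive")
    denominator = sensitivity + specificity - 1.0
    if denominator <= 0:
        raise ValueError("test must be informative: sensitivity + specificity > 1")
    apparent = n_positive / n_tested
    estimate = (apparent + specificity - 1.0) / denominator
    # Truncate to the valid range, since the raw correction can fall outside [0, 1].
    return min(1.0, max(0.0, estimate))


if __name__ == "__main__":
    # 120 positives among 1000 tested, with 90% sensitivity and 95% specificity.
    print(bias_corrected_prevalence(120, 1000, 0.90, 0.95))
```

The truncation step is a common practical convention rather than something stated in the abstract.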

arXiv:2301.13348 [pdf, other]

A Reinforcement Learning Framework for Dynamic Mediation Analysis

Authors: Lin Ge, Jitao Wang, Chengchun Shi, Zhenke Wu, Rui Song

Abstract: Mediation analysis learns the causal effect transmitted via mediator variables between treatments and outcomes and receives increasing attention in various scientific domains to elucidate causal relations. Most existing works focus on point-exposure studies where each subject only receives one treatment at a single time point. However, there are a number of applications (e.g., mobile health) where the treatments are sequentially assigned over time and the dynamic mediation effects are of primary interest. Proposing a reinforcement learning (RL) framework, we are the first to evaluate dynamic mediation effects in settings with infinite horizons. We decompose the average treatment effect into an immediate direct effect, an immediate mediation effect, a delayed direct effect, and a delayed mediation effect. Upon the identification of each effect component, we further develop robust and semi-parametrically efficient estimators under the RL framework to infer these causal effects. The superior performance of the proposed method is demonstrated through extensive numerical studies, theoretical results, and an analysis of a mobile health dataset.

Submitted 2 September, 2023; v1 submitted 30 January, 2023; originally announced January 2023.

arXiv:2212.04911 [pdf]

doi 10.1093/aje/kwad177

A Design and Analytic Strategy for Monitoring Disease Positivity and Case Characteristics in Accessible Closed Populations

Authors: Robert H. Lyles, Yuzi Zhang, Lin Ge, Lance A. Waller

Abstract: We propose a monitoring strategy for efficient and robust estimation of disease prevalence and case numbers within closed and enumerated populations such as schools, workplaces, or retirement communities. The proposed design relies largely on voluntary testing, notoriously biased (e.g., in the case of COVID-19) due to non-representative sampling. The approach yields unbiased and comparatively precise estimates with no assumptions about factors underlying selection of individuals for voluntary testing, building on the strength of what can be a small random sampling component. This component unlocks a previously proposed "anchor stream" estimator, a well-calibrated alternative to classical capture-recapture (CRC) estimators based on two data streams. We show here that this estimator is equivalent to a direct standardization based on "capture", i.e., selection (or not) by the voluntary testing program, made possible by means of a key parameter identified by design. This equivalency simultaneously allows for novel two-stream CRC-like estimation of general means (e.g., of continuous variables such as antibody levels or biomarkers). For inference, we propose adaptations of a Bayesian credible interval when estimating case counts and bootstrapping when estimating means of continuous variables. We use simulations to demonstrate significant precision benefits relative to random sampling alone.

Submitted 9 December, 2022; originally announced December 2022.

arXiv:2211.13842 [pdf, ps, other]

doi 10.1002/sim.9759

Tailoring Capture-Recapture Methods to Estimate Registry-Based Case Counts Based on Error-Prone Diagnostic Signals

Authors: Lin Ge, Yuzi Zhang, Kevin C. Ward, Timothy L. Lash, Lance A. Waller, Robert H. Lyles

Abstract: Surveillance research is of great importance for effective and efficient epidemiological monitoring of case counts and disease prevalence. Taking specific motivation from ongoing efforts to identify recurrent cases based on the Georgia Cancer Registry, we extend recently proposed "anchor stream" sampling design and estimation methodology. Our approach offers a more efficient and defensible alternative to traditional capture-recapture (CRC) methods by leveraging a relatively small random sample of participants whose recurrence status is obtained through a principled application of medical records abstraction. This sample is combined with one or more existing signaling data streams, which may yield data based on arbitrarily non-representative subsets of the full registry population. The key extension developed here accounts for the common problem of false positive or negative diagnostic signals from the existing data stream(s). In particular, we show that the design only requires documentation of positive signals in these non-anchor surveillance streams, and permits valid estimation of the true case count based on an estimable positive predictive value (PPV) parameter. We borrow ideas from the multiple imputation paradigm to provide accompanying standard errors, and develop an adapted Bayesian credible interval approach that yields favorable frequentist coverage properties. We demonstrate the benefits of the proposed methods through simulation studies, and provide a data example targeting estimation of the breast cancer recurrence case count among Metro Atlanta area patients from the Georgia Cancer Registry-based Cancer Recurrence Information and Surveillance Program (CRISP) database.

Submitted 24 November, 2022; originally announced November 2022.

arXiv:2202.12819 [pdf, other]

Exploratory Hidden Markov Factor Models for Longitudinal Mobile Health Data: Application to Adverse Posttraumatic Neuropsychiatric Sequelae

Authors: Lin Ge, Xinming An, Donglin Zeng, Samuel McLean, Ronald Kessler, Rui Song

Abstract: Adverse posttraumatic neuropsychiatric sequelae (APNS) are common among veterans and millions of Americans after traumatic exposures, resulting in substantial burdens for trauma survivors and society. Despite numerous studies conducted on APNS over the past decades, there has been limited progress in understanding the underlying neurobiological mechanisms due to several unique challenges. One of these challenges is the reliance on subjective self-report measures to assess APNS, which can easily result in measurement errors and biases (e.g., recall bias). To mitigate this issue, in this paper, we investigate the potential of leveraging the objective longitudinal mobile device data to identify homogeneous APNS states and study the dynamic transitions and potential risk factors of APNS after trauma exposure. To handle specific challenges posed by longitudinal mobile device data, we developed exploratory hidden Markov factor models and designed a Stabilized Expectation-Maximization algorithm for parameter estimation. Simulation studies were conducted to evaluate the performance of parameter estimation and model selection. Finally, to demonstrate the practical utility of the method, we applied it to mobile device data collected from the Advancing Understanding of RecOvery afteR traumA (AURORA) study.

Submitted 4 June, 2023; v1 submitted 25 February, 2022; originally announced February 2022.

arXiv:2101.04783 [pdf, other]

Variable bandwidth kernel regression estimation

Authors: Janet Nakarmi, Hailin Sang, Lin Ge

Abstract: In this paper we propose a variable bandwidth kernel regression estimator for $i.i.d.$ observations in $\mathbb{R}^2$ to improve the classical Nadaraya-Watson estimator. The bias is improved to the order of $O(h_n^4)$ under the condition that the fifth order derivative of the density function and the sixth order derivative of the regression function are bounded and continuous. We also establish the central limit theorems for the proposed ideal and true variable kernel regression estimators. The simulation study confirms our results and demonstrates the advantage of the variable bandwidth kernel method over the classical kernel method.

Submitted 12 January, 2021; originally announced January 2021.

Comments: accepted by ESAIM: PS. 36 pages, 3 figures

MSC Class: 62G07; 62E20; 62H12
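
For readers unfamiliar with the classical Nadaraya-Watson estimator that the abstract above takes as its baseline, a minimal sketch follows. It covers only the fixed-bandwidth classical estimator with a Gaussian kernel; the kernel choice, function name, and arguments are assumptions for illustration, and the paper's variable-bandwidth construction is not reproduced.

```python
import math


def nadaraya_watson(x_train, y_train, x0, bandwidth):
    """Classical fixed-bandwidth Nadaraya-Watson estimate of E[Y | X = x0].

    Illustrative sketch with a Gaussian kernel; this is the baseline
    estimator, not the variable-bandwidth estimator proposed in the paper.
    """
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    if len(x_train) != len(y_train) or len(x_train) == 0:
        raise ValueError("x_train and y_train must be non-empty and equal length")
    # Kernel weight of each observation, based on its scaled distance to x0.
    weights = [math.exp(-0.5 * ((x0 - xi) / bandwidth) ** 2) for xi in x_train]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("all kernel weights underflowed; try a larger bandwidth")
    # Locally weighted average of the responses.
    return sum(w * yi for w, yi in zip(weights, y_train)) / total


if __name__ == "__main__":
    xs = [0.0, 0.5, 1.0, 1.5, 2.0]
    ys = [0.1, 0.6, 0.9, 1.4, 2.1]
    print(nadaraya_watson(xs, ys, 1.0, 0.4))
```

The estimate is a weighted average of the observed responses, so a smaller bandwidth follows the data more closely while a larger one smooths more heavily.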

arXiv:2009.09161 [pdf, other]

Label-Based Diversity Measure Among Hidden Units of Deep Neural Networks: A Regularization Method

Authors: Chenguang Zhang, Yuexian Hou, Dawei Song, Liangzhu Ge, Yaoshuai Yao

Abstract: Although the deep structure guarantees the powerful expressivity of deep networks (DNNs), it also triggers serious overfitting problem. To improve the generalization capacity of DNNs, many strategies were developed to improve the diversity among hidden units. However, most of these strategies are empirical and heuristic in absence of either a theoretical derivation of the diversity measure or a clear connection from the diversity to the generalization capacity. In this paper, from an information theoretic perspective, we introduce a new definition of redundancy to describe the diversity of hidden units under supervised learning settings by formalizing the effect of hidden layers on the generalization capacity as the mutual information. We prove an opposite relationship existing between the defined redundancy and the generalization capacity, i.e., the decrease of redundancy generally improving the generalization capacity. The experiments show that the DNNs using the redundancy as the regularizer can effectively reduce the overfitting and decrease the generalization error, which well supports above points.

Submitted 3 April, 2021; v1 submitted 19 September, 2020; originally announced September 2020.

Showing 1–13 of 13 results for author: Ge, L