Search | arXiv e-print repository

Evaluation of Missing Data Analytical Techniques in Longitudinal Research: Traditional and Machine Learning Approaches

Abstract: Missing Not at Random (MNAR) and nonnormal data are challenging to handle. Traditional missing data analytical techniques such as full information maximum likelihood estimation (FIML) may fail with nonnormal data as they are built on normal distribution assumptions. Two-Stage Robust Estimation (TSRE) does manage nonnormal data, but both FIML and TSRE are less explored in longitudinal studies under MNAR conditions with nonnormal distributions. Unlike traditional statistical approaches, machine learning approaches do not require distributional assumptions about the data. More importantly, they have shown promise for MNAR data; however, their application in longitudinal studies, addressing both Missing at Random (MAR) and MNAR scenarios, is also underexplored. This study utilizes Monte Carlo simulations to assess and compare the effectiveness of six analytical techniques for missing data within the growth curve modeling framework. These techniques include traditional approaches like FIML and TSRE, machine learning approaches by single imputation (K-Nearest Neighbors and missForest), and machine learning approaches by multiple imputation (micecart and miceForest). We investigate the influence of sample size, missing data rate, missing data mechanism, and data distribution on the accuracy and efficiency of model estimation. Our findings indicate that FIML is most effective for MNAR data among the tested approaches. TSRE excels in handling MAR data, while missForest is only advantageous in limited conditions with a combination of very skewed distributions, very large sample sizes (e.g., n larger than 1000), and low missing data rates.

Submitted 19 June, 2024; originally announced June 2024.

Comments: 47 pages, 3 tables, 8 figures

arXiv:2405.15090 [pdf, other]

Pure Exploration for Constrained Best Mixed Arm Identification with a Fixed Budget

Authors: Dengwang Tang, Rahul Jain, Ashutosh Nayyar, Pierluigi Nuzzo

Abstract: In this paper, we introduce the constrained best mixed arm identification (CBMAI) problem with a fixed budget. This is a pure exploration problem in a stochastic finite armed bandit model. Each arm is associated with a reward and multiple types of costs from unknown distributions. Unlike the unconstrained best arm identification problem, the optimal solution for the CBMAI problem may be a randomized mixture of multiple arms. The goal thus is to find the best mixed arm that maximizes the expected reward subject to constraints on the expected costs with a given learning budget $N$. We propose a novel, parameter-free algorithm, called the Score Function-based Successive Reject (SFSR) algorithm, that combines the classical successive reject framework with a novel score-function-based rejection criterion based on linear programming theory to identify the optimal support. We provide a theoretical upper bound on the mis-identification (of the support of the best mixed arm) probability and show that it decays exponentially in the budget $N$ and some constants that characterize the hardness of the problem instance. We also develop an information theoretic lower bound on the error probability that shows that these constants appropriately characterize the problem difficulty. We validate this empirically on a number of average and hard instances.

Submitted 23 May, 2024; originally announced May 2024.

Comments: 7 pages, 5 figures, 1 table

arXiv:2401.12937 [pdf, other]

doi 10.1080/10705511.2024.2351102

Are the Signs of Factor Loadings Arbitrary in Confirmatory Factor Analysis? Problems and Solutions

Authors: Dandan Tang, Steven M. Boker, Xin Tong

Abstract: The replication crisis in social and behavioral sciences has raised concerns about the reliability and validity of empirical studies. While research in the literature has explored contributing factors to this crisis, the issues related to analytical tools have received less attention. This study focuses on a widely used analytical tool - confirmatory factor analysis (CFA) - and investigates one issue that is typically overlooked in practice: accurately estimating factor-loading signs. Incorrect loading signs can distort the relationship between observed variables and latent factors, leading to unreliable or invalid results in subsequent analyses. Our study aims to investigate and address the estimation problem of factor-loading signs in CFA models. Based on an empirical demonstration and Monte Carlo simulation studies, we found current methods have drawbacks in estimating loading signs. To address this problem, three solutions are proposed and proven to work effectively. The applications of these solutions are discussed and elaborated.

Submitted 23 January, 2024; originally announced January 2024.

Comments: 35 pages, 3 figures, 8 tables

Journal ref: Structural Equation Modeling: A Multidisciplinary Journal 2024

arXiv:2312.17363 [pdf, other]

A Comparison of Full Information Maximum Likelihood and Machine Learning Missing Data Analytical Methods in Growth Curve Modeling

Authors: Dandan Tang, Xin Tong

Abstract: Missing data are inevitable in longitudinal studies. Traditional methods, such as the full information maximum likelihood (FIML), are commonly used to handle ignorable missing data. However, they may lead to biased model estimation due to missing not at random data that often appear in longitudinal studies. Recently, machine learning methods, such as random forests (RF) and K-nearest neighbors (KNN) imputation methods, have been proposed to cope with missing values. Although machine learning imputation methods have been gaining popularity, few studies have investigated the tenability and utility of these methods in longitudinal research. Through Monte Carlo simulations, this study evaluates and compares the performance of traditional and machine learning approaches (FIML, RF, and KNN) in growth curve modeling. The effects of sample size, the rate of missingness, and the missing data mechanism on model estimation are investigated. Results indicate that FIML is a better choice than the two machine learning imputation methods in terms of model estimation accuracy and efficiency.

Submitted 28 December, 2023; originally announced December 2023.

Comments: 8 pages, 2 figures; this proceeding was accepted by the Annual Meeting of the Psychometric Society

Journal ref: The Annual Meeting of the Psychometric Society 2023

arXiv:2310.11531 [pdf, ps, other]

Efficient Online Learning with Offline Datasets for Infinite Horizon MDPs: A Bayesian Approach

Authors: Dengwang Tang, Rahul Jain, Botao Hao, Zheng Wen

Abstract: In this paper, we study the problem of efficient online reinforcement learning in the infinite horizon setting when there is an offline dataset to start with. We assume that the offline dataset is generated by an expert but with unknown level of competence, i.e., it is not perfect and not necessarily using the optimal policy. We show that if the learning agent models the behavioral policy (parameterized by a competence parameter) used by the expert, it can do substantially better in terms of minimizing cumulative regret, than if it doesn't do that. We establish an upper bound on regret of the exact informed PSRL algorithm that scales as $\tilde{O}(\sqrt{T})$. This requires a novel prior-dependent regret analysis of Bayesian online learning algorithms for the infinite horizon setting. We then propose the Informed RLSVI algorithm to efficiently approximate the iPSRL algorithm.

Submitted 1 February, 2024; v1 submitted 17 October, 2023; originally announced October 2023.

Comments: 22 pages

MSC Class: 93E35

arXiv:2310.10107 [pdf, other]

Posterior Sampling-based Online Learning for Episodic POMDPs

Authors: Dengwang Tang, Dongze Ye, Rahul Jain, Ashutosh Nayyar, Pierluigi Nuzzo

Abstract: Learning in POMDPs is known to be significantly harder than in MDPs. In this paper, we consider the online learning problem for episodic POMDPs with unknown transition and observation models. We propose a Posterior Sampling-based reinforcement learning algorithm for POMDPs (PS4POMDPs), which is much simpler and more implementable compared to state-of-the-art optimism-based online learning algorithms for POMDPs. We show that the Bayesian regret of the proposed algorithm scales as the square root of the number of episodes, matching the lower bound, and is polynomial in the other parameters. In a general setting, its regret scales exponentially in the horizon length $H$, and we show that this is inevitable by providing a lower bound. However, when the POMDP is undercomplete and weakly revealing (a common assumption in the recent literature), we establish a polynomial Bayesian regret bound. We finally propose a posterior sampling algorithm for multi-agent POMDPs, and show it too has sublinear regret.

Submitted 23 May, 2024; v1 submitted 16 October, 2023; originally announced October 2023.

Comments: 32 pages, 4 figures

MSC Class: 93E35

arXiv:2304.01098 [pdf, other]

The synthetic instrument: From sparse association to sparse causation

Authors: Dingke Tang, Dehan Kong, Linbo Wang

Abstract: In many observational studies, researchers are often interested in studying the effects of multiple exposures on a single outcome. Standard approaches for high-dimensional data such as the lasso assume the associations between the exposures and the outcome are sparse. These methods, however, do not estimate the causal effects in the presence of unmeasured confounding. In this paper, we consider an alternative approach that assumes the causal effects in view are sparse. We show that with sparse causation, the causal effects are identifiable even with unmeasured confounding. At the core of our proposal is a novel device, called the synthetic instrument, that in contrast to standard instrumental variables, can be constructed using the observed exposures directly. We show that under linear structural equation models, the problem of causal effect estimation can be formulated as an $\ell_0$-penalization problem, and hence can be solved efficiently using off-the-shelf software. Simulations show that our approach outperforms state-of-the-art methods in both low-dimensional and high-dimensional settings. We further illustrate our method using a mouse obesity dataset.

Submitted 3 April, 2023; originally announced April 2023.

arXiv:2303.01954 [pdf, other]

Synthetic Data Generator for Adaptive Interventions in Global Health

Authors: Aditya Rastogi, Juan Francisco Garamendi, Ana Fernández del Río, Anna Guitart, Moiz Hassan Khan, Dexian Tang, África Periáñez

Abstract: Artificial Intelligence and digital health have the potential to transform global health. However, having access to representative data to test and validate algorithms in realistic production environments is essential. We introduce HealthSyn, an open-source synthetic data generator of user behavior for testing reinforcement learning algorithms in the context of mobile health interventions. The generator utilizes Markov processes to generate diverse user actions, with individual user behavioral patterns that can change in reaction to personalized interventions (i.e., reminders, recommendations, and incentives). These actions are translated into actual logs using an ML-purposed data schema specific to the mobile health application functionality included with HealthKit, an open-source SDK. The logs can be fed to pipelines to obtain user metrics. The generated data, which is based on real-world behaviors and simulation techniques, can be used to develop, test, and evaluate both ML algorithms in research and end-to-end operational RL-based intervention delivery frameworks.

Submitted 27 April, 2023; v1 submitted 3 March, 2023; originally announced March 2023.

arXiv:2012.05849 [pdf, ps, other]

The Promises of Parallel Outcomes

Authors: Ying Zhou, Dingke Tang, Dehan Kong, Linbo Wang

Abstract: A key challenge in causal inference from observational studies is the identification and estimation of causal effects in the presence of unmeasured confounding. In this paper, we introduce a novel approach for causal inference that leverages information in multiple outcomes to deal with unmeasured confounding. The key assumption in our approach is conditional independence among multiple outcomes. In contrast to existing proposals in the literature, the roles of multiple outcomes in our key identification assumption are symmetric, hence the name parallel outcomes. We show nonparametric identifiability with at least three parallel outcomes and provide parametric estimation tools under a set of linear structural equation models. Our proposal is evaluated through a set of synthetic and real data analyses.

Submitted 14 October, 2022; v1 submitted 10 December, 2020; originally announced December 2020.

arXiv:2007.14190 [pdf, other]

Ultra-high Dimensional Variable Selection for Doubly Robust Causal Inference

Authors: Dingke Tang, Dehan Kong, Wenliang Pan, Linbo Wang

Abstract: Causal inference has been increasingly reliant on observational studies with rich covariate information. To build tractable causal procedures, such as the doubly robust estimators, it is imperative to first extract important features from high or even ultra-high dimensional data. In this paper, we propose causal ball screening for confounder selection from modern ultra-high dimensional data sets. Unlike the familiar task of variable selection for prediction modeling, our confounder selection procedure aims to control for confounding while improving efficiency in the resulting causal effect estimate. Previous empirical and theoretical studies suggest excluding causes of the treatment that are not confounders. Motivated by these results, our goal is to keep all the predictors of the outcome in both the propensity score and outcome regression models. A distinctive feature of our proposal is that we use an outcome model-free procedure for propensity score model selection, thereby maintaining double robustness in the resulting causal effect estimator. Our theoretical analyses show that the proposed procedure enjoys a number of properties, including model selection consistency and point-wise normality. Synthetic and real data analyses show that our proposal performs favorably compared with existing methods in a range of realistic settings. Data used in preparation of this article were obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database.

Submitted 6 February, 2022; v1 submitted 28 July, 2020; originally announced July 2020.

Comments: To appear in Biometrics

arXiv:1906.06419 [pdf, other]

Learning Correlated Latent Representations with Adaptive Priors

Authors: Da Tang, Dawen Liang, Nicholas Ruozzi, Tony Jebara

Abstract: Variational Auto-Encoders (VAEs) have been widely applied for learning compact, low-dimensional latent representations of high-dimensional data. When the correlation structure among data points is available, previous work proposed Correlated Variational Auto-Encoders (CVAEs), which employ a structured mixture model as prior and a structured variational posterior for each mixture component to enforce that the learned latent representations follow the same correlation structure. However, as we demonstrate in this work, such a choice cannot guarantee that CVAEs capture all the correlations. Furthermore, it prevents us from obtaining a tractable joint and marginal variational distribution. To address these issues, we propose Adaptive Correlated Variational Auto-Encoders (ACVAEs), which apply an adaptive prior distribution that can be adjusted during training and can learn a tractable joint variational distribution. Its tractable form also enables further refinement with belief propagation. Experimental results on link prediction and hierarchical clustering show that ACVAEs significantly outperform CVAEs among other benchmarks.

Submitted 18 December, 2019; v1 submitted 14 June, 2019; originally announced June 2019.

Comments: 16 pages, 1 figure, 5 tables

arXiv:1905.05335 [pdf, other]

Correlated Variational Auto-Encoders

Authors: Da Tang, Dawen Liang, Tony Jebara, Nicholas Ruozzi

Abstract: Variational Auto-Encoders (VAEs) are capable of learning latent representations for high dimensional data. However, due to the i.i.d. assumption, VAEs only optimize the singleton variational distributions and fail to account for the correlations between data points, which might be crucial for learning latent representations from a dataset where a priori we know correlations exist. We propose Correlated Variational Auto-Encoders (CVAEs) that can take the correlation structure into consideration when learning latent representations with VAEs. CVAEs apply a prior based on the correlation structure. To address the intractability introduced by the correlated prior, we develop an approximation by the average of a set of tractable lower bounds over all maximal acyclic subgraphs of the undirected correlation graph. Experimental results on matching and link prediction on public benchmark rating datasets and spectral clustering on a synthetic dataset show the effectiveness of the proposed method over baseline algorithms.

Submitted 17 April, 2020; v1 submitted 13 May, 2019; originally announced May 2019.

Comments: International Conference on Machine Learning (ICML), 2019

arXiv:1903.02984 [pdf, other]

The Variational Predictive Natural Gradient

Authors: Da Tang, Rajesh Ranganath

Abstract: Variational inference transforms posterior inference into parametric optimization thereby enabling the use of latent variable models where otherwise impractical. However, variational inference can be finicky when different variational parameters control variables that are strongly correlated under the model. Traditional natural gradients based on the variational approximation fail to correct for correlations when the approximation is not the true posterior. To address this, we construct a new natural gradient called the Variational Predictive Natural Gradient (VPNG). Unlike traditional natural gradients for variational inference, this natural gradient accounts for the relationship between model parameters and variational parameters. We demonstrate the insight with a simple example as well as the empirical value on a classification task, a deep generative model of images, and probabilistic matrix factorization for recommendation.

Submitted 29 November, 2019; v1 submitted 7 March, 2019; originally announced March 2019.

Comments: International Conference on Machine Learning (ICML), 2019

arXiv:1807.06651 [pdf, other]

doi 10.1145/3270323.327032

Item Recommendation with Variational Autoencoders and Heterogenous Priors

Authors: Giannis Karamanolakis, Kevin Raji Cherian, Ananth Ravi Narayan, Jie Yuan, Da Tang, Tony Jebara

Abstract: In recent years, Variational Autoencoders (VAEs) have been shown to be highly effective in both standard collaborative filtering applications and extensions such as incorporation of implicit feedback. We extend VAEs to collaborative filtering with side information, for instance when ratings are combined with explicit text feedback from the user. Instead of using a user-agnostic standard Gaussian prior, we incorporate user-dependent priors in the latent VAE space to encode users' preferences as functions of the review text. Taking into account both the rating and the text information to represent users in this multimodal latent space is promising to improve recommendation quality. Our proposed model is shown to outperform the existing VAE models for collaborative filtering (up to 29.41% relative improvement in ranking metric) along with other baselines that incorporate both user ratings and text for item recommendation.

Submitted 6 October, 2018; v1 submitted 17 July, 2018; originally announced July 2018.

Comments: Accepted for the 3rd Workshop on Deep Learning for Recommender Systems (DLRS 2018), held in conjunction with the 12th ACM Conference on Recommender Systems (RecSys 2018) in Vancouver, Canada

arXiv:1611.00838 [pdf, other]

Initialization and Coordinate Optimization for Multi-way Matching

Authors: Da Tang, Tony Jebara

Abstract: We consider the problem of consistently matching multiple sets of elements to each other, which is a common task in fields such as computer vision. To solve the underlying NP-hard objective, existing methods often relax or approximate it, but end up with unsatisfying empirical performance due to a misaligned objective. We propose a coordinate update algorithm that directly optimizes the target objective. By using pairwise alignment information to build an undirected graph and initializing the permutation matrices along the edges of its Maximum Spanning Tree, our algorithm successfully avoids bad local optima. Theoretically, with high probability our algorithm guarantees an optimal solution under reasonable noise assumptions. Empirically, our algorithm consistently and significantly outperforms existing methods on several benchmark tasks on real datasets.

Submitted 18 July, 2019; v1 submitted 2 November, 2016; originally announced November 2016.

Comments: Artificial Intelligence and Statistics (AISTATS), 2017

arXiv:1512.06273 [pdf, ps, other]

A Marked Cox Model for IBNR Claims: Model and Theory

Authors: Andrei L. Badescu, X. Sheldon Lin, Dameng Tang

Abstract: Incurred but not reported (IBNR) loss reserving is an important issue for Property & Casualty (P&C) insurers. The modeling of the claim arrival process, especially its temporal dependence, has not been closely examined in many of the current loss reserving models. In this paper, we propose modeling the claim arrival process together with its reporting delays as a marked Cox process. Our model is versatile in modeling temporal dependence, allowing also for natural interpretations. This paper focuses mainly on the theoretical aspects of the proposed model. We show that the associated reported claim process and IBNR claim process are both marked Cox processes with easily convertible intensity functions and marking distributions. The proposed model can also account for fluctuations in the exposure. By an order statistics property, we show that the corresponding discretely observed process preserves all the information about the claim arrival epochs. Finally, we derive closed-form expressions for both the autocorrelation function (ACF) and the distributions of the numbers of reported claims and IBNR claims. Model estimation and its applications are considered in a subsequent paper, Badescu et al. (2015b).

Submitted 19 December, 2015; originally announced December 2015.

Comments: 25 pages, working paper

Showing 1–16 of 16 results for author: Tang, D