doi 10.1093/jrsssc/qlae010

Testing unit root non-stationarity in the presence of missing data in univariate time series of mobile health studies

Authors: Charlotte Fowler, Xiaoxuan Cai, Justin T. Baker, Jukka-Pekka Onnela, Linda Valeri

Abstract: The use of digital devices to collect data in mobile health (mHealth) studies introduces a novel application of time series methods, with the constraint of potential data missing at random (MAR) or missing not at random (MNAR). In time series analysis, testing for stationarity is an important preliminary step to inform appropriate later analyses. The augmented Dickey-Fuller (ADF) test was developed to test the null hypothesis of unit root non-stationarity, under no missing data. Beyond recommendations under data missing completely at random (MCAR) for complete case analysis or last observation carry forward imputation, researchers have not extended unit root non-stationarity testing to a context with more complex missing data mechanisms. Multiple imputation with chained equations, Kalman smoothing imputation, and linear interpolation have also been proposed for time series data; however, such methods impose constraints on the autocorrelation structure, and thus impact unit root testing. We propose maximum likelihood estimation and multiple imputation using state space model approaches to adapt the ADF test to a context with missing data. We further develop sensitivity analysis techniques to examine the impact of MNAR data. We evaluate the performance of existing and proposed methods across different missing mechanisms in extensive simulations and in their application to a multi-year smartphone study of bipolar patients.

Submitted 10 October, 2022; originally announced October 2022.

arXiv:2206.14343

State space model multiple imputation for missing data in non-stationary multivariate time series with application in digital Psychiatry

Authors: Xiaoxuan Cai, Xinru Wang, Li Zeng, Habiballah Rahimi Eichi, Dost Ongur, Lisa Dixon, Justin T. Baker, Jukka-Pekka Onnela, Linda Valeri

Abstract: Mobile technology enables unprecedented continuous monitoring of an individual's behavior, social interactions, symptoms, and other health conditions, presenting an enormous opportunity for therapeutic advancements and scientific discoveries regarding the etiology of psychiatric illness. Continuous collection of mobile data results in the generation of a new type of data: entangled multivariate time series of outcome, exposure, and covariates. Missing data is a pervasive problem in biomedical and social science research, and the Ecological Momentary Assessment (EMA) using mobile devices in psychiatric research is no exception. However, the complex structure of multivariate time series introduces new challenges in handling missing data for proper causal inference. Data imputation is commonly recommended to enhance data utility and estimation efficiency. The majority of available imputation methods are either designed for longitudinal data with limited follow-up times or for stationary time series, which are incompatible with potentially non-stationary time series. In the field of psychiatry, non-stationary data are frequently encountered as symptoms and treatment regimens may experience dramatic changes over time. To address missing data in possibly non-stationary multivariate time series, we propose a novel multiple imputation strategy based on the state space model (SSMmp) and a more computationally efficient variant (SSMimpute). We demonstrate their advantages over other widely used missing data strategies by evaluating their theoretical properties and empirical performance in simulations of both stationary and non-stationary time series, subject to various missing mechanisms. We apply the SSMimpute to investigate the association between social network size and negative mood using a multi-year observational smartphone study of bipolar patients, controlling for confounding variables.

Submitted 12 April, 2023; v1 submitted 28 June, 2022; originally announced June 2022.

arXiv:1806.07137

Large-Scale Stochastic Sampling from the Probability Simplex

Authors: Jack Baker, Paul Fearnhead, Emily B. Fox, Christopher Nemeth

Abstract: Stochastic gradient Markov chain Monte Carlo (SGMCMC) has become a popular method for scalable Bayesian inference. These methods are based on sampling a discrete-time approximation to a continuous-time process, such as the Langevin diffusion. When applied to distributions defined on a constrained space, the time-discretization error can dominate when we are near the boundary of the space. We demonstrate that because of this, current SGMCMC methods for the simplex struggle with sparse simplex spaces, where many of the components are close to zero. Unfortunately, many popular large-scale Bayesian models, such as network or topic models, require inference on sparse simplex spaces. To avoid the biases caused by this discretization error, we propose the stochastic Cox-Ingersoll-Ross process (SCIR), which removes all discretization error, and we prove that samples from the SCIR process are asymptotically unbiased. We discuss how this idea can be extended to target other constrained spaces. Use of the SCIR process within a SGMCMC algorithm is shown to give substantially better performance for a topic model and a Dirichlet process mixture model than existing SGMCMC approaches.

Submitted 26 October, 2018; v1 submitted 19 June, 2018; originally announced June 2018.

Comments: Accepted to Advances in Neural Information Processing Systems (2018)

arXiv:1710.00578

sgmcmc: An R Package for Stochastic Gradient Markov Chain Monte Carlo

Authors: Jack Baker, Paul Fearnhead, Emily B. Fox, Christopher Nemeth

Abstract: This paper introduces the R package sgmcmc, which can be used for Bayesian inference on problems with large datasets using stochastic gradient Markov chain Monte Carlo (SGMCMC). Traditional Markov chain Monte Carlo (MCMC) methods, such as Metropolis-Hastings, are known to run prohibitively slowly as the dataset size increases. SGMCMC solves this issue by only using a subset of data at each iteration. SGMCMC requires calculating gradients of the log likelihood and log priors, which can be time consuming and error prone to perform by hand. The sgmcmc package calculates these gradients itself using automatic differentiation, making the implementation of these methods much easier. To do this, the package uses the software library TensorFlow, which has a variety of statistical distributions and mathematical operations as standard, meaning a wide class of models can be built using this framework. SGMCMC has become widely adopted in the machine learning literature, but less so in the statistics community. We believe this may be partly due to lack of software; this package aims to bridge this gap.

Submitted 13 April, 2018; v1 submitted 2 October, 2017; originally announced October 2017.

arXiv:1706.05439

Control Variates for Stochastic Gradient MCMC

Authors: Jack Baker, Paul Fearnhead, Emily B. Fox, Christopher Nemeth

Abstract: It is well known that Markov chain Monte Carlo (MCMC) methods scale poorly with dataset size. A popular class of methods for solving this issue is stochastic gradient MCMC. These methods use a noisy estimate of the gradient of the log posterior, which reduces the per-iteration computational cost of the algorithm. Despite this, there are a number of results suggesting that stochastic gradient Langevin dynamics (SGLD), probably the most popular of these methods, still has computational cost proportional to the dataset size. We suggest an alternative log posterior gradient estimate for stochastic gradient MCMC, which uses control variates to reduce the variance. We analyse SGLD using this gradient estimate, and show that, under log-concavity assumptions on the target distribution, the computational cost required for a given level of accuracy is independent of the dataset size. Next we show that a different control variate technique, known as zero variance control variates, can be applied to SGMCMC algorithms for free. This post-processing step improves the inference of the algorithm by reducing the variance of the MCMC output. Zero variance control variates rely on the gradient of the log posterior; we explore how the variance reduction is affected by replacing this with the noisy gradient estimate calculated by SGMCMC.

Submitted 14 December, 2017; v1 submitted 16 June, 2017; originally announced June 2017.

arXiv:1509.01228

doi 10.3847/0004-637X/818/1/55

Machine Learning Model of the Swift/BAT Trigger Algorithm for Long GRB Population Studies

Authors: Philip B Graff, Amy Y Lien, John G Baker, Takanori Sakamoto

Abstract: To draw inferences about gamma-ray burst (GRB) source populations based on Swift observations, it is essential to understand the detection efficiency of the Swift burst alert telescope (BAT). This study considers the problem of modeling the Swift/BAT triggering algorithm for long GRBs, a computationally expensive procedure, and models it using machine learning algorithms. A large sample of simulated GRBs from Lien 2014 is used to train various models: random forests, boosted decision trees (with AdaBoost), support vector machines, and artificial neural networks. The best models have accuracies of $\gtrsim97\%$ ($\lesssim 3\%$ error), which is a significant improvement on a cut in GRB flux, which has an accuracy of $89.6\%$ ($10.4\%$ error). These models are then used to measure the detection efficiency of Swift as a function of redshift $z$, which is used to perform Bayesian parameter estimation on the GRB rate distribution. We find a local GRB rate density of $n_0 \sim 0.48^{+0.41}_{-0.23} \ {\rm Gpc}^{-3} {\rm yr}^{-1}$ with power-law indices of $n_1 \sim 1.7^{+0.6}_{-0.5}$ and $n_2 \sim -5.9^{+5.7}_{-0.1}$ for GRBs above and below a break point of $z_1 \sim 6.8^{+2.8}_{-3.2}$. This methodology is able to improve upon earlier studies by more accurately modeling Swift detection and using this for fully Bayesian model fitting. The code used in this analysis is publicly available online (https://github.com/PBGraff/SwiftGRB_PEanalysis).

Submitted 8 February, 2016; v1 submitted 3 September, 2015; originally announced September 2015.

Comments: 16 pages, 18 figures, 5 tables, published by ApJ

Journal ref: ApJ, 818, 55 (2016)

Showing 1–6 of 6 results for author: Baker, J