Search | arXiv e-print repository

arXiv:2006.07305 [pdf, other]

Reflection on modern methods: Good practices for applied statistical learning in epidemiology

Authors: Yanelli Nunez, Elizabeth A. Gibson, Eva M. Tanner, Chris Gennings, Brent A. Coull, Jeff A. Goldsmith, Marianthi-Anna Kioumourtzoglou

Abstract: Statistical learning (SL) includes methods that extract knowledge from complex data. SL methods beyond generalized linear models are being increasingly implemented in public health research and epidemiology because they can perform better in instances with complex or high-dimensional data---settings when traditional statistical methods fail. These novel methods, however, often include random sampling which may induce variability in results. Best practices in data science can help to ensure robustness. As a case study, we included four SL models that have been applied previously to analyze the relationship between environmental mixtures and health outcomes. We ran each model across 100 initializing values for random number generation, or "seeds," and assessed variability in resulting estimation and inference. All methods exhibited some seed-dependent variability in results. The degree of variability differed across methods and exposure of interest. Any SL method reliant on a random seed will exhibit some degree of seed sensitivity. We recommend that researchers repeat their analysis with various seeds as a sensitivity analysis when implementing these methods to enhance interpretability and robustness of results.

Submitted 2 October, 2020; v1 submitted 12 June, 2020; originally announced June 2020.

Comments: 19 pages, 5 figures, 1 table. For associated code, visit https://github.com/yanellinunez/Commentary-to-mixture-methods-paper

arXiv:2004.04734 [pdf, other]

Learning as We Go: An Examination of the Statistical Accuracy of COVID19 Daily Death Count Predictions

Authors: Roman Marchant, Noelle I. Samia, Ori Rosen, Martin A. Tanner, Sally Cripps

Abstract: This paper provides a formal evaluation of the predictive performance of a model (and its various updates) developed by the Institute for Health Metrics and Evaluation (IHME) for predicting daily deaths attributed to COVID19 for each state in the United States. The IHME models have received extensive attention in social and mass media, and have influenced policy makers at the highest levels of the United States government. For effective policy making the accurate assessment of uncertainty, as well as accurate point predictions, are necessary because the risks inherent in a decision must be taken into account, especially in the present setting of a novel disease affecting millions of lives. To assess the accuracy of the IHME models, we examine both forecast accuracy as well as the predictive performance of the 95% prediction intervals provided by the IHME models. We find that the initial IHME model underestimates the uncertainty surrounding the number of daily deaths substantially. Specifically, the true number of next day deaths fell outside the IHME prediction intervals as much as 70% of the time, in comparison to the expected value of 5%. In addition, we note that the performance of the initial model does not improve with shorter forecast horizons. Regarding the updated models, our analyses indicate that the later models do not show any improvement in the accuracy of the point estimate predictions. In fact, there is some evidence that this accuracy has actually decreased over the initial models. Moreover, when considering the updated models, while we observe a larger percentage of states having actual values lying inside the 95% prediction intervals (PI), our analysis suggests that this observation may be attributed to the widening of the PIs. The width of these intervals calls into question the usefulness of the predictions to drive policy making and resource allocation.

Submitted 24 May, 2020; v1 submitted 8 April, 2020; originally announced April 2020.

arXiv:1804.05803 [pdf, other]

doi 10.1017/pan.2019.19

Ecological Regression with Partial Identification

Authors: Wenxin Jiang, Gary King, Allen Schmaltz, Martin A. Tanner

Abstract: Ecological inference (EI) is the process of learning about individual behavior from aggregate data. We study a partially identified linear contextual effects model for EI and describe how to estimate the district level parameter averaging over many precincts in the presence of the non-identified parameter of the contextual effect. This may be regarded as a first attempt in this venerable literature to limit the scope of the key form of non-identifiability in EI. To study the operating characteristics of our model, we have amassed the largest collection of data with known ground truth ever applied to evaluate solutions to the EI problem. We collect and study 459 datasets from a variety of fields including public health, political science, and sociology. The datasets contain a total of 2,370,854 geographic units (e.g., precincts), with an average of 5,165 geographic units per dataset. Our replication data are publicly available via the Harvard Dataverse (Jiang et al. 2018) and may serve as a useful resource for future researchers. For all real data sets in our collection that fit our proposed rules, our approach reduces the width of the Duncan and Davis (1953) deterministic bound, on average, by about 45%, while still capturing the true district level parameter in excess of 97% of the time.

Submitted 23 April, 2018; v1 submitted 16 April, 2018; originally announced April 2018.

MSC Class: 62P25; 62J99

Journal ref: Polit. Anal. 28 (2020) 65-86

arXiv:1301.7390 [pdf]

Hierarchical Mixtures-of-Experts for Exponential Family Regression Models with Generalized Linear Mean Functions: A Survey of Approximation and Consistency Results

Authors: Wenxin Jiang, Martin A. Tanner

Abstract: We investigate a class of hierarchical mixtures-of-experts (HME) models where exponential family regression models with generalized linear mean functions of the form $\psi(\alpha + \mathbf{x}^T \boldsymbol{\beta})$ are mixed. Here $\psi(\cdot)$ is the inverse link function. Suppose the true response $y$ follows an exponential family regression model with mean function belonging to a class of smooth functions of the form $\psi(h(\mathbf{x}))$ where $h(\cdot) \in W_2^\infty$ (a Sobolev class over $[0,1]^s$). It is shown that the HME probability density functions can approximate the true density, at a rate of $O(m^{-2/s})$ in $L_p$ norm, and at a rate of $O(m^{-4/s})$ in Kullback-Leibler divergence. These rates can be achieved within the family of HME structures with no more than $s$ layers, where $s$ is the dimension of the predictor $\mathbf{x}$. It is also shown that likelihood-based inference based on HME is consistent in recovering the truth, in the sense that as the sample size $n$ and the number of experts $m$ both increase, the mean square error of the predicted mean response goes to zero. Conditions for such results to hold are stated and discussed.

Submitted 30 January, 2013; originally announced January 2013.

Comments: Appears in Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence (UAI1998)

Report number: UAI-P-1998-PG-296-303

arXiv:1104.2210 [pdf, ps, other]

doi 10.1214/10-STS341

From EM to Data Augmentation: The Emergence of MCMC Bayesian Computation in the 1980s

Authors: Martin A. Tanner, Wing H. Wong

Abstract: It was known from Metropolis et al. [J. Chem. Phys. 21 (1953) 1087--1092] that one can sample from a distribution by performing Monte Carlo simulation from a Markov chain whose equilibrium distribution is equal to the target distribution. However, it took several decades before the statistical community embraced Markov chain Monte Carlo (MCMC) as a general computational tool in Bayesian inference. The usual reasons that are advanced to explain why statisticians were slow to catch on to the method include lack of computing power and unfamiliarity with the early dynamic Monte Carlo papers in the statistical physics literature. We argue that there was a deeper reason, namely, that the structure of problems in the statistical mechanics and those in the standard statistical literature are different. To make the methods usable in standard Bayesian problems, one had to exploit the power that comes from the introduction of judiciously chosen auxiliary variables and collective moves. This paper examines the development in the critical period 1980--1990, when the ideas of Markov chain simulation from the statistical physics literature and the latent variable formulation in maximum likelihood computation (i.e., EM algorithm) came together to spark the widespread application of MCMC methods in Bayesian computation.

Submitted 12 April, 2011; originally announced April 2011.

Comments: Published in Statistical Science (http://www.imstat.org/sts/) by the Institute of Mathematical Statistics (http://www.imstat.org) at http://dx.doi.org/10.1214/10-STS341

Report number: IMS-STS-STS341

Journal ref: Statistical Science 2010, Vol. 25, No. 4, 506-516

arXiv:0810.5655 [pdf, ps, other]

doi 10.1214/07-AOS547

Gibbs posterior for variable selection in high-dimensional classification and data mining

Authors: Wenxin Jiang, Martin A. Tanner

Abstract: In the popular approach of "Bayesian variable selection" (BVS), one uses prior and posterior distributions to select a subset of candidate variables to enter the model. A completely new direction will be considered here to study BVS with a Gibbs posterior originating in statistical mechanics. The Gibbs posterior is constructed from a risk function of practical interest (such as the classification error) and aims at minimizing a risk function without modeling the data probabilistically. This can improve the performance over the usual Bayesian approach, which depends on a probability model which may be misspecified. Conditions will be provided to achieve good risk performance, even in the presence of high dimensionality, when the number of candidate variables "$K$" can be much larger than the sample size "$n$." In addition, we develop a convenient Markov chain Monte Carlo algorithm to implement BVS with the Gibbs posterior.

Submitted 31 October, 2008; originally announced October 2008.

Comments: Published in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org) at http://dx.doi.org/10.1214/07-AOS547

Report number: IMS-AOS-AOS547

MSC Class: 62F99 (Primary); 82-08 (Secondary)

Journal ref: Annals of Statistics 2008, Vol. 36, No. 5, 2207-2231

arXiv:0709.3545 [pdf, other]

Locally Adaptive Nonparametric Binary Regression

Authors: Sally Wood, Robert Kohn, Remy Cottet, Wenxin Jiang, Martin Tanner

Abstract: A nonparametric and locally adaptive Bayesian estimator is proposed for estimating a binary regression. Flexibility is obtained by modeling the binary regression as a mixture of probit regressions with the argument of each probit regression having a thin plate spline prior with its own smoothing parameter and with the mixture weights depending on the covariates. The estimator is compared to a single spline estimator and to a recently proposed locally adaptive estimator. The methodology is illustrated by applying it to both simulated and real examples.

Submitted 21 September, 2007; originally announced September 2007.

Comments: 31 pages, 10 figures

Showing 1–7 of 7 results for author: Tanner, M