-
What's the Weight? Estimating Controlled Outcome Differences in Complex Surveys for Health Disparities Research
Authors:
Stephen Salerno,
Emily K. Roberts,
Belinda L. Needham,
Tyler H. McCormick,
Bhramar Mukherjee,
Xu Shi
Abstract:
A basic descriptive question in statistics often asks whether there are differences in mean outcomes between groups based on levels of a discrete covariate (e.g., racial disparities in health outcomes). However, when this categorical covariate of interest is correlated with other factors related to the outcome, direct comparisons may lead to biased estimates and invalid inferential conclusions without appropriate adjustment. Propensity score methods are broadly employed with observational data as a tool to achieve covariate balance, but how to implement them in complex surveys is less studied - in particular, when the survey weights depend on the group variable under comparison. In this work, we focus on a specific example when sample selection depends on race. We propose identification formulas to properly estimate the average controlled difference (ACD) in outcomes between Black and White individuals, with appropriate weighting for covariate imbalance across the two racial groups and generalizability. Via extensive simulation, we show that our proposed methods outperform traditional analytic approaches in terms of bias, mean squared error, and coverage. We are motivated by the interplay between race and social determinants of health when estimating racial differences in telomere length using data from the National Health and Nutrition Examination Survey. We build a propensity for race to properly adjust for other social determinants while characterizing the controlled effect of race on telomere length. We find that evidence of racial differences in telomere length between Black and White individuals attenuates after accounting for confounding by socioeconomic factors and after utilizing appropriate propensity score and survey weighting techniques. Software to implement these methods can be found in the R package svycdiff at https://github.com/salernos/svycdiff.
Submitted 27 June, 2024;
originally announced June 2024.
-
Model-Based Inference and Experimental Design for Interference Using Partial Network Data
Authors:
Steven Wilkins Reeves,
Shane Lubold,
Arun G. Chandrasekhar,
Tyler H. McCormick
Abstract:
The stable unit treatment value assumption states that the outcome of an individual is not affected by the treatment statuses of others; however, in many real-world applications, treatments can have an effect on many others beyond the immediately treated. Interference can generically be thought of as mediated through some network structure. In many empirically relevant situations, however, complete network data (required to adjust for these spillover effects) are too costly or logistically infeasible to collect. Partially or indirectly observed network data (e.g., subsamples, aggregated relational data (ARD), egocentric sampling, or respondent-driven sampling) reduce the logistical and financial burden of collecting network data, but the statistical properties of treatment effect adjustments from these design strategies are only beginning to be explored. In this paper, we present a framework for the estimation and inference of treatment effect adjustments using partial network data through the lens of structural causal models. We also illustrate procedures to assign treatments using only partial network data, with the goal of either minimizing estimator variance or optimally seeding. We derive single network asymptotic results applicable to a variety of choices for an underlying graph model. We validate our approach using simulated experiments on observed graphs with applications to information diffusion in India and Malawi.
Submitted 17 June, 2024;
originally announced June 2024.
-
From Narratives to Numbers: Valid Inference Using Language Model Predictions from Verbal Autopsy Narratives
Authors:
Shuxian Fan,
Adam Visokay,
Kentaro Hoffman,
Stephen Salerno,
Li Liu,
Jeffrey T. Leek,
Tyler H. McCormick
Abstract:
In settings where most deaths occur outside the healthcare system, verbal autopsies (VAs) are a common tool to monitor trends in causes of death (COD). VAs are interviews with a surviving caregiver or relative that are used to predict the decedent's COD. Turning VAs into actionable insights for researchers and policymakers requires two steps: (i) predicting likely COD using the VA interview and (ii) performing inference with predicted CODs (e.g., modeling the breakdown of causes by demographic factors using a sample of deaths). In this paper, we develop a method for valid inference using outcomes (in our case COD) predicted from free-form text using state-of-the-art NLP techniques. This method, which we call multiPPI++, extends recent work in "prediction-powered inference" to multinomial classification. We leverage a suite of NLP techniques for COD prediction and, through empirical analysis of VA data, demonstrate the effectiveness of our approach in handling transportability issues. multiPPI++ recovers ground truth estimates, regardless of which NLP model produced predictions and regardless of whether they were produced by a more accurate predictor like GPT-4-32k or a less accurate predictor like KNN. Our findings demonstrate the practical importance of inference correction for public health decision-making and suggest that if inference tasks are the end goal, having a small amount of contextually relevant, high-quality labeled data is essential regardless of the NLP algorithm.
Submitted 2 April, 2024;
originally announced April 2024.
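The inference-correction idea in this abstract can be illustrated with the basic prediction-powered estimator of a single proportion. This is a minimal sketch of that simpler estimator, not multiPPI++ itself: the function name `ppi_proportion` and the toy labels are hypothetical, and the multinomial extension, variance estimation, and any tuning described in the paper are not reproduced here.

```python
def ppi_proportion(labeled_truth, labeled_pred, unlabeled_pred):
    """Basic prediction-powered estimate of a proportion (illustrative only).

    labeled_truth:  0/1 gold-standard labels for a small labeled sample
    labeled_pred:   0/1 model predictions for the same labeled sample
    unlabeled_pred: 0/1 model predictions for a large unlabeled sample
    """
    # Average prediction error on the labeled sample (the "rectifier").
    rectifier = sum(t - p for t, p in zip(labeled_truth, labeled_pred)) / len(labeled_truth)
    # Naive estimate that treats predictions as if they were the truth.
    naive = sum(unlabeled_pred) / len(unlabeled_pred)
    # Shift the naive estimate by the measured prediction error.
    return naive + rectifier


# Toy data (hypothetical): the model over-predicts the cause of interest.
estimate = ppi_proportion(
    labeled_truth=[1, 0, 1, 1],
    labeled_pred=[1, 1, 1, 1],
    unlabeled_pred=[1, 0, 0, 1, 1, 0, 1, 1],
)
```

In this sketch the labeled sample is what corrects the estimate: a predictor that errs in a consistent direction is pulled back by the rectifier term, which is one way to read the abstract's point that a small amount of relevant labeled data matters regardless of the NLP algorithm.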
-
Robustly estimating heterogeneity in factorial data using Rashomon Partitions
Authors:
Aparajithan Venkateswaran,
Anirudh Sankar,
Arun G. Chandrasekhar,
Tyler H. McCormick
Abstract:
Many statistical analyses, in both observational data and randomized control trials, ask: how does the outcome of interest vary with combinations of observable covariates? How do various drug combinations affect health outcomes, or how does technology adoption depend on incentives and demographics? Our goal is to partition this factorial space into "pools" of covariate combinations where the outcome differs across the pools (but not within a pool). Existing approaches (i) search for a single "optimal" partition under assumptions about the association between covariates or (ii) sample from the entire set of possible partitions. Both these approaches ignore the reality that, especially with correlation structure in covariates, many ways to partition the covariate space may be statistically indistinguishable, despite very different implications for policy or science. We develop an alternative perspective, called Rashomon Partition Sets (RPSs). Each item in the RPS partitions the space of covariates using a tree-like geometry. RPSs incorporate all partitions that have posterior values near the maximum a posteriori partition, even if they offer substantively different explanations, and do so using a prior that makes no assumptions about associations between covariates. This prior is the $\ell_0$ prior, which we show is minimax optimal. Given the RPS we calculate the posterior of any measurable function of the feature effects vector on outcomes, conditional on being in the RPS. We also characterize approximation error relative to the entire posterior and provide bounds on the size of the RPS. Simulations demonstrate this framework allows for robust conclusions relative to conventional regularization techniques. We apply our method to three empirical settings: price effects on charitable giving, chromosomal structure (telomere length), and the introduction of microfinance.
Submitted 25 June, 2024; v1 submitted 2 April, 2024;
originally announced April 2024.
-
Bayesian analysis of verbal autopsy data using factor models with age- and sex-dependent associations between symptoms
Authors:
Tsuyoshi Kunihama,
Zehang Richard Li,
Samuel J. Clark,
Tyler H. McCormick
Abstract:
Verbal autopsies (VAs) are extensively used to investigate the population-level distributions of deaths by cause in low-resource settings without well-organized vital statistics systems. Computer-based methods are often adopted to assign causes of death to deceased individuals based on the interview responses of their family members or caregivers. In this article, we develop a new Bayesian approach that extracts information about cause-of-death distributions from VA data considering the age- and sex-related variation in the associations between symptoms. Its performance is compared with that of existing approaches using gold-standard data from the Population Health Metrics Research Consortium. In addition, we compute the relevance of predictors to causes of death based on information-theoretic measures.
Submitted 18 March, 2024;
originally announced March 2024.
-
Non-robustness of diffusion estimates on networks with measurement error
Authors:
Arun G. Chandrasekhar,
Paul Goldsmith-Pinkham,
Tyler H. McCormick,
Samuel Thau,
Jerry Wei
Abstract:
Network diffusion models are used to study things like disease transmission, information spread, and technology adoption. However, small amounts of mismeasurement are extremely likely in the networks constructed to operationalize these models. We show that estimates of diffusions are highly non-robust to this measurement error. First, we show that even when measurement error is vanishingly small, such that the share of missed links is close to zero, forecasts about the extent of diffusion will greatly underestimate the truth. Second, a small mismeasurement in the identity of the initial seed generates a large shift in the locations of the expected diffusion path. We show that both of these results still hold when the vanishing measurement error is only local in nature. Such non-robustness in forecasting exists even under conditions where the basic reproductive number is consistently estimable. Possible solutions, such as estimating the measurement error or implementing widespread detection efforts, still face difficulties because the number of missed links is so small. Finally, we conduct Monte Carlo simulations on simulated networks and on real networks from three settings: travel data from the COVID-19 pandemic in the western US, a mobile phone marketing campaign in rural India, and an insurance experiment in China.
Submitted 11 June, 2024; v1 submitted 8 March, 2024;
originally announced March 2024.
-
Do We Really Even Need Data?
Authors:
Kentaro Hoffman,
Stephen Salerno,
Awan Afiaz,
Jeffrey T. Leek,
Tyler H. McCormick
Abstract:
As artificial intelligence and machine learning tools become more accessible, and scientists face new obstacles to data collection (e.g., rising costs, declining survey response rates), researchers increasingly use predictions from pre-trained algorithms as outcome variables. Though appealing for financial and logistical reasons, using standard tools for inference can misrepresent the association between independent variables and the outcome of interest when the true, unobserved outcome is replaced by a predicted value. In this paper, we characterize the statistical challenges inherent to this so-called "inference with predicted data" problem and elucidate three potential sources of error: (i) the relationship between predicted outcomes and their true, unobserved counterparts, (ii) robustness of the machine learning model to resampling or uncertainty about the training data, and (iii) appropriately propagating not just bias but also uncertainty from predictions into the ultimate inference procedure.
Submitted 2 February, 2024; v1 submitted 14 January, 2024;
originally announced January 2024.
-
Feasible contact tracing
Authors:
Aparajithan Venkateswaran,
Jishnu Das,
Tyler H. McCormick
Abstract:
Contact tracing is one of the most important tools for preventing the spread of infectious diseases, but as the experience of COVID-19 showed, it is also next-to-impossible to implement when the disease is spreading rapidly. We show how to substantially improve the efficiency of contact tracing by combining standard microeconomic tools that measure heterogeneity in how infectious a sick person is with ideas from machine learning about sequential optimization. Our contributions are twofold. First, we incorporate heterogeneity in individual infectiousness in a multi-armed bandit to establish optimal algorithms. At the heart of this strategy is a focus on learning. In the typical conceptualization of contact tracing, contacts of an infected person are tested to find more infections. Under a learning-first framework, however, contacts of infected persons are tested to ascertain whether the infected person is likely to be a "high infector" and to find additional infections only if it is likely to be highly fruitful. Second, we demonstrate using three administrative contact tracing datasets from India and Pakistan during COVID-19 that this strategy improves efficiency. Using our algorithm, we find 80% of infections with just 40% of contacts, while current approaches test twice as many contacts to identify the same number of infections. We further show that a simple strategy that can be easily implemented in the field performs at nearly optimal levels, allowing for what we call feasible contact tracing. These results are immediately transferable to contact tracing in any epidemic.
Submitted 9 December, 2023;
originally announced December 2023.
-
Respondent-Driven Sampling: An Overview in the Context of Human Trafficking
Authors:
Jessica P. Kunke,
Adam Visokay,
Tyler H. McCormick
Abstract:
Respondent-driven sampling (RDS) is both a sampling strategy and an estimation method. It is commonly used to study individuals that are difficult to access with standard sampling techniques. As with any sampling strategy, RDS has advantages and challenges. This article examines recent work using RDS in the context of human trafficking. We begin with an overview of the RDS process and methodology, then discuss RDS in the particular context of trafficking. We end with a description of recent work and potential future directions.
Submitted 28 September, 2023;
originally announced September 2023.
-
General Covariance-Based Conditions for Central Limit Theorems with Dependent Triangular Arrays
Authors:
Arun G. Chandrasekhar,
Matthew O. Jackson,
Tyler H. McCormick,
Vydhourie Thiyageswaran
Abstract:
We present a general central limit theorem with simple, easy-to-check covariance-based sufficient conditions for triangular arrays of random vectors when all variables could be interdependent. The result is constructed from Stein's method, but the conditions are distinct from related work. We show that these covariance conditions nest standard assumptions studied in the literature such as $M$-dependence, mixing random fields, non-mixing autoregressive processes, and dependency graphs, which themselves need not imply each other. This permits researchers to work with high-level but intuitive conditions based on overall correlation instead of more complicated and restrictive conditions such as strong mixing in random fields that may not have any obvious micro-foundation. As examples of the implications, we show how the theorem implies asymptotic normality in estimating treatment effects with spillovers in more settings than previously admitted, covariance matrices, processes with global dependencies such as epidemic spread and information diffusion, and spatial processes with Matérn dependencies.
Submitted 14 December, 2023; v1 submitted 23 August, 2023;
originally announced August 2023.
-
Estimating and Correcting Degree Ratio Bias in the Network Scale-up Method
Authors:
Ian Laga,
Jessica P. Kunke,
Tyler H. McCormick,
Xiaoyue Niu
Abstract:
The Network Scale-up Method (NSUM) uses social networks and answers to "How many X's do you know?" questions to estimate sizes of groups excluded by standard surveys. This paper addresses the bias caused by varying average social network sizes across populations, commonly referred to as the degree ratio bias. This bias is especially important for marginalized populations like sex workers and drug users, where members tend to have smaller social networks than the average person. We show how the degree ratio affects size estimates and provide a method to estimate degree ratios without collecting additional data. We demonstrate that our adjustment procedure improves the accuracy of NSUM size estimates using simulations and data from two data sources.
Submitted 25 March, 2024; v1 submitted 7 May, 2023;
originally announced May 2023.
-
Comparing the Robustness of Simple Network Scale-Up Method (NSUM) Estimators
Authors:
Jessica P. Kunke,
Ian Laga,
Xiaoyue Niu,
Tyler H. McCormick
Abstract:
The network scale-up method (NSUM) is a cost-effective approach to estimating the size or prevalence of a group of people that is hard to reach through a standard survey. The basic NSUM involves two steps: estimating respondents' degrees by one of various methods (in this paper we focus on the probe group method which uses the number of people a respondent knows in various groups of known size), and estimating the prevalence of the hard-to-reach population of interest using respondents' estimated degrees and the number of people they report knowing in the hard-to-reach group. Each of these two steps involves taking either an average of ratios or a ratio of averages. Using the ratio of averages for each step has so far been the most common approach. However, we present theoretical arguments that using the average of ratios at the second, prevalence-estimation step often has lower mean squared error when the random mixing assumption is violated, which seems likely in practice; this estimator which uses the ratio of averages for degree estimates and the average of ratios for prevalence was proposed early in NSUM development but has largely been unexplored and unused. Simulation results using an example network data set also support these findings. Based on this theoretical and empirical evidence, we suggest that future surveys that use a simple estimator may want to use this mixed estimator, and estimation methods based on this estimator may produce new improvements.
Submitted 17 January, 2024; v1 submitted 13 March, 2023;
originally announced March 2023.
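The two aggregation choices contrasted in this abstract can be sketched in a few lines. The code below assumes the common probe-group formulation, in which a respondent's degree is the total population size times the number of people they know across the known groups, divided by the total size of those groups. The function names and the toy numbers are hypothetical and are not taken from the paper.

```python
def degrees_ratio_of_averages(known_counts, known_sizes, population):
    """Probe-group degree estimates using a ratio of sums over known groups.

    known_counts: one row per respondent, one count per known group
    known_sizes:  true sizes of the known groups
    population:   total population size
    """
    total_known_size = sum(known_sizes)
    return [population * sum(row) / total_known_size for row in known_counts]


def prevalence_ratio_of_averages(hidden_counts, degrees):
    """Prevalence as (total reported hidden-group ties) / (total degree)."""
    return sum(hidden_counts) / sum(degrees)


def prevalence_average_of_ratios(hidden_counts, degrees):
    """Prevalence as the mean of per-respondent ratios.

    Assumes every estimated degree is positive; a zero degree would need
    to be handled separately.
    """
    ratios = [y / d for y, d in zip(hidden_counts, degrees)]
    return sum(ratios) / len(ratios)


# Toy survey (hypothetical): 2 respondents, 2 known groups, population of 1000.
degrees = degrees_ratio_of_averages(
    known_counts=[[2, 4], [1, 2]], known_sizes=[100, 200], population=1000
)
hidden_counts = [1, 1]  # people each respondent knows in the hard-to-reach group
roa = prevalence_ratio_of_averages(hidden_counts, degrees)
mixed = prevalence_average_of_ratios(hidden_counts, degrees)
```

Here `mixed` corresponds to the mixed estimator the abstract describes: ratio of averages for the degree step, average of ratios for the prevalence step. On the toy data the two prevalence estimates differ because the respondents' degrees differ.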
-
Bayesian Age Category Reconciliation for Age- and Cause-specific Under-five Mortality Estimates
Authors:
Shuxian Fan,
Li Liu,
Jamie Perin,
Tyler H. McCormick
Abstract:
Age-disaggregated health data is crucial for effective public health planning and monitoring. Monitoring under-five mortality, for example, requires highly detailed age data since the distribution of potential causes of death varies substantially within the first few years of life. Comparative researchers often have to rely on multiple data sources, yet these sources often have ages aggregated at different levels, making it difficult to combine the data into a single, coherent picture. To address this challenge in the context of under-five cause-specific mortality, we propose a Bayesian approach that calibrates data with different age structures to produce unified and accurate estimates of the standardized age group distributions. We consider age-disaggregated death counts as fully-classified multinomial data and show that by incorporating partially-classified aggregated data, we can construct an improved Bayes estimator of the multinomial parameters under the Kullback-Leibler (KL) loss. We illustrate the method using both synthetic and real data, demonstrating that the proposed method achieves adequate performance in imputing incomplete classification. Finally, we present the results of numerical studies examining the conditions necessary for obtaining improved estimators. These studies provide insights and interpretations that can be used to aid future research and inform guidance for practitioners on appropriate levels of age disaggregation, with the aim of improving the accuracy and reliability of under-five cause-specific mortality estimates.
Submitted 21 February, 2023;
originally announced February 2023.
-
Bayesian Hyperbolic Multidimensional Scaling
Authors:
Bolun Liu,
Shane Lubold,
Adrian E. Raftery,
Tyler H. McCormick
Abstract:
Multidimensional scaling (MDS) is a widely used approach to representing high-dimensional, dependent data. MDS works by assigning each observation a location on a low-dimensional geometric manifold, with distance on the manifold representing similarity. We propose a Bayesian approach to multidimensional scaling when the low-dimensional manifold is hyperbolic. Using hyperbolic space facilitates representing tree-like structures common in many settings (e.g. text or genetic data with hierarchical structure). A Bayesian approach provides regularization that minimizes the impact of measurement error in the observed data and assesses uncertainty. We also propose a case-control likelihood approximation that allows for efficient sampling from the posterior distribution in larger data settings, reducing computational complexity from approximately $O(n^2)$ to $O(n)$. We evaluate the proposed method against state-of-the-art alternatives using simulations, canonical reference datasets, Indian village network data, and human gene expression data.
Submitted 15 August, 2023; v1 submitted 26 October, 2022;
originally announced October 2022.
-
The openVA Toolkit for Verbal Autopsies
Authors:
Zehang Richard Li,
Jason Thomas,
Eungang Choi,
Tyler H. McCormick,
Samuel J. Clark
Abstract:
Verbal autopsy (VA) is a survey-based tool widely used to infer cause of death (COD) in regions without complete-coverage civil registration and vital statistics systems. In such settings, many deaths happen outside of medical facilities and are not officially documented by a medical professional. VA surveys, consisting of signs and symptoms reported by a person close to the decedent, are used to infer the cause of death for an individual, and to estimate and monitor the cause of death distribution in the population. Several classification algorithms have been developed and widely used to assign cause of death using VA data. However, the incompatibility between different idiosyncratic model implementations and required data structures makes it difficult to systematically apply and compare different methods. The openVA package provides the first standardized framework for analyzing VA data that is compatible with all openly available methods and data structures. It provides an open-source R implementation of several of the most widely used VA methods. It supports different data input and output formats, and customizable information about the associations between causes and symptoms. The paper discusses the relevant algorithms and their implementations in R packages under the openVA suite, and demonstrates the pipeline of model fitting, summary, comparison, and visualization in the R environment.
Submitted 1 October, 2022; v1 submitted 16 September, 2021;
originally announced September 2021.
-
Spectral goodness-of-fit tests for complete and partial network data
Authors:
Shane Lubold,
Bolun Liu,
Tyler H. McCormick
Abstract:
Networks describe the often complex relationships between individual actors. In this work, we address the question of how to determine whether a parametric model, such as a stochastic block model or latent space model, fits a dataset well and will extrapolate to similar data. We use recent results in random matrix theory to derive a general goodness-of-fit test for dyadic data. We show that our method, when applied to a specific model of interest, provides a straightforward, computationally fast way of selecting parameters in a number of commonly used network models. For example, we show how to select the dimension of the latent space in latent space models. Unlike other network goodness-of-fit methods, our general approach does not require simulating from a candidate parametric model, which can be cumbersome with large graphs, and eliminates the need to choose a particular set of statistics on the graph for comparison. It also allows us to perform goodness-of-fit tests on partial network data, such as Aggregated Relational Data. We show with simulations that our method performs well in many situations of interest. We analyze several empirically relevant networks and show that our method leads to improved community detection algorithms. R code to implement our method is available on GitHub.
Submitted 17 June, 2021;
originally announced June 2021.
-
Inference for Network Regression Models with Community Structure
Authors:
Mengjie Pan,
Tyler H. McCormick,
Bailey K. Fosdick
Abstract:
Network regression models, where the outcome comprises the valued edge in a network and the predictors are actor or dyad-level covariates, are used extensively in the social and biological sciences. Valid inference relies on accurately modeling the residual dependencies among the relations. Frequently, homogeneity assumptions are placed on the errors; these assumptions are commonly incorrect and ignore critical, natural clustering of the actors. In this work, we present a novel regression modeling framework that models the errors as resulting from a community-based dependence structure and exploits the subsequent exchangeability properties of the error distribution to obtain parsimonious standard errors for regression parameters.
Submitted 8 June, 2021;
originally announced June 2021.
-
Identifying the latent space geometry of network models through analysis of curvature
Authors:
Shane Lubold,
Arun G. Chandrasekhar,
Tyler H. McCormick
Abstract:
A common approach to modeling networks assigns each node to a position on a low-dimensional manifold where distance is inversely proportional to connection likelihood. More positive manifold curvature encourages more and tighter communities; negative curvature induces repulsion. We consistently estimate manifold type, dimension, and curvature from simply connected, complete Riemannian manifolds of constant curvature. We represent the graph as a noisy distance matrix based on the ties between cliques, then develop hypothesis tests to determine whether the observed distances could plausibly be embedded isometrically in each of the candidate geometries. We apply our approach to data-sets from economics and neuroscience.
Submitted 30 December, 2022; v1 submitted 18 December, 2020;
originally announced December 2020.
-
A flexible Bayesian framework to estimate age- and cause-specific child mortality over time from sample registration data
Authors:
Austin E Schumacher,
Tyler H McCormick,
Jon Wakefield,
Yue Chu,
Jamie Perin,
Francisco Villavicencio,
Noah Simon,
Li Liu
Abstract:
In order to implement disease-specific interventions in young age groups, policy makers in low- and middle-income countries require timely and accurate estimates of age- and cause-specific child mortality. High quality data is not available in settings where these interventions are most needed, but there is a push to create sample registration systems that collect detailed mortality information. Current methods that estimate mortality from this data employ multistage frameworks without rigorous statistical justification that separately estimate all-cause and cause-specific mortality and are not sufficiently adaptable to capture important features of the data. We propose a flexible Bayesian modeling framework to estimate age- and cause-specific child mortality from sample registration data. We provide a theoretical justification for the framework, explore its properties via simulation, and use it to estimate mortality trends using data from the Maternal and Child Health Surveillance System in China.
Submitted 18 May, 2021; v1 submitted 29 February, 2020;
originally announced March 2020.
-
Anomaly Detection in Large Scale Networks with Latent Space Models
Authors:
Wesley Lee,
Tyler H. McCormick,
Joshua Neil,
Cole Sodja,
Yanran Cui
Abstract:
We develop a real-time anomaly detection algorithm for directed activity on large, sparse networks. We model the propensity for future activity using a dynamic logistic model with interaction terms for sender- and receiver-specific latent factors in addition to sender- and receiver-specific popularity scores; deviations from this underlying model constitute potential anomalies. Latent nodal attributes are estimated via a variational Bayesian approach and may change over time, representing natural shifts in network activity. Estimation is augmented with a case-control approximation to take advantage of the sparsity of the network and reduces computational complexity from $O(N^2)$ to $O(E)$, where $N$ is the number of nodes and $E$ is the number of observed edges. We run our algorithm on network event records collected from an enterprise network of over 25,000 computers and are able to identify a red team attack with half the detection rate required of the model without latent interaction terms.
Submitted 29 January, 2021; v1 submitted 13 November, 2019;
originally announced November 2019.
-
Consistently estimating network statistics using Aggregated Relational Data
Authors:
Emily Breza,
Arun G. Chandrasekhar,
Shane Lubold,
Tyler H. McCormick,
Mengjie Pan
Abstract:
Collecting complete network data is expensive, time-consuming, and often infeasible. Aggregated Relational Data (ARD), which capture information about a social network by asking a respondent questions of the form "How many people with trait X do you know?", provide a low-cost option when collecting complete network data is not possible. Rather than asking about connections between each pair of individuals directly, ARD collects the number of contacts the respondent knows with a given trait. Despite widespread use and a growing literature on ARD methodology, there is still no systematic understanding of when and why ARD should accurately recover features of the unobserved network. This paper provides such a characterization by deriving conditions under which statistics about the unobserved network (or functions of these statistics like regression coefficients) can be consistently estimated using ARD. We do this by first providing consistent estimates of network model parameters for three commonly used probabilistic models: the beta-model with node-specific unobserved effects, the stochastic block model with unobserved community structure, and latent geometric space models with unobserved latent locations. A key observation behind these results is that cross-group link probabilities for a collection of (possibly unobserved) groups identify the model parameters, meaning ARD is sufficient for parameter estimation. With these estimated parameters, it is possible to simulate graphs from the fitted distribution and analyze the distribution of network statistics. We can then characterize conditions under which the simulated networks based on ARD will allow for consistent estimation of the unobserved network statistics, such as eigenvector centrality, or of response functions by or of the unobserved network, such as regression coefficients.
Submitted 21 October, 2022; v1 submitted 26 August, 2019;
originally announced August 2019.
-
Estimating spillovers using imprecisely measured networks
Authors:
Morgan Hardy,
Rachel M. Heath,
Wesley Lee,
Tyler H. McCormick
Abstract:
In many experimental contexts, whether and how network interactions impact the outcome of interest for both treated and untreated individuals are key concerns. Networks data is often assumed to perfectly represent these possible interactions. This paper considers the problem of estimating treatment effects when measured connections are, instead, a noisy representation of the true spillover pathways. We show that existing methods, using the potential outcomes framework, yield biased estimators in the presence of this mismeasurement. We develop a new method, using a class of mixture models, that can account for missing connections and discuss its estimation via the Expectation-Maximization algorithm. We check our method's performance by simulating experiments on real network data from 43 villages in India. Finally, we use data from a previously published study to show that estimates using our method are more robust to the choice of network measure.
Submitted 8 March, 2024; v1 submitted 29 March, 2019;
originally announced April 2019.
-
Modeling the social media relationships of Irish politicians using a generalized latent space stochastic blockmodel
Authors:
Tin Lok James Ng,
Thomas Brendan Murphy,
Ted Westling,
Tyler H. McCormick,
Bailey K. Fosdick
Abstract:
Dáil Éireann is the principal chamber of the Irish parliament. The 31st Dáil was in session from March 11th, 2011 to February 6th, 2016. Many of the members of the Dáil were active on social media and many were Twitter users who followed other members of the Dáil. The pattern of following amongst these politicians provides insights into political alignment within the Dáil. We propose a new model, called the generalized latent space stochastic blockmodel, which extends and generalizes both the latent space model and the stochastic blockmodel to study social media connections between members of the Dáil. The probability of an edge between two nodes in a network depends on their respective class labels as well as latent positions in an unobserved latent space. The proposed model is capable of representing transitivity, clustering, as well as disassortative mixing. A Bayesian method with Markov chain Monte Carlo sampling is proposed for estimation of model parameters. Model selection is performed using the WAIC criterion, and models with different numbers of classes or latent space dimensions are compared. We use the model to study Twitter following relationships of members of the Dáil and interpret structure found in these relationships. We find that the following relationships amongst politicians are mainly driven by past and present political party membership. We also find that the modeling outputs are informative when studying voting within the Dáil.
Submitted 13 December, 2020; v1 submitted 16 July, 2018;
originally announced July 2018.
-
Bayesian Joint Spike-and-Slab Graphical Lasso
Authors:
Zehang Richard Li,
Tyler H. McCormick,
Samuel J. Clark
Abstract:
In this article, we propose a new class of priors for Bayesian inference with multiple Gaussian graphical models. We introduce fully Bayesian treatments of two popular procedures, the group graphical lasso and the fused graphical lasso, and extend them to a continuous spike-and-slab framework to allow self-adaptive shrinkage and model selection simultaneously. We develop an EM algorithm that performs fast and dynamic explorations of posterior modes. Our approach selects sparse models efficiently with substantially smaller bias than would be induced by alternative regularization procedures. The performance of the proposed methods is demonstrated through simulation and two real data examples.
Submitted 9 May, 2019; v1 submitted 18 May, 2018;
originally announced May 2018.
-
Quantifying the Contributions of Training Data and Algorithm Logic to the Performance of Automated Cause-assignment Algorithms for Verbal Autopsy
Authors:
Samuel J. Clark,
Zehang Li,
Tyler H. McCormick
Abstract:
A verbal autopsy (VA) consists of a survey with a relative or close contact of a person who has recently died. VA surveys are commonly used to infer likely causes of death for individuals when deaths happen outside of hospitals or healthcare facilities. Several statistical and algorithmic methods are available to assign cause of death using VA surveys. Each of these methods requires as inputs some information about the joint distribution of symptoms and causes. In this note, we examine the generalizability of this symptom-cause information (SCI) by comparing different automated coding methods using various combinations of inputs and evaluation data. VA algorithm performance is affected by both the specific SCI itself and the logic of a given algorithm. Using a variety of performance metrics for all existing VA algorithms, we demonstrate that in general the adequacy of the information about the joint distribution between symptoms and cause affects performance at least as much as algorithm logic.
Submitted 15 November, 2018; v1 submitted 6 March, 2018;
originally announced March 2018.
-
Bayesian factor models for probabilistic cause of death assessment with verbal autopsies
Authors:
Tsuyoshi Kunihama,
Zehang Richard Li,
Samuel J. Clark,
Tyler H. McCormick
Abstract:
The distribution of deaths by cause provides crucial information for public health planning, response, and evaluation. About 60% of deaths globally are not registered or given a cause, limiting our ability to understand disease epidemiology. Verbal autopsy (VA) surveys are increasingly used in such settings to collect information on the signs, symptoms, and medical history of people who have recently died. This article develops a novel Bayesian method for estimation of population distributions of deaths by cause using verbal autopsy data. The proposed approach is based on a multivariate probit model where associations among items in questionnaires are flexibly induced by latent factors. Using the Population Health Metrics Research Consortium labeled data that include both VA and medically certified causes of death, we assess performance of the proposed method. Further, we estimate important questionnaire items that are highly associated with causes of death. This framework provides insights that will simplify future data collection.
Submitted 26 November, 2018; v1 submitted 4 March, 2018;
originally announced March 2018.
-
Using Bayesian latent Gaussian graphical models to infer symptom associations in verbal autopsies
Authors:
Zehang Richard Li,
Tyler H. McCormick,
Samuel J. Clark
Abstract:
Learning dependence relationships among variables of mixed types provides insights in a variety of scientific settings and is a well-studied problem in statistics. Existing methods, however, typically rely on copious, high quality data to accurately learn associations. In this paper, we develop a method for scientific settings where learning dependence structure is essential, but data are sparse and have a high fraction of missing values. Specifically, our work is motivated by survey-based cause of death assessments known as verbal autopsies (VAs). We propose a Bayesian approach to characterize dependence relationships using a latent Gaussian graphical model that incorporates informative priors on the marginal distributions of the variables. We demonstrate such information can improve estimation of the dependence structure, especially in settings with little training data. We show that our method can be integrated into existing probabilistic cause-of-death assignment algorithms and improves model performance while recovering dependence patterns between symptoms that can inform efficient questionnaire design in future data collection.
Submitted 24 July, 2019; v1 submitted 2 November, 2017;
originally announced November 2017.
-
An Expectation Conditional Maximization approach for Gaussian graphical models
Authors:
Zehang Richard Li,
Tyler H. McCormick
Abstract:
Bayesian graphical models are a useful tool for understanding dependence relationships among many variables, particularly in situations with external prior information. In high-dimensional settings, the space of possible graphs becomes enormous, rendering even state-of-the-art Bayesian stochastic search computationally infeasible. We propose a deterministic alternative to estimate Gaussian and Gaussian copula graphical models using an Expectation Conditional Maximization (ECM) algorithm, extending the EM approach from Bayesian variable selection to graphical model estimation. We show that the ECM approach enables fast posterior exploration under a sequence of mixture priors, and can incorporate multiple sources of information.
Submitted 6 February, 2019; v1 submitted 20 September, 2017;
originally announced September 2017.
-
Using Aggregated Relational Data to feasibly identify network structure without network data
Authors:
Emily Breza,
Arun G. Chandrasekhar,
Tyler H. McCormick,
Mengjie Pan
Abstract:
Social network data is often prohibitively expensive to collect, limiting empirical network research. Typical economic network mapping requires (1) enumerating a census, (2) eliciting the names of all network links for each individual, (3) matching the list of social connections to the census, and (4) repeating (1)-(3) across many networks. In settings requiring field surveys, steps (2)-(3) can be very expensive. In other network populations such as financial intermediaries or high-risk groups, proprietary data and privacy concerns may render (2)-(3) impossible. Both restrict the accessibility of high-quality networks research to investigators with considerable resources.
We propose an inexpensive and feasible strategy for network elicitation using Aggregated Relational Data (ARD) -- responses to questions of the form "How many of your social connections have trait k?" Our method uses ARD to recover the parameters of a general network formation model, which in turn, permits the estimation of any arbitrary node- or graph-level statistic. The method works well in simulations and in matching a range of network characteristics in real-world graphs from 75 Indian villages. Moreover, we replicate the results of two field experiments that involved collecting network data. We show that the researchers would have drawn similar conclusions using ARD alone. Finally, using calculations from J-PAL fieldwork, we show that in rural India, for example, ARD surveys are 80% cheaper than full network surveys.
Submitted 2 August, 2018; v1 submitted 12 March, 2017;
originally announced March 2017.
-
Regression of exchangeable relational arrays
Authors:
Frank W. Marrs,
Bailey K. Fosdick,
Tyler H. McCormick
Abstract:
Relational arrays represent measures of association between pairs of actors, often in varied contexts or over time. Trade flows between countries, financial transactions between individuals, contact frequencies between school children in classrooms, and dynamic protein-protein interactions are all examples of relational arrays. Elements of a relational array are often modeled as a linear function of observable covariates. Uncertainty estimates for regression coefficient estimators -- and ideally the coefficient estimators themselves -- must account for dependence between elements of the array (e.g. relations involving the same actor) and existing estimators of standard errors that recognize such relational dependence rely on estimating extremely complex, heterogeneous structure across actors. This paper develops a new class of parsimonious coefficient and standard error estimators for regressions of relational arrays. We leverage an exchangeability assumption to derive standard error estimators that pool information across actors and are substantially more accurate than existing estimators in a variety of settings. This exchangeability assumption is pervasive in network and array models in the statistics literature, but not previously considered when adjusting for dependence in a regression setting with relational data. We demonstrate improvements in inference theoretically, via a simulation study, and by analysis of a data set involving international trade.
Submitted 22 June, 2022; v1 submitted 19 January, 2017;
originally announced January 2017.
-
Inferring social structure from continuous-time interaction data
Authors:
Wesley Lee,
Bailey K. Fosdick,
Tyler H. McCormick
Abstract:
Relational event data, which consist of events involving pairs of actors over time, are now commonly available at the finest of temporal resolutions. Existing continuous-time methods for modeling such data are based on point processes and directly model interaction "contagion," whereby one interaction increases the propensity of future interactions among actors, often as dictated by some latent variable structure. In this article, we present an alternative approach to using temporal-relational point process models for continuous-time event data. We characterize interactions between a pair of actors as either spurious or resulting from an underlying, persistent connection in a latent social network. We argue that consistent deviations from expected behavior, rather than solely high frequency counts, are crucial for identifying well-established underlying social relationships. This study aims to explore these latent network structures in two contexts: one comprising college students and another involving barn swallows.
Submitted 15 January, 2018; v1 submitted 8 September, 2016;
originally announced September 2016.
-
Multiresolution network models
Authors:
Bailey K. Fosdick,
Tyler H. McCormick,
Thomas Brendan Murphy,
Tin Lok James Ng,
Ted Westling
Abstract:
Many existing statistical and machine learning tools for social network analysis focus on a single level of analysis. Methods designed for clustering optimize a global partition of the graph, whereas projection-based approaches (e.g. the latent space model in the statistics literature) represent in rich detail the roles of individuals. Many pertinent questions in sociology and economics, however, span multiple scales of analysis. Further, many questions involve comparisons across disconnected graphs that will inevitably be of different sizes, either due to missing data or the inherent heterogeneity in real-world networks. We propose a class of network models that represent network structure on multiple scales and facilitate comparison across graphs with different numbers of individuals. These models differentially invest modeling effort within subgraphs of high density, often termed communities, while maintaining a parsimonious structure between said subgraphs. We show that our model class is projective, highlighting an ongoing discussion in the social network modeling literature on the dependence of inference paradigms on the size of the observed graph. We illustrate the utility of our method using data on household relations from Karnataka, India.
Submitted 5 July, 2018; v1 submitted 26 August, 2016;
originally announced August 2016.
-
Interpretable classifiers using rules and Bayesian analysis: Building a better stroke prediction model
Authors:
Benjamin Letham,
Cynthia Rudin,
Tyler H. McCormick,
David Madigan
Abstract:
We aim to produce predictive models that are not only accurate, but are also interpretable to human experts. Our models are decision lists, which consist of a series of if...then... statements (e.g., if high blood pressure, then stroke) that discretize a high-dimensional, multivariate feature space into a series of simple, readily interpretable decision statements. We introduce a generative model called Bayesian Rule Lists that yields a posterior distribution over possible decision lists. It employs a novel prior structure to encourage sparsity. Our experiments show that Bayesian Rule Lists has predictive accuracy on par with the current top algorithms for prediction in machine learning. Our method is motivated by recent developments in personalized medicine, and can be used to produce highly accurate and interpretable medical scoring systems. We demonstrate this by producing an alternative to the CHADS$_2$ score, actively used in clinical practice for estimating the risk of stroke in patients that have atrial fibrillation. Our model is as interpretable as CHADS$_2$, but more accurate.
Submitted 5 November, 2015;
originally announced November 2015.
-
Beyond prediction: A framework for inference with variational approximations in mixture models
Authors:
Ted Westling,
Tyler H. McCormick
Abstract:
Variational inference is a popular method for estimating model parameters and conditional distributions in hierarchical and mixed models, which arise frequently in many settings in the health, social, and biological sciences. Variational inference in a frequentist context works by approximating intractable conditional distributions with a tractable family and optimizing the resulting lower bound on the log-likelihood. The variational objective function is typically less computationally intensive to optimize than the true likelihood, enabling scientists to fit rich models even with extremely large datasets. Despite widespread use, little is known about the general theoretical properties of estimators arising from variational approximations to the log-likelihood, which hinders their use in inferential statistics. In this paper we connect such estimators to profile M-estimation, which enables us to provide regularity conditions for consistency and asymptotic normality of variational estimators. Our theory also motivates three methodological improvements to variational inference: estimation of the asymptotic model-robust covariance matrix, a one-step correction that improves estimator efficiency, and an empirical assessment of consistency. We evaluate the proposed results using simulation studies and data on marijuana use from the National Longitudinal Study of Youth.
Submitted 9 January, 2019; v1 submitted 27 October, 2015;
originally announced October 2015.
-
Reactive point processes: A new approach to predicting power failures in underground electrical systems
Authors:
Şeyda Ertekin,
Cynthia Rudin,
Tyler H. McCormick
Abstract:
Reactive point processes (RPPs) are a new statistical model designed for predicting discrete events in time based on past history. RPPs were developed to handle an important problem within the domain of electrical grid reliability: short-term prediction of electrical grid failures ("manhole events"), including outages, fires, explosions and smoking manholes, which can cause threats to public safety and reliability of electrical service in cities. RPPs incorporate self-exciting, self-regulating and saturating components. The self-excitement occurs as a result of a past event, which causes a temporary rise in vulnerability to future events. The self-regulation occurs as a result of an external inspection which temporarily lowers vulnerability to future events. RPPs can saturate when too many events or inspections occur close together, which ensures that the probability of an event stays within a realistic range. Two of the operational challenges for power companies are (i) making continuous-time failure predictions, and (ii) cost/benefit analysis for decision making and proactive maintenance. RPPs are naturally suited for handling both of these challenges. We use the model to predict power-grid failures in Manhattan over a short-term horizon, and to provide a cost/benefit analysis of different proactive maintenance programs.
Submitted 28 May, 2015;
originally announced May 2015.
-
Modeling Recovery Curves With Application to Prostatectomy
Authors:
Fulton Wang,
Tyler H. McCormick,
Cynthia Rudin,
John Gore
Abstract:
We propose a Bayesian model that predicts recovery curves based on information available before the disruptive event. A recovery curve of interest is the quantified sexual function of prostate cancer patients after prostatectomy surgery. We illustrate the utility of our model as a pre-treatment medical decision aid, producing personalized predictions that are both interpretable and accurate. We uncover covariate relationships that agree with and supplement those in the existing medical literature.
Submitted 4 March, 2018; v1 submitted 27 April, 2015;
originally announced April 2015.
-
Probabilistic Cause-of-death Assignment using Verbal Autopsies
Authors:
Tyler H. McCormick,
Zehang Li,
Clara Calvert,
Amelia C. Crampin,
Kathleen Kahn,
Samuel J. Clark
Abstract:
In regions without complete-coverage civil registration and vital statistics systems there is uncertainty about even the most basic demographic indicators. In such areas the majority of deaths occur outside hospitals and are not recorded. Worldwide, fewer than one-third of deaths are assigned a cause, with the least information available from the most impoverished nations. In populations like this, verbal autopsy (VA) is a commonly used tool to assess cause of death and estimate cause-specific mortality rates and the distribution of deaths by cause. VA uses an interview with caregivers of the decedent to elicit data describing the signs and symptoms leading up to the death. This paper develops a new statistical tool known as InSilicoVA to classify cause of death using information acquired through VA. InSilicoVA shares uncertainty between cause of death assignments for specific individuals and the distribution of deaths by cause across the population. Using side-by-side comparisons with both observed and simulated data, we demonstrate that InSilicoVA has distinct advantages compared to currently available methods.
Submitted 21 September, 2015; v1 submitted 11 November, 2014;
originally announced November 2014.
-
Clustering South African households based on their asset status using latent variable models
Authors:
Damien McParland,
Isobel Claire Gormley,
Tyler H. McCormick,
Samuel J. Clark,
Chodziwadziwa Whiteson Kabudula,
Mark A. Collinson
Abstract:
The Agincourt Health and Demographic Surveillance System has since 2001 conducted a biannual household asset survey in order to quantify household socio-economic status (SES) in a rural population living in northeast South Africa. The survey contains binary, ordinal and nominal items. In the absence of income or expenditure data, the SES landscape in the study population is explored and described by clustering the households into homogeneous groups based on their asset status. A model-based approach to clustering the Agincourt households, based on latent variable models, is proposed. In the case of modeling binary or ordinal items, item response theory models are employed. For nominal survey items, a factor analysis model, similar in nature to a multinomial probit model, is used. Both model types have an underlying latent variable structure - this similarity is exploited and the models are combined to produce a hybrid model capable of handling mixed data types. Further, a mixture of the hybrid models is considered to provide clustering capabilities within the context of mixed binary, ordinal and nominal response data. The proposed model is termed a mixture of factor analyzers for mixed data (MFA-MD). The MFA-MD model is applied to the survey data to cluster the Agincourt households into homogeneous groups. The model is estimated within the Bayesian paradigm, using a Markov chain Monte Carlo algorithm. Intuitive groupings result, providing insight into the different socio-economic strata within the Agincourt region.
Submitted 31 July, 2014; v1 submitted 21 January, 2014;
originally announced January 2014.
-
Estimating population size using the network scale up method
Authors:
Rachael Maltiel,
Adrian E. Raftery,
Tyler H. McCormick,
Aaron J. Baraff
Abstract:
We develop methods for estimating the size of hard-to-reach populations from data collected using network-based questions on standard surveys. Such data arise by asking respondents how many people they know in a specific group (e.g., people named Michael, intravenous drug users). The Network Scale up Method (NSUM) is a tool for producing population size estimates using these indirect measures of respondents' networks. Killworth et al. [Soc. Netw. 20 (1998a) 23-50, Evaluation Review 22 (1998b) 289-308] proposed maximum likelihood estimators of population size for a fixed effects model in which respondents' degrees or personal network sizes are treated as fixed. We extend this by treating personal network sizes as random effects, yielding principled statements of uncertainty. This allows us to generalize the model to account for variation in people's propensity to know people in particular subgroups (barrier effects), such as their tendency to know people like themselves, as well as their lack of awareness of or reluctance to acknowledge their contacts' group memberships (transmission bias). NSUM estimates also suffer from recall bias, in which respondents tend to underestimate the number of members of larger groups that they know, and conversely for smaller groups. We propose a data-driven adjustment method to deal with this. Our methods perform well in simulation studies, generating improved estimates and calibrated uncertainty intervals, as well as in back estimates of real sample data. We apply them to data from a study of HIV/AIDS prevalence in Curitiba, Brazil. Our results show that when transmission bias is present, external information about its likely extent can greatly improve the estimates. The methods are implemented in the NSUM R package.
Submitted 5 November, 2015; v1 submitted 4 June, 2013;
originally announced June 2013.
-
Latent demographic profile estimation in hard-to-reach groups
Authors:
Tyler H. McCormick,
Tian Zheng
Abstract:
The sampling frame in most social science surveys excludes members of certain groups, known as hard-to-reach groups. These groups, or subpopulations, may be difficult to access (the homeless, e.g.), camouflaged by stigma (individuals with HIV/AIDS), or both (commercial sex workers). Even basic demographic information about these groups is typically unknown, especially in many developing nations. We present statistical models which leverage social network structure to estimate demographic characteristics of these subpopulations using aggregated relational data (ARD), or questions of the form "How many X's do you know?" Unlike other network-based techniques for reaching these groups, ARD require no special sampling strategy and are easily incorporated into standard surveys. ARD also do not require respondents to reveal their own group membership. We propose a Bayesian hierarchical model for estimating the demographic characteristics of hard-to-reach groups, or latent demographic profiles, using ARD. We propose two estimation techniques. First, we propose a Markov-chain Monte Carlo algorithm for existing data or cases where the full posterior distribution is of interest. For cases when new data can be collected, we propose guidelines and, based on these guidelines, propose a simple estimate motivated by a missing data approach. Using data from McCarty et al. [Human Organization 60 (2001) 28-39], we estimate the age and gender profiles of six hard-to-reach groups, such as individuals who have HIV, women who were raped, and homeless persons. We also evaluate our simple estimates using simulation studies.
Submitted 11 January, 2013;
originally announced January 2013.
-
Bayesian hierarchical rule modeling for predicting medical conditions
Authors:
Tyler H. McCormick,
Cynthia Rudin,
David Madigan
Abstract:
We propose a statistical modeling technique, called the Hierarchical Association Rule Model (HARM), that predicts a patient's possible future medical conditions given the patient's current and past history of reported conditions. The core of our technique is a Bayesian hierarchical model for selecting predictive association rules (such as "condition 1 and condition 2 $\rightarrow$ condition 3") from a large set of candidate rules. Because this method "borrows strength" using the conditions of many similar patients, it is able to provide predictions specialized to any given patient, even when little information about the patient's history of conditions is available.
Submitted 28 June, 2012;
originally announced June 2012.