-
Mutually Exciting Point Processes for Crowdfunding Platform Dynamics
Authors:
Alexandra Djorno,
Forrest W. Crawford
Abstract:
Crowdfunding is a powerful tool for individuals or organizations seeking financial support from a vast audience. Despite widespread adoption, managers often lack information about dynamics of their platforms. Hawkes processes have been used to represent self-exciting behavior in a wide variety of empirical fields, but have not been applied to crowdfunding platforms in a way that could help managers understand the dynamics of users' engagement with the platform. In this paper, we extend the Hawkes process to capture important features of crowdfunding platform contributions and apply the model to analyze data from two donation-based platforms. For each user-item pair, the continuous-time conditional intensity is modeled as the superposition of a self-exciting baseline rate and a mutual excitation by preferential attachment, both depending on prior user engagement, and attenuated by a power law decay of user interest. The model is thus structured around two time-varying features -- contribution count and item popularity. We estimate parameters that govern the dynamics of contributions from 2,000 items and 164,000 users over several years. We identify a bottleneck in the user contribution pipeline, measure the force of item popularity, and characterize the decline in user interest over time. A contagion effect is introduced to assess the effect of item popularity on contribution rates. This mechanistic model lays the groundwork for enhanced crowdfunding platform monitoring based on evaluation of counterfactual scenarios and formulation of dynamics-aware recommendations.
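As a rough illustration of the kind of conditional intensity described in this abstract, the sketch below evaluates a generic univariate Hawkes intensity with a power-law decay kernel. It is not the authors' model: the kernel form alpha / (t - t_i + c)**p and the parameter names mu, alpha, c, and p are assumptions chosen for illustration, and the user-item structure and preferential-attachment term are omitted.

```python
def hawkes_intensity(t, event_times, mu, alpha, c, p):
    """Generic Hawkes intensity with a power-law kernel (illustrative only).

    lambda(t) = mu + sum over past events t_i < t of alpha / (t - t_i + c)**p

    mu    : baseline rate (assumed constant here)
    alpha : excitation added by each past event
    c, p  : offset and exponent of the assumed power-law decay
    """
    if c <= 0 or p <= 0:
        raise ValueError("c and p must be positive")
    # Only events strictly before t contribute to the intensity at t.
    excitation = sum(
        alpha / (t - t_i + c) ** p for t_i in event_times if t_i < t
    )
    return mu + excitation


# One past contribution at time 0, evaluated one time unit later.
print(hawkes_intensity(1.0, [0.0], mu=0.5, alpha=1.0, c=1.0, p=2.0))
```

With no past events the intensity reduces to the baseline mu; each past event raises it by an amount that shrinks polynomially with elapsed time.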
Submitted 23 February, 2024;
originally announced February 2024.
-
The role of discretization scales in causal inference with continuous-time treatment
Authors:
Jinghao Sun,
Forrest W. Crawford
Abstract:
There are well-established methods for identifying the causal effect of a time-varying treatment applied at discrete time points. However, in the real world, many treatments are continuous or have a finer time scale than the one used for measurement or analysis. While researchers have investigated the discrepancies between estimates under varying discretization scales using simulations and empirical data, it is still unclear how the choice of discretization scale affects causal inference. To address this gap, we present a framework to understand how discretization scales impact the properties of causal inferences about the effect of a time-varying treatment. We introduce the concept of "identification bias", which is the difference between the causal estimand for a continuous-time treatment and the purported estimand of a discretized version of the treatment. We show that this bias can persist even with an infinite number of longitudinal treatment-outcome trajectories. We specifically examine the identification problem in a class of linear stochastic continuous-time data-generating processes and demonstrate the identification bias of the g-formula in this context. Our findings indicate that discretization bias can significantly impact empirical analysis, especially when there are limited repeated measurements. Therefore, we recommend that researchers carefully consider the choice of discretization scale and perform sensitivity analysis to address this bias. We also propose a simple and heuristic quantitative measure for sensitivity concerning discretization and suggest that researchers report this measure along with point and interval estimates in their work. By doing so, researchers can better understand and address the potential impact of discretization bias on causal inference.
Submitted 15 June, 2023;
originally announced June 2023.
-
Causal identification for continuous-time stochastic processes
Authors:
Jinghao Sun,
Forrest W. Crawford
Abstract:
Many real-world processes are trajectories that may be regarded as continuous-time "functional data". Examples include patients' biomarker concentrations, environmental pollutant levels, and prices of stocks. Corresponding advances in data collection have yielded near continuous-time measurements, from e.g. physiological monitors, wearable digital devices, and environmental sensors. Statistical methodology for estimating the causal effect of a time-varying treatment, measured discretely in time, is well developed. But discrete-time methods like the g-formula, structural nested models, and marginal structural models do not generalize easily to continuous time, due to the entanglement of uncountably infinite variables. Moreover, researchers have shown that the choice of discretization time scale can seriously affect the quality of causal inferences about the effects of an intervention. In this paper, we establish causal identification results for continuous-time treatment-outcome relationships for general cadlag stochastic processes under continuous-time confounding, through orthogonalization and weighting. We use three concrete running examples to demonstrate the plausibility of our identification assumptions, as well as their connections to the discrete-time g methods literature.
Submitted 29 November, 2022;
originally announced November 2022.
-
Communication network dynamics in a large organizational hierarchy
Authors:
Nathaniel Josephs,
Sida Peng,
Forrest W. Crawford
Abstract:
Most businesses impose a supervisory hierarchy on employees to facilitate management, decision-making, and collaboration, yet routine inter-employee communication patterns within workplaces tend to emerge more naturally as a consequence of both supervisory relationships and the needs of the organization. What then is the relationship between a formal organizational structure and the emergent communications between its employees? Understanding the nature of this relationship is critical for the successful management of an organization. While scholars of organizational management have proposed theories relating organizational trees to communication dynamics, and separately, network scientists have studied the topological structure of communication patterns in different types of organizations, existing empirical analyses are both lacking in representativeness and limited in size. In fact, much of the methodology used to study the relationship between organizational hierarchy and communication patterns comes from analyses of the Enron email corpus, reflecting a uniquely dysfunctional corporate environment. In this paper, we develop new methodology for assessing the relationship between organizational hierarchy and communication dynamics and apply it to Microsoft Corporation, currently the highest valued company in the world, consisting of approximately 200,000 employees divided into 88 teams. This reveals distinct communication network structures within and between teams. We then characterize the relationship of routine employee communication patterns to these team supervisory hierarchies, while empirically evaluating several theories of organizational management and performance. To do so, we propose new measures of communication reciprocity and new shortest-path distances for trees to track the frequency of messages passed up, down, and across the organizational hierarchy.
Submitted 11 March, 2024; v1 submitted 1 August, 2022;
originally announced August 2022.
-
A sample size heuristic for network scale-up studies
Authors:
Nathaniel Josephs,
Dennis M. Feehan,
Forrest W. Crawford
Abstract:
The network scale-up method (NSUM) is a survey-based method for estimating the number of individuals in a hidden or hard-to-reach subgroup of a general population. In NSUM surveys, sampled individuals report how many others they know in the subpopulation of interest (e.g. "How many sex workers do you know?") and how many others they know in subpopulations of the general population (e.g. "How many bus drivers do you know?"). NSUM is widely used to estimate the size of important epidemiological risk groups, including men who have sex with men, sex workers, HIV+ individuals, and drug users. Unlike several other methods for population size estimation, NSUM requires only a single random sample and the estimator has a conveniently simple form. Despite its popularity, there are no published guidelines for the minimum sample size calculation to achieve a desired statistical precision. Here, we provide a sample size formula that can be employed in any NSUM survey. We show analytically and by simulation that the sample size controls error at the nominal rate and is robust to some forms of network model mis-specification. We apply this methodology to study the minimum sample size and relative error properties of several published NSUM surveys.
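The abstract notes that the NSUM estimator has a simple form but does not state it. A minimal sketch of the commonly used basic scale-up estimator follows, assuming each respondent i reports y_i contacts in the hidden group and has personal network size d_i (in practice d_i is itself estimated from the "known subpopulation" questions). The function and argument names are hypothetical, and this is not the paper's sample size formula.

```python
def nsum_estimate(hidden_counts, degrees, population_size):
    """Basic scale-up estimate of hidden population size (illustrative).

    N_hidden ~= N * sum(y_i) / sum(d_i)

    hidden_counts   : y_i, contacts each respondent reports in the hidden group
    degrees         : d_i, each respondent's (estimated) personal network size
    population_size : N, size of the general population
    """
    if len(hidden_counts) != len(degrees):
        raise ValueError("need one count and one degree per respondent")
    total_degree = sum(degrees)
    if total_degree <= 0:
        raise ValueError("total degree must be positive")
    return population_size * sum(hidden_counts) / total_degree


# Three respondents report 1, 0, and 2 hidden-group contacts.
print(nsum_estimate([1, 0, 2], [100, 150, 50], 1_000_000))
```

The estimate scales the fraction of reported ties that point into the hidden group up to the whole population, which is why a single random sample suffices.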
Submitted 18 November, 2021;
originally announced November 2021.
-
Causal identification of infectious disease intervention effects in a clustered population
Authors:
Xiaoxuan Cai,
Eben Kenah,
Forrest W. Crawford
Abstract:
Causal identification of treatment effects for infectious disease outcomes in interconnected populations is challenging because infection outcomes may be transmissible to others, and treatment given to one individual may affect others' outcomes. Contagion, or transmissibility of outcomes, complicates standard conceptions of treatment interference in which an intervention delivered to one individual can affect outcomes of others. Several statistical frameworks have been proposed to measure causal treatment effects in this setting, including structural transmission models, mediation-based partnership models, and randomized trial designs. However, existing estimands for infectious disease intervention effects are of limited conceptual usefulness: Some are parameters in a structural model whose causal interpretation is unclear, others are causal effects defined only in a restricted two-person setting, and still others are nonparametric estimands that arise naturally in the context of a randomized trial but may not measure any biologically meaningful effect. In this paper, we describe a unifying formalism for defining nonparametric structural causal estimands and an identification strategy for learning about infectious disease intervention effects in clusters of interacting individuals when infection times are observed. The estimands generalize existing quantities and provide a framework for causal identification in randomized and observational studies, including situations where only binary infection outcomes are observed. A semiparametric class of pairwise Cox-type transmission hazard models is used to facilitate statistical inference in finite samples. A comprehensive simulation study compares existing and proposed estimands under a variety of randomized and observational vaccine trial designs.
Submitted 7 May, 2021;
originally announced May 2021.
-
Dependence-robust confidence intervals for capture-recapture surveys
Authors:
Jinghao Sun,
Luk Van Baelen,
Els Plettinckx,
Forrest W. Crawford
Abstract:
Capture-recapture (CRC) surveys are used to estimate the size of a population whose members cannot be enumerated directly. CRC surveys have been used to estimate the number of Covid-19 infections, people who use drugs, sex workers, conflict casualties, and trafficking victims. When $k$ capture samples are obtained, counts of unit captures in subsets of samples are represented naturally by a $2^k$ contingency table in which one element -- the number of individuals appearing in none of the samples -- remains unobserved. In the absence of additional assumptions, the population size is not identifiable (i.e. not point-identified). Stringent assumptions about the dependence between samples are often used to achieve point-identification. However, real-world CRC surveys often use convenience samples in which the assumed dependence cannot be guaranteed, and population size estimates under these assumptions may lack empirical credibility. In this work, we apply the theory of partial identification to show that weak assumptions or qualitative knowledge about the nature of dependence between samples can be used to characterize a non-trivial confidence set for the true population size. We construct confidence sets under bounds on pairwise capture probabilities using two methods: test inversion bootstrap confidence intervals, and profile likelihood confidence intervals. Simulation results demonstrate well-calibrated confidence sets for each method. In an extensive real-world study, we apply the new methodology to the problem of using heterogeneous survey data to estimate the number of people who inject drugs in Brussels, Belgium.
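To make the unobserved cell concrete, the sketch below treats the simplest case of k = 2 samples and fills in the missing count under the assumption that the two samples are independent (the classical Lincoln-Petersen estimator). This is the kind of stringent assumption the abstract cautions about, not the partial-identification method the paper proposes; the function and argument names are hypothetical.

```python
def two_sample_population_estimate(n_both, n_first_only, n_second_only):
    """Two-sample capture-recapture estimate under independence (illustrative).

    The 2x2 table has three observed cells and one unobserved cell
    (individuals in neither sample). Assuming independent samples, the
    missing cell is estimated as n_first_only * n_second_only / n_both.
    """
    if n_both <= 0:
        raise ValueError("need at least one individual captured in both samples")
    n_neither_hat = n_first_only * n_second_only / n_both
    return n_both + n_first_only + n_second_only + n_neither_hat


# 20 people appear in both samples, 80 only in the first, 30 only in the second.
print(two_sample_population_estimate(20, 80, 30))
```

If the samples are positively or negatively dependent, the estimate of the missing cell is biased, which is the motivation for replacing the point estimate with a confidence set under weaker assumptions.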
Submitted 14 October, 2022; v1 submitted 31 July, 2020;
originally announced August 2020.
-
Identification of causal intervention effects under contagion
Authors:
Xiaoxuan Cai,
Wen Wei Loh,
Forrest W. Crawford
Abstract:
Defining and identifying causal intervention effects for transmissible infectious disease outcomes is challenging because a treatment -- such as a vaccine -- given to one individual may affect the infection outcomes of others. Epidemiologists have proposed causal estimands to quantify effects of interventions under contagion using a two-person partnership model. These simple conceptual models have helped researchers develop causal estimands relevant to clinical evaluation of vaccine effects. However, many of these partnership models are formulated under structural assumptions that preclude realistic infectious disease transmission dynamics, limiting their conceptual usefulness in defining and identifying causal treatment effects in empirical intervention trials. In this paper, we propose causal intervention effects in two-person partnerships under arbitrary infectious disease transmission dynamics, and give nonparametric identification results showing how effects can be estimated in empirical trials using time-to-infection or binary outcome data. The key insight is that contagion is a causal phenomenon that induces conditional independencies on infection outcomes that can be exploited for the identification of clinically meaningful causal estimands. These new estimands are compared to existing quantities, and results are illustrated using a realistic simulation of an HIV vaccine trial.
Submitted 10 December, 2019; v1 submitted 9 December, 2019;
originally announced December 2019.
-
Efficient and minimal length parametric conformal prediction regions
Authors:
Daniel J. Eck,
Forrest W. Crawford
Abstract:
Conformal prediction methods construct prediction regions for iid data that are valid in finite samples. We provide two parametric conformal prediction regions that are applicable for a wide class of continuous statistical models. This class of statistical models includes generalized linear models (GLMs) with continuous outcomes. Our parametric conformal prediction regions possess finite sample validity, even when the model is misspecified, and are asymptotically of minimal length when the model is correctly specified. The first parametric conformal prediction region is constructed through binning of the predictor space, guarantees finite-sample local validity and is asymptotically minimal at the $\sqrt{\log(n)/n}$ rate when the dimension $d$ of the predictor space is one or two, and converges at the $O\{(\log(n)/n)^{1/d}\}$ rate when $d > 2$. The second parametric conformal prediction region is constructed by transforming the outcome variable to a common distribution via the probability integral transform, guarantees finite-sample marginal validity, and is asymptotically minimal at the $\sqrt{\log(n)/n}$ rate. We develop a novel concentration inequality for maximum likelihood estimation that induces these convergence rates. We analyze prediction region coverage properties, large-sample efficiency, and robustness properties of four methods for constructing conformal prediction intervals for GLMs: fully nonparametric kernel-based conformal, residual-based conformal, normalized residual-based conformal, and parametric conformal which uses the assumed GLM density as a conformity measure. Extensive simulations compare these approaches to standard asymptotic prediction regions. The utility of the parametric conformal prediction region is demonstrated in an application to interval prediction of glycosylated hemoglobin levels, a blood measurement used to diagnose diabetes.
Submitted 25 October, 2019; v1 submitted 9 May, 2019;
originally announced May 2019.
-
Interpretation of the individual effect under treatment spillover
Authors:
Forrest W. Crawford,
Olga Morozova,
Ashley L. Buchanan,
Donna Spiegelman
Abstract:
Some interventions may include important spillover or dissemination effects between study participants. For example, vaccines, cash transfers, and education programs may exert a causal effect on participants beyond those to whom individual treatment is assigned. In a recent paper, Buchanan et al. provide a causal definition of the "individual effect" of an intervention in networks of people who inject drugs. In this short note, we discuss the interpretation of the individual effect when a spillover or dissemination effect exists.
Submitted 4 February, 2019;
originally announced February 2019.
-
Randomization for the susceptibility effect of an infectious disease intervention
Authors:
Daniel J. Eck,
Olga Morozova,
Forrest W. Crawford
Abstract:
Randomized trials of infectious disease interventions, such as vaccines, often focus on groups of connected or potentially interacting individuals. When the pathogen of interest is transmissible between study subjects, interference may occur: individual infection outcomes may depend on treatments received by others. Epidemiologists have defined the primary causal effect of interest -- called the "susceptibility effect" -- as a contrast in infection risk under treatment versus no treatment, while holding exposure to infectiousness constant. A related quantity -- the "direct effect" -- is defined as an unconditional contrast between the infection risk under treatment versus no treatment. The purpose of this paper is to show that under a widely recommended randomization design, the direct effect may fail to recover the sign of the true susceptibility effect of the intervention in a randomized trial when outcomes are contagious. The analytical approach uses structural features of infectious disease transmission to define the susceptibility effect. A new probabilistic coupling argument reveals stochastic dominance relations between potential infection outcomes under different treatment allocations. The results suggest that estimating the direct effect under randomization may provide misleading inferences about the effect of an intervention -- such as a vaccine -- when outcomes are contagious.
Submitted 9 December, 2019; v1 submitted 16 August, 2018;
originally announced August 2018.
-
Estimating the size of a hidden finite set: large-sample behavior of estimators
Authors:
Si Cheng,
Daniel J. Eck,
Forrest W. Crawford
Abstract:
A finite set is "hidden" if its elements are not directly enumerable or if its size cannot be ascertained via a deterministic query. In public health, epidemiology, demography, ecology and intelligence analysis, researchers have developed a wide variety of indirect statistical approaches, under different models for sampling and observation, for estimating the size of a hidden set. Some methods make use of random sampling with known or estimable sampling probabilities, and others make structural assumptions about relationships (e.g. ordering or network information) between the elements that comprise the hidden set. In this review, we describe models and methods for learning about the size of a hidden finite set, with special attention to asymptotic properties of estimators. We study the properties of these methods under two asymptotic regimes, "infill" in which the number of fixed-size samples increases, but the population size remains constant, and "outfill" in which the sample size and population size grow together. Statistical properties under these two regimes can be dramatically different.
Submitted 15 October, 2019; v1 submitted 14 August, 2018;
originally announced August 2018.
-
Risk ratios for contagious outcomes
Authors:
Olga Morozova,
Ted Cohen,
Forrest W. Crawford
Abstract:
The risk ratio is a popular tool for summarizing the relationship between a binary covariate and outcome, even when outcomes may be dependent. Investigations of infectious disease outcomes in cohort studies of individuals embedded within clusters -- households, villages, or small groups -- often report risk ratios. Epidemiologists have warned that risk ratios may be misleading when outcomes are contagious, but the nature and severity of this error are not well understood. In this study, we assess the epidemiologic meaning of the risk ratio when outcomes are contagious. We first give a structural definition of infectious disease transmission within clusters, based on the canonical susceptible-infective epidemic model. From this standard characterization, we define the individual-level ratio of instantaneous risks (hazard ratio) as the inferential target, and evaluate the properties of the risk ratio as an estimate of this quantity. We exhibit analytically and by simulation the circumstances under which the risk ratio implies an effect whose direction is opposite that of the true individual-level hazard ratio. In particular, the risk ratio can be greater than one even when the covariate of interest reduces both individual-level susceptibility to infection and transmissibility once infected. We explain these findings in the epidemiologic language of confounding and relate the direction bias to Simpson's paradox.
Submitted 18 July, 2017;
originally announced July 2017.
-
Estimating the Size of a Large Network and its Communities from a Random Sample
Authors:
Lin Chen,
Amin Karbasi,
Forrest W. Crawford
Abstract:
Most real-world networks are too large to be measured or studied directly and there is substantial interest in estimating global network properties from smaller sub-samples. One of the most important global properties is the number of vertices/nodes in the network. Estimating the number of vertices in a large network is a major challenge in computer science, epidemiology, demography, and intelligence analysis. In this paper we consider a population random graph G = (V;E) from the stochastic block model (SBM) with K communities/blocks. A sample is obtained by randomly choosing a subset W and letting G(W) be the induced subgraph in G of the vertices in W. In addition to G(W), we observe the total degree of each sampled vertex and its block membership. Given this partial information, we propose an efficient PopULation Size Estimation algorithm, called PULSE, that correctly estimates the size of the whole population as well as the size of each community. To support our theoretical analysis, we perform an exhaustive set of experiments to study the effects of sample size, K, and SBM model parameters on the accuracy of the estimates. The experimental results also demonstrate that PULSE significantly outperforms a widely-used method called the network scale-up estimator in a wide variety of scenarios. We conclude with extensions and directions for future work.
Submitted 26 October, 2016;
originally announced October 2016.
-
Direct likelihood-based inference for discretely observed stochastic compartmental models of infectious disease
Authors:
Lam Si Tung Ho,
Forrest W. Crawford,
Marc A. Suchard
Abstract:
Stochastic compartmental models are important tools for understanding the course of infectious disease epidemics in populations and in prospective evaluation of intervention policies. However, calculating the likelihood for discretely observed data from even simple models -- such as the ubiquitous susceptible-infectious-removed (SIR) model -- has been considered computationally intractable since its formulation almost a century ago. Recently, researchers have proposed methods to circumvent this limitation through data augmentation or approximation, but these approaches often suffer from high computational cost or loss of accuracy. We develop the mathematical foundation and an efficient algorithm to compute the likelihood for discretely observed data from a broad class of stochastic compartmental models. We also give expressions for the derivatives of the transition probabilities using the same technique, making possible inference via Hamiltonian Monte Carlo (HMC). We use the 17th century plague in Eyam, a classic example of the SIR model, to compare our recursion method to sequential Monte Carlo, analyze using HMC, and assess the model assumptions. We also apply our direct likelihood evaluation to perform Bayesian inference for the 2014-2015 Ebola outbreak in Guinea. The results suggest that the epidemic infectious rates have decreased since October 2014 in the Southeast region of Guinea, while rates remain the same in other regions, facilitating understanding of the outbreak and the effectiveness of Ebola control interventions.
Submitted 25 July, 2018; v1 submitted 24 August, 2016;
originally announced August 2016.
-
Submodular Variational Inference for Network Reconstruction
Authors:
Lin Chen,
Forrest W Crawford,
Amin Karbasi
Abstract:
In real-world and online social networks, individuals receive and transmit information in real time. Cascading information transmissions (e.g. phone calls, text messages, social media posts) may be understood as a realization of a diffusion process operating on the network, and its branching path can be represented by a directed tree. The process only traverses and thus reveals a limited portion of the edges. The network reconstruction/inference problem is to infer the unrevealed connections. Most existing approaches derive a likelihood and attempt to find the network topology maximizing the likelihood, a problem that is highly intractable. In this paper, we focus on the network reconstruction problem for a broad class of real-world diffusion processes, exemplified by a network diffusion scheme called respondent-driven sampling (RDS). We prove that under realistic and general models of network diffusion, the posterior distribution of an observed RDS realization is a Bayesian log-submodular model. We then propose VINE (Variational Inference for Network rEconstruction), a novel, accurate, and computationally efficient variational inference algorithm, for the network reconstruction problem under this model. Crucially, we do not assume any particular probabilistic model for the underlying network. VINE recovers any connected graph with high accuracy as shown by our experimental results on real-life networks.
Submitted 10 July, 2017; v1 submitted 28 March, 2016;
originally announced March 2016.
-
Birth/birth-death processes and their computable transition probabilities with biological applications
Authors:
Lam Si Tung Ho,
Jason Xu,
Forrest W. Crawford,
Vladimir N. Minin,
Marc A. Suchard
Abstract:
Birth-death processes track the size of a univariate population, but many biological systems involve interaction between populations, necessitating models for two or more populations simultaneously. A lack of efficient methods for evaluating finite-time transition probabilities of bivariate processes, however, has restricted statistical inference in these models. Researchers rely on computationally expensive methods such as matrix exponentiation or Monte Carlo approximation, restricting likelihood-based inference to small systems, or indirect methods such as approximate Bayesian computation. In this paper, we introduce the birth(death)/birth-death process, a tractable bivariate extension of the birth-death process. We develop an efficient and robust algorithm to calculate the transition probabilities of birth(death)/birth-death processes using a continued fraction representation of their Laplace transforms. Next, we identify several exemplary models arising in molecular epidemiology, macro-parasite evolution, and infectious disease modeling that fall within this class, and demonstrate advantages of our proposed method over existing approaches to inference in these models. Notably, the ubiquitous stochastic susceptible-infectious-removed (SIR) model falls within this class, and we emphasize that computable transition probabilities newly enable direct inference of parameters in the SIR model. We also propose a very fast method for approximating the transition probabilities under the SIR model via a novel branching process simplification, and compare it to the continued fraction representation method with application to the 17th century plague in Eyam. Although the two methods produce similar maximum a posteriori estimates, the branching process approximation fails to capture the correlation structure in the joint posterior distribution.
Submitted 7 August, 2017; v1 submitted 11 March, 2016;
originally announced March 2016.
-
Confidence intervals for means under constrained dependence
Authors:
Peter M. Aronow,
Forrest W. Crawford,
José R. Zubizarreta
Abstract:
We develop a general framework for conducting inference on the mean of dependent random variables given constraints on their dependency graph. We establish the consistency of an oracle variance estimator of the mean when the dependency graph is known, along with an associated central limit theorem. We derive an integer linear program for finding an upper bound for the estimated variance when the graph is unknown, but topological and degree-based constraints are available. We develop alternative bounds, including a closed-form bound, under an additional homoskedasticity assumption. We establish a basis for Wald-type confidence intervals for the mean that are guaranteed to have asymptotically conservative coverage. We apply the approach to inference from a social network link-tracing study and provide statistical software implementing the approach.
Submitted 31 January, 2016;
originally announced February 2016.
-
Identification of homophily and preferential recruitment in respondent-driven sampling
Authors:
Forrest W. Crawford,
Peter M. Aronow,
Li Zeng,
Jianghong Li
Abstract:
Respondent-driven sampling (RDS) is a link-tracing procedure for surveying hidden or hard-to-reach populations in which subjects recruit other subjects via their social network. There is significant research interest in detecting clustering or dependence of epidemiological traits in networks, but researchers disagree about whether data from RDS studies can reveal it. Two distinct mechanisms account for dependence in traits of recruiters and recruitees in an RDS study: homophily, the tendency for individuals to share social ties with others exhibiting similar characteristics, and preferential recruitment, in which recruiters do not recruit uniformly at random from their available alters. The different effects of network homophily and preferential recruitment in RDS studies have been a source of confusion in methodological research on RDS, and in empirical studies of the social context of health risk in hidden populations. In this paper, we give rigorous definitions of homophily and preferential recruitment and show that neither can be measured precisely in general RDS studies. We derive nonparametric identification regions for homophily and preferential recruitment and show that these parameters are not point identified unless the network takes a degenerate form. The results indicate that claims of homophily or recruitment bias measured from empirical RDS studies may not be credible. We apply our identification results to a study involving both a network census and RDS on a population of injection drug users in Hartford, CT.
Submitted 17 November, 2015;
originally announced November 2015.
-
Seeing the Unseen Network: Inferring Hidden Social Ties from Respondent-Driven Sampling
Authors:
Lin Chen,
Forrest W. Crawford,
Amin Karbasi
Abstract:
Learning about the social structure of hidden and hard-to-reach populations --- such as drug users and sex workers --- is a major goal of epidemiological and public health research on risk behaviors and disease prevention. Respondent-driven sampling (RDS) is a peer-referral process widely used by many health organizations, where research subjects recruit other subjects from their social network. In such surveys, researchers observe who recruited whom, along with the time of recruitment and the total number of acquaintances (network degree) of respondents. However, due to privacy concerns, the identities of acquaintances are not disclosed. In this work, we show how to reconstruct the underlying network structure through which the subjects are recruited. We formulate the dynamics of RDS as a continuous-time diffusion process over the underlying graph and derive the likelihood for the recruitment time series under an arbitrary recruitment time distribution. We develop an efficient stochastic optimization algorithm called RENDER (REspoNdent-Driven nEtwork Reconstruction) that finds the network that best explains the collected data. We support our analytical results through an exhaustive set of experiments on both synthetic and real data.
Submitted 1 December, 2015; v1 submitted 12 November, 2015;
originally announced November 2015.
-
Hidden population size estimation from respondent-driven sampling: a network approach
Authors:
Forrest W. Crawford,
Jiacheng Wu,
Robert Heimer
Abstract:
Estimating the size of stigmatized, hidden, or hard-to-reach populations is a major problem in epidemiology, demography, and public health research. Capture-recapture and multiplier methods have become standard tools for inference of hidden population sizes, but they require independent random sampling of target population members, which is rarely possible. Respondent-driven sampling (RDS) is a survey method for hidden populations that relies on social link tracing. The RDS recruitment process is designed to spread through the social network connecting members of the target population. In this paper, we show how to use network data revealed by RDS to estimate hidden population size. The key insight is that the recruitment chain, timing of recruitments, and network degrees of recruited subjects provide information about the number of individuals belonging to the target population who are not yet in the sample. We use a computationally efficient Bayesian method to integrate over the missing edges in the subgraph of recruited individuals. We validate the method using simulated data and apply the technique to estimate the number of people who inject drugs in St. Petersburg, Russia.
Submitted 30 April, 2015;
originally announced April 2015.
-
Nonparametric Identification for Respondent-Driven Sampling
Authors:
Peter M. Aronow,
Forrest W. Crawford
Abstract:
Respondent-driven sampling is a survey method for hidden or hard-to-reach populations in which sampled individuals recruit others in the study population via their social links. The most popular estimator for the population mean assumes that individual sampling probabilities are proportional to each subject's reported degree in a social network connecting members of the hidden population. However, it remains unclear under what circumstances these estimators are valid, and what assumptions are formally required to identify population quantities. In this short note we detail nonparametric identification results for the population mean when the sampling probability is assumed to be a function of network degree known to scale. Importantly, we establish general conditions for the consistency of the popular Volz-Heckathorn (VH) estimator. Our results imply that the conditions for consistency of the VH estimator are far less stringent than those suggested by recent work on diagnostics for RDS. In particular, our results do not require random sampling or the existence of a network connecting the population.
△ Less
Submitted 14 April, 2015;
originally announced April 2015.
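The abstract above refers to the Volz-Heckathorn (VH) estimator, which weights each respondent inversely to reported network degree. A minimal Python sketch of that weighting is given below as an illustration only; the function name `vh_estimate` and its arguments are hypothetical and are not taken from the paper or from any RDS software package.

```python
def vh_estimate(values, degrees):
    """Inverse-degree weighted mean (the usual form of the VH estimator).

    values  -- observed trait values y_i for sampled respondents
    degrees -- reported network degrees d_i (must be positive)
    Hypothetical helper for illustration; not the authors' code.
    """
    if len(values) != len(degrees):
        raise ValueError("values and degrees must have the same length")
    if len(values) == 0:
        raise ValueError("at least one respondent is required")
    if any(d <= 0 for d in degrees):
        raise ValueError("degrees must be positive")
    # Sampling probability is assumed proportional to degree,
    # so each respondent receives weight 1 / d_i.
    weights = [1.0 / d for d in degrees]
    weighted_sum = sum(w * y for w, y in zip(weights, values))
    return weighted_sum / sum(weights)
```

When all degrees are equal, the estimate reduces to the ordinary sample mean; high-degree respondents are down-weighted otherwise.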
-
The graphical structure of respondent-driven sampling
Authors:
Forrest W. Crawford
Abstract:
Respondent-driven sampling (RDS) is a chain-referral method for sampling members of a hidden or hard-to-reach population such as sex workers, homeless people, or drug users via their social network. Most methodological work on RDS has focused on inference of population means under the assumption that subjects' network degree determines their probability of being sampled. Criticism of existing estimators is usually focused on missing data: the underlying network is only partially observed, so it is difficult to determine correct sampling probabilities. In this paper, we show that data collected in ordinary RDS studies contain information about the structure of the respondents' social network. We construct a continuous-time model of RDS recruitment that incorporates the time series of recruitment events, the pattern of coupon use, and the network degrees of sampled subjects. Together, the observed data and the recruitment model place a well-defined probability distribution on the recruitment-induced subgraph of respondents. We show that this distribution can be interpreted as an exponential random graph model and develop a computationally efficient method for estimating the hidden graph. We validate the method using simulated data and apply the technique to an RDS study of injection drug users in St. Petersburg, Russia.
Submitted 31 July, 2015; v1 submitted 3 June, 2014;
originally announced June 2014.
-
Sex, lies and self-reported counts: Bayesian mixture models for heaping in longitudinal count data via birth-death processes
Authors:
Forrest W. Crawford,
Robert E. Weiss,
Marc A. Suchard
Abstract:
Surveys often ask respondents to report nonnegative counts, but respondents may misremember or round to a nearby multiple of 5 or 10. This phenomenon is called heaping, and the error inherent in heaped self-reported numbers can bias estimation. Heaped data may be collected cross-sectionally or longitudinally and there may be covariates that complicate the inferential task. Heaping is a well-known issue in many survey settings, and inference for heaped data is an important statistical problem. We propose a novel reporting distribution whose underlying parameters are readily interpretable as rates of misremembering and rounding. The process accommodates a variety of heaping grids and allows for quasi-heaping to values nearly but not equal to heaping multiples. We present a Bayesian hierarchical model for longitudinal samples with covariates to infer both the unobserved true distribution of counts and the parameters that control the heaping process. Finally, we apply our methods to longitudinal self-reported counts of sex partners in a study of high-risk behavior in HIV-positive youth.
Submitted 14 September, 2015; v1 submitted 16 May, 2014;
originally announced May 2014.
-
On the distribution of interspecies correlation for Markov models of character evolution on Yule trees
Authors:
Willem H. Mulder,
Forrest W. Crawford
Abstract:
Efforts to reconstruct phylogenetic trees and understand evolutionary processes depend fundamentally on stochastic models of speciation and mutation. The simplest continuous-time model for speciation in phylogenetic trees is the Yule process, in which new species are "born" from existing lineages at a constant rate. Recent work has illuminated some of the structural properties of Yule trees, but it remains mostly unknown how these properties affect sequence and trait patterns observed at the tips of the phylogenetic tree. Understanding the interplay between speciation and mutation under simple models of evolution is essential for deriving valid phylogenetic inference methods and gives insight into the optimal design of phylogenetic studies. In this work, we derive the probability distribution of interspecies covariance under Brownian motion and Ornstein-Uhlenbeck models of phenotypic change on a Yule tree. We compute the probability distribution of the number of mutations shared between two randomly chosen taxa in a Yule tree under discrete Markov mutation models. Our results suggest summary measures of phylogenetic information content, illuminate the correlation between site patterns in sequences or traits of related organisms, and provide heuristics for experimental design and reconstruction of phylogenetic trees.
Submitted 15 August, 2014; v1 submitted 17 March, 2014;
originally announced March 2014.
-
Combining List Experiment and Direct Question Estimates of Sensitive Behavior Prevalence
Authors:
Peter M. Aronow,
Alexander Coppock,
Forrest W. Crawford,
Donald P. Green
Abstract:
Survey respondents may give untruthful answers to sensitive questions when asked directly. In recent years, researchers have turned to the list experiment (also known as the item count technique) to overcome this difficulty. While list experiments may be less prone to bias than direct questioning, list experiments are also more susceptible to sampling variability. We show that researchers do not have to abandon direct questioning altogether in order to gain the advantages of list experimentation. We develop a nonparametric estimator of the prevalence of sensitive behaviors that combines list experimentation and direct questioning. We prove that this estimator is asymptotically more efficient than the standard difference-in-means estimator, and we provide a basis for inference using Wald-type confidence intervals. Additionally, leveraging information from the direct questioning, we derive two nonparametric placebo tests of the identifying assumptions for the list experiment. We demonstrate the effectiveness of our combined estimator and placebo tests with an original survey experiment.
Submitted 1 June, 2014; v1 submitted 4 December, 2013;
originally announced December 2013.
-
Markov counting models for correlated binary responses
Authors:
Forrest W. Crawford,
Daniel Zelterman
Abstract:
We propose a class of continuous-time Markov counting processes for analyzing correlated binary data and establish a correspondence between these models and sums of exchangeable Bernoulli random variables. Our approach generalizes many previous models for correlated outcomes, admits easily interpretable parameterizations, allows different cluster sizes, and incorporates ascertainment bias in a natural way. We demonstrate several new models for dependent outcomes and provide algorithms for computing maximum likelihood estimates. We show how to incorporate cluster-specific covariates in a regression setting and demonstrate improved fits to well-known datasets from familial disease epidemiology and developmental toxicology.
Submitted 26 August, 2014; v1 submitted 7 May, 2013;
originally announced May 2013.
-
Birth-death processes
Authors:
Forrest W. Crawford,
Marc A. Suchard
Abstract:
Many important stochastic counting models can be written as general birth-death processes (BDPs). BDPs are continuous-time Markov chains on the non-negative integers and can be used to easily parameterize a rich variety of probability distributions. Although the theoretical properties of general BDPs are well understood, traditionally statistical work on BDPs has been limited to the simple linear (Kendall) process, which arises in ecology and evolutionary applications. Aside from a few simple cases, it remains impossible to find analytic expressions for the likelihood of a discretely-observed BDP, and computational difficulties have hindered development of tools for statistical inference. But the gap between BDP theory and practical methods for estimation has narrowed in recent years. There are now robust methods for evaluating likelihoods for realizations of BDPs: finite-time transition, first passage, equilibrium probabilities, and distributions of summary statistics that arise commonly in applications. Recent work has also exploited the connection between continuously- and discretely-observed BDPs to derive EM algorithms for maximum likelihood estimation. Likelihood-based inference for previously intractable BDPs is much easier than previously thought and regression approaches analogous to Poisson regression are straightforward to derive. In this review, we outline the basic mathematical theory for BDPs and demonstrate new tools for statistical inference using data from BDPs. We give six examples of BDPs and derive EM algorithms to fit their parameters by maximum likelihood. We show how to compute the distribution of integral summary statistics and give an example application to the total cost of an epidemic. Finally, we suggest future directions for innovation in this important class of stochastic processes.
Submitted 25 July, 2014; v1 submitted 7 January, 2013;
originally announced January 2013.
-
Diversity, disparity, and evolutionary rate estimation for unresolved Yule trees
Authors:
Forrest W. Crawford,
Marc A. Suchard
Abstract:
The branching structure of biological evolution confers statistical dependencies on phenotypic trait values in related organisms. For this reason, comparative macroevolutionary studies usually begin with an inferred phylogeny that describes the evolutionary relationships of the organisms of interest. The probability of the observed trait data can be computed by assuming a model for trait evolution, such as Brownian motion, over the branches of this fixed tree. However, the phylogenetic tree itself contributes statistical uncertainty to estimates of other evolutionary quantities, and many comparative evolutionary biologists regard the tree as a nuisance parameter. In this paper, we present a framework for analytically integrating over unknown phylogenetic trees in comparative evolutionary studies by assuming that the tree arises from a continuous-time Markov branching model called the Yule process. To do this, we derive a closed-form expression for the distribution of phylogenetic diversity, which is the sum of branch lengths connecting a set of taxa. We then present a generalization of phylogenetic diversity which is equivalent to the expected trait disparity in a set of taxa whose evolutionary relationships are generated by a Yule process and whose traits evolve by Brownian motion. We derive expressions for the distribution of expected trait disparity under a Yule tree. Given one or more observations of trait disparity in a clade, we perform fast likelihood-based estimation of the Brownian variance for unresolved clades. Our method does not require simulation or a fixed phylogenetic tree. We conclude with a brief example illustrating Brownian rate estimation for thirteen families in the Mammalian order Carnivora, in which the phylogenetic tree for each family is unresolved.
Submitted 20 July, 2012;
originally announced July 2012.
-
Transition probabilities for general birth-death processes with applications in ecology, genetics, and evolution
Authors:
Forrest W. Crawford,
Marc A. Suchard
Abstract:
A birth-death process is a continuous-time Markov chain that counts the number of particles in a system over time. In the general process with $n$ current particles, a new particle is born with instantaneous rate $λ_n$ and a particle dies with instantaneous rate $μ_n$. Currently no robust and efficient method exists to evaluate the finite-time transition probabilities in a general birth-death process with arbitrary birth and death rates. In this paper, we first revisit the theory of continued fractions to obtain expressions for the Laplace transforms of these transition probabilities and make explicit an important derivation connecting transition probabilities and continued fractions. We then develop an efficient algorithm for computing these probabilities that analyzes the error associated with approximations in the method. We demonstrate that this error-controlled method agrees with known solutions and outperforms previous approaches to computing these probabilities. Finally, we apply our novel method to several important problems in ecology, evolution, and genetics.
Submitted 28 November, 2011;
originally announced November 2011.
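The general birth-death process defined in the abstract above, with state-dependent rates $λ_n$ and $μ_n$, can be simulated directly. The Python sketch below uses the standard Gillespie (exact stochastic simulation) scheme; it is an illustration only and is not the continued-fraction method of the paper. The function name `simulate_bdp` and its parameters are hypothetical.

```python
import random


def simulate_bdp(n0, birth_rate, death_rate, t_max, rng=None):
    """Simulate one path of a general birth-death process on [0, t_max].

    n0         -- initial number of particles
    birth_rate -- callable n -> lambda_n (assumed non-negative)
    death_rate -- callable n -> mu_n (assumed non-negative)
    Returns a list of (time, state) pairs, starting with (0.0, n0).
    Hypothetical helper for illustration; not the authors' algorithm.
    """
    rng = rng or random.Random()
    t, n = 0.0, n0
    path = [(t, n)]
    while True:
        lam = birth_rate(n)
        # No deaths are possible from the empty state.
        mu = death_rate(n) if n > 0 else 0.0
        total = lam + mu
        if total <= 0.0:
            break  # absorbing state: no further events
        # Waiting time to the next event is exponential with rate lam + mu.
        t += rng.expovariate(total)
        if t > t_max:
            break
        # The event is a birth with probability lam / (lam + mu).
        n += 1 if rng.random() < lam / total else -1
        path.append((t, n))
    return path
```

For example, `simulate_bdp(5, lambda n: 0.5 * n, lambda n: 0.3 * n, 10.0)` draws a path of the simple linear process with per-particle birth rate 0.5 and death rate 0.3. Averaging an indicator of the final state over many such paths gives a Monte Carlo estimate of a finite-time transition probability, which is the quantity the paper computes without simulation.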
-
Estimation for general birth-death processes
Authors:
Forrest W. Crawford,
Vladimir N. Minin,
Marc A. Suchard
Abstract:
Birth-death processes (BDPs) are continuous-time Markov chains that track the number of "particles" in a system over time. While widely used in population biology, genetics and ecology, statistical inference of the instantaneous particle birth and death rates remains largely limited to restrictive linear BDPs in which per-particle birth and death rates are constant. Researchers often observe the number of particles at discrete times, necessitating data augmentation procedures such as expectation-maximization (EM) to find maximum likelihood estimates. The E-step in the EM algorithm is available in closed-form for some linear BDPs, but otherwise previous work has resorted to approximation or simulation. Remarkably, the E-step conditional expectations can also be expressed as convolutions of computable transition probabilities for any general BDP with arbitrary rates. This important observation, along with a convenient continued fraction representation of the Laplace transforms of the transition probabilities, allows novel and efficient computation of the conditional expectations for all BDPs, eliminating the need for approximation or costly simulation. We use this insight to derive EM algorithms that yield maximum likelihood estimation for general BDPs characterized by various rate models, including generalized linear models. We show that our Laplace convolution technique outperforms competing methods when available and demonstrate a technique to accelerate EM algorithm convergence. Finally, we validate our approach using synthetic data and then apply our methods to estimation of mutation parameters in microsatellite evolution.
Submitted 21 November, 2011;
originally announced November 2011.