Autoregressive Networks with Dependent Edges

Authors: Jinyuan Chang, Qin Fang, Eric D. Kolaczyk, Peter W. MacDonald, Qiwei Yao

Abstract: We propose an autoregressive framework for modelling dynamic networks with dependent edges. It encompasses models that accommodate, for example, transitivity, density dependence, and other stylized features often observed in real network data. By assuming the edges of the network at each time are independent conditionally on their lagged values, the models, which exhibit a close connection with temporal ERGMs, facilitate both simulation and maximum likelihood estimation in a straightforward manner. Due to the possibly large number of parameters in the models, the initial MLEs may suffer from slow convergence rates. An improved estimator for each component parameter is proposed via an iteration based on a projection that mitigates the impact of the other parameters (Chang et al., 2021, 2023). Based on a martingale difference structure, the asymptotic distribution of the improved estimator is derived without the stationarity assumption. The limiting distribution is not normal in general, and it reduces to normal when the underlying process satisfies some mixing conditions. Illustration with a transitivity model was carried out in both simulation and a real network data set.

Submitted 24 April, 2024; originally announced April 2024.

Comments: 27 pages, 2 tables, 4 figures

arXiv:2403.07124 [pdf, other]

Stochastic gradient descent-based inference for dynamic network models with attractors

Authors: Hancong Pan, Xiaojing Zhu, Cantay Caliskan, Dino P. Christenson, Konstantinos Spiliopoulos, Dylan Walker, Eric D. Kolaczyk

Abstract: In Coevolving Latent Space Networks with Attractors (CLSNA) models, nodes in a latent space represent social actors, and edges indicate their dynamic interactions. Attractors are added at the latent level to capture the notion of attractive and repulsive forces between nodes, borrowing from dynamical systems theory. However, CLSNA's reliance on MCMC estimation makes scaling difficult, and the requirement for nodes to be present throughout the study period limits practical applications. We address these issues by (i) introducing a stochastic gradient descent (SGD) parameter estimation method, (ii) developing a novel approach for uncertainty quantification using SGD, and (iii) extending the model to allow nodes to join and leave over time. Simulation results show that our extensions result in little loss of accuracy compared to MCMC, but can scale to much larger networks. We apply our approach to the longitudinal social networks of members of US Congress on the social media platform X. Accounting for node dynamics overcomes selection bias in the network and uncovers uniquely and increasingly repulsive forces within the Republican Party.

Submitted 20 March, 2024; v1 submitted 11 March, 2024; originally announced March 2024.

arXiv:2308.00836 [pdf, other]

Differentially Private Linear Regression with Linked Data

Authors: Shurong Lin, Elliot Paquette, Eric D. Kolaczyk

Abstract: There has been increasing demand for establishing privacy-preserving methodologies for modern statistics and machine learning. Differential privacy, a mathematical notion from computer science, is a rising tool offering robust privacy guarantees. Recent work focuses primarily on developing differentially private versions of individual statistical and machine learning tasks, with nontrivial upstream pre-processing typically not incorporated. An important example is when record linkage is done prior to downstream modeling. Record linkage refers to the statistical task of linking two or more data sets of the same group of entities without a unique identifier. This probabilistic procedure brings additional uncertainty to the subsequent task. In this paper, we present two differentially private algorithms for linear regression with linked data. In particular, we propose a noisy gradient method and a sufficient statistics perturbation approach for the estimation of regression coefficients. We investigate the privacy-accuracy tradeoff by providing finite-sample error bounds for the estimators, which allows us to understand the relative contributions of linkage error, estimation error, and the cost of privacy. The variances of the estimators are also discussed. We demonstrate the performance of the proposed algorithms through simulations and an application to synthetic data.

Submitted 7 May, 2024; v1 submitted 1 August, 2023; originally announced August 2023.

MSC Class: 68P27; 62-XX ACM Class: G.3; I.0

arXiv:2301.08324 [pdf, other]

doi 10.1214/24-EJS2234

Differentially Private Confidence Intervals for Proportions under Stratified Random Sampling

Authors: Shurong Lin, Mark Bun, Marco Gaboardi, Eric D. Kolaczyk, Adam Smith

Abstract: Confidence intervals are a fundamental tool for quantifying the uncertainty of parameters of interest. With the increase of data privacy awareness, developing a private version of confidence intervals has gained growing attention from both statisticians and computer scientists. Differential privacy is a state-of-the-art framework for analyzing privacy loss when releasing statistics computed from sensitive data. Recent work has been done around differentially private confidence intervals, yet to the best of our knowledge, rigorous methodologies on differentially private confidence intervals in the context of survey sampling have not been studied. In this paper, we propose three differentially private algorithms for constructing confidence intervals for proportions under stratified random sampling. We articulate two variants of differential privacy that make sense for data from stratified sampling designs, analyzing each of our algorithms within one of these two variants. We establish analytical privacy guarantees and asymptotic properties of the estimators. In addition, we conduct simulation studies to evaluate the proposed private confidence intervals, and two applications to the 1940 Census data are provided.

Submitted 11 April, 2024; v1 submitted 19 January, 2023; originally announced January 2023.

Comments: 39 pages, 4 figures

MSC Class: 68P27; 62G15; 62Dxx

Journal ref: Electronic Journal of Statistics, Electron. J. Statist. 18(1), 1455-1494, (2024)

arXiv:2202.10513 [pdf, other]

Quantifying Uncertainty for Temporal Motif Estimation in Graph Streams under Sampling

Authors: Xiaojing Zhu, Eric D. Kolaczyk

Abstract: Dynamic networks, a.k.a. graph streams, consist of a set of vertices and a collection of timestamped interaction events (i.e., temporal edges) between vertices. Temporal motifs are defined as classes of (small) isomorphic induced subgraphs on graph streams, considering both edge ordering and duration. As with motifs in static networks, temporal motifs are the fundamental building blocks for temporal structures in dynamic networks. Several methods have been designed to count the occurrences of temporal motifs in graph streams, with recent work focusing on estimating the count under various sampling schemes along with concentration properties. However, little attention has been given to the problem of uncertainty quantification and the asymptotic statistical properties for such count estimators. In this work, we establish the consistency and the asymptotic normality of a certain Horvitz-Thompson type of estimator in an edge sampling framework for deterministic graph streams, which can be used to construct confidence intervals and conduct hypothesis testing for the temporal motif count under sampling. We also establish similar results under an analogous stochastic model. Our results are relevant to a wide range of applications in social, communication, biological, and brain networks, for tasks involving pattern discovery.

Submitted 21 February, 2022; originally announced February 2022.

arXiv:2112.10151 [pdf, ps, other]

doi 10.1214/24-AOS2365

Edge differentially private estimation in the $β$-model via jittering and method of moments

Authors: Jinyuan Chang, Qiao Hu, Eric D. Kolaczyk, Qiwei Yao, Fengting Yi

Abstract: A standing challenge in data privacy is the trade-off between the level of privacy and the efficiency of statistical inference. Here we conduct an in-depth study of this trade-off for parameter estimation in the $β$-model (Chatterjee, Diaconis and Sly, 2011) for edge differentially private network data released via jittering (Karwa, Krivitsky and Slavković, 2017). Unlike most previous approaches based on maximum likelihood estimation for this network model, we proceed via method-of-moments. This choice facilitates our exploration of a substantially broader range of privacy levels - corresponding to stricter privacy - than has been explored to date. Over this new range we discover our proposed estimator for the parameters exhibits an interesting phase transition, with both its convergence rate and asymptotic variance following one of three different regimes of behavior depending on the level of privacy. Because identification of the operable regime is difficult if not impossible in practice, we devise a novel adaptive bootstrap procedure to construct uniform inference across different phases. In fact, leveraging this bootstrap we are able to provide for simultaneous inference of all parameters in the $β$-model (i.e., equal to the number of nodes), which, to the best of our knowledge, is the first result of its kind. Numerical experiments confirm the competitive and reliable finite sample performance of the proposed inference methods, next to a comparable maximum likelihood method, as well as significant advantages in terms of computational speed and memory.

Submitted 2 April, 2024; v1 submitted 19 December, 2021; originally announced December 2021.

Journal ref: Annals of Statistics 2024, Vol. 52, pp. 708-728
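
Illustrative note on the entry above: the abstract describes releasing network data via jittering and then correcting for the added noise with a method of moments. The sketch below is not the paper's $β$-model estimator. It assumes a randomized-response form of jittering in which each dyad is flipped independently with probability q = 1/(1 + e^ε), and it debiases only the edge density. The function names `jitter`, `edge_density`, and `debiased_density` are hypothetical.

```python
import math
import random


def jitter(adj, epsilon, rng):
    """Flip each dyad of a symmetric 0/1 adjacency matrix independently.

    Assumption: flip probability q = 1 / (1 + e^epsilon), the standard
    randomized-response choice for which (1 - q) / q = e^epsilon.
    """
    n = len(adj)
    q = 1.0 / (1.0 + math.exp(epsilon))
    out = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            bit = adj[i][j]
            if rng.random() < q:
                bit = 1 - bit
            out[i][j] = out[j][i] = bit
    return out


def edge_density(adj):
    """Fraction of dyads (i < j) that carry an edge."""
    n = len(adj)
    edges = sum(adj[i][j] for i in range(n) for j in range(i + 1, n))
    return edges / (n * (n - 1) / 2)


def debiased_density(noisy_adj, epsilon):
    """Method-of-moments correction for the density of a jittered graph.

    Each observed dyad has mean q + (1 - 2q) * a for true value a, so the
    true density is recovered in expectation by inverting that map.
    Requires epsilon > 0 so that 1 - 2q > 0.
    """
    q = 1.0 / (1.0 + math.exp(epsilon))
    return (edge_density(noisy_adj) - q) / (1.0 - 2.0 * q)
```

Smaller ε means q closer to 1/2, so the divisor 1 - 2q shrinks and the variance of the corrected density grows, which is one simple way to see the privacy-efficiency trade-off the abstract refers to.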

arXiv:2109.13129 [pdf, other]

Disentangling positive and negative partisanship in social media interactions using a coevolving latent space network with attractors model

Authors: Xiaojing Zhu, Cantay Caliskan, Dino P. Christenson, Konstantinos Spiliopoulos, Dylan Walker, Eric D. Kolaczyk

Abstract: We develop a broadly applicable class of coevolving latent space network with attractors (CLSNA) models, where nodes represent individual social actors assumed to lie in an unknown latent space, edges represent the presence of a specified interaction between actors, and attractors are added at the latent level to capture the notion of attractive and repulsive forces. We apply the CLSNA models to understand the dynamics of partisan polarization on social media, where we expect Republicans and Democrats to increasingly interact with their own party and disengage from the opposing party. Using longitudinal social networks from the social media platforms Twitter and Reddit, we investigate the relative contributions of positive (attractive) and negative (repulsive) forces among political elites and the public, respectively. Our goals are to disentangle the positive and negative forces within and between parties and explore if and how they change over time. Our analysis confirms the existence of partisan polarization in social media interactions among both political elites and the public. Moreover, while positive partisanship is the driving force of interactions across the full periods of study for both the public and Democratic elites, negative partisanship has come to dominate Republican elites' interactions since the run-up to the 2016 presidential election.

Submitted 13 August, 2022; v1 submitted 27 September, 2021; originally announced September 2021.

Comments: revised version

arXiv:2105.04518 [pdf, other]

Causal Inference under Network Interference with Noise

Authors: Wenrui Li, Daniel L. Sussman, Eric D. Kolaczyk

Abstract: Increasingly, there is a marked interest in estimating causal effects under network interference due to the fact that interference manifests naturally in networked experiments. However, network information generally is available only up to some level of error. We study the propagation of such errors to estimators of average causal effects under network interference. Specifically, assuming a four-level exposure model and Bernoulli random assignment of treatment, we characterize the impact of network noise on the bias and variance of standard estimators in homogeneous and inhomogeneous networks. In addition, we propose method-of-moments estimators for bias reduction where a minimal number of network replicates are available. We show our estimators are asymptotically normal and provide confidence intervals for quantifying the uncertainty in these estimates. We illustrate the practical performance of our estimators through simulation studies in British secondary school contact networks.

Submitted 31 August, 2022; v1 submitted 10 May, 2021; originally announced May 2021.

Comments: 68 pages, 1 figure

arXiv:2104.14952 [pdf, other]

doi 10.1109/IEEECONF53345.2021.9723092

Network Recovery from Unlabeled Noisy Samples

Authors: Nathaniel Josephs, Wenrui Li, Eric D. Kolaczyk

Abstract: There is a growing literature on the statistical analysis of multiple networks in which the network is the fundamental data object. However, most of this work requires networks on a shared set of labeled vertices. In this work, we consider the question of recovering a parent network based on noisy unlabeled samples. We identify a specific regime in the noisy network literature for recovery that is asymptotically unbiased and computationally tractable based on a three-stage recovery procedure: first, we align the networks via a sequential pairwise graph matching procedure; next, we compute the sample average of the aligned networks; finally, we obtain an estimate of the parent by thresholding the sample average. Previous work on multiple unlabeled networks is only possible for trivial networks due to the complexity of brute-force computations.

Submitted 30 April, 2021; originally announced April 2021.
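
Illustrative note on the entry above: the second and third stages of the recovery procedure (averaging, then thresholding) can be sketched as follows. This is a minimal sketch that assumes the first stage, graph matching, has already been carried out, so every sample shares one vertex ordering. The function name `recover_parent` and the default threshold of 0.5 are assumptions, not taken from the paper.

```python
def recover_parent(aligned_samples, threshold=0.5):
    """Estimate a parent network from aligned noisy 0/1 adjacency matrices.

    Stage 2: entrywise sample average of the aligned networks.
    Stage 3: keep an edge wherever the average exceeds `threshold`.
    Assumes all matrices are square, of equal size, and already aligned.
    """
    m = len(aligned_samples)
    n = len(aligned_samples[0])
    average = [
        [sum(sample[i][j] for sample in aligned_samples) / m for j in range(n)]
        for i in range(n)
    ]
    return [
        [1 if average[i][j] > threshold else 0 for j in range(n)]
        for i in range(n)
    ]
```

With a threshold of 0.5 this reduces to an entrywise majority vote across the samples.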

arXiv:2102.11948 [pdf, other]

Inferring the Type of Phase Transitions Undergone in Epileptic Seizures Using Random Graph Hidden Markov Models for Percolation in Noisy Dynamic Networks

Authors: Xiaojing Zhu, Heather Shappell, Mark A. Kramer, Catherine J. Chu, Eric D. Kolaczyk

Abstract: In clinical neuroscience, epileptic seizures have been associated with the sudden emergence of coupled activity across the brain. The resulting functional networks - in which edges indicate strong enough coupling between brain regions - are consistent with the notion of percolation, which is a phenomenon in complex networks corresponding to the sudden emergence of a giant connected component. Traditionally, work has concentrated on noise-free percolation with a monotonic process of network growth, but real-world networks are more complex. We develop a class of random graph hidden Markov models (RG-HMMs) for characterizing percolation regimes in noisy, dynamically evolving networks in the presence of edge birth and edge death, as well as noise. This class is used to understand the type of phase transitions undergone in a seizure and, in particular, to distinguish between different percolation regimes in epileptic seizures. We develop a hypothesis testing framework for inferring putative percolation mechanisms. As a necessary precursor, we present an EM algorithm for estimating parameters from a sequence of noisy networks only observed at a longitudinal subsampling of time points. Our results suggest that different types of percolation can occur in human seizures. The type inferred may suggest tailored treatment strategies and provide new insights into the fundamental science of epilepsy.

Submitted 23 February, 2021; originally announced February 2021.

arXiv:2011.12416 [pdf, other]

A spectral-based framework for hypothesis testing in populations of networks

Authors: Li Chen, Nathaniel Josephs, Lizhen Lin, Jie Zhou, Eric D. Kolaczyk

Abstract: In this paper, we propose a new spectral-based approach to hypothesis testing for populations of networks. The primary goal is to develop a test to determine whether two given samples of networks come from the same random model or distribution. Our test statistic is based on the trace of the third order for a centered and scaled adjacency matrix, which we prove converges to the standard normal distribution as the number of nodes tends to infinity. The asymptotic power guarantee of the test is also provided. The proper interplay between the number of networks and the number of nodes for each network is explored in characterizing the theoretical properties of the proposed testing statistics. Our tests are applicable to both binary and weighted networks, operate under a very general framework where the networks are allowed to be large and sparse, and can be extended to multiple-sample testing. We provide an extensive simulation study to demonstrate the superior performance of our test over existing methods and apply our test to three real datasets.

Submitted 24 November, 2020; originally announced November 2020.

arXiv:2011.00138 [pdf, other]

doi 10.1371/journal.pcbi.1008545

Sensor-based localization of epidemic sources on human mobility networks

Authors: Jun Li, Juliane Manitz, Enrico Bertuzzo, Eric D. Kolaczyk

Abstract: We investigate the source detection problem in epidemiology, which is one of the most important issues for control of epidemics. Mathematically, we reformulate the problem as one of identifying the relevant component in a multivariate Gaussian mixture model. Focusing on the study of cholera and diseases with similar modes of transmission, we calibrate the parameters of our mixture model using human mobility networks within a stochastic, spatially explicit epidemiological model for waterborne disease. Furthermore, we adopt a Bayesian perspective, so that prior information on source location can be incorporated (e.g., reflecting the impact of local conditions). Posterior-based inference is performed, which permits estimates in the form of either individual locations or regions. Importantly, our estimator only requires first-arrival times of the epidemic by putative observers, typically located only at a small proportion of nodes. The proposed method is demonstrated within the context of the 2000-2002 cholera outbreak in the KwaZulu-Natal province of South Africa.

Submitted 30 October, 2020; originally announced November 2020.

arXiv:2004.04765 [pdf, other]

Bayesian classification, anomaly detection, and survival analysis using network inputs with application to the microbiome

Authors: Nathaniel Josephs, Lizhen Lin, Steven Rosenberg, Eric D. Kolaczyk

Abstract: While the study of a single network is well-established, technological advances now allow for the collection of multiple networks with relative ease. Increasingly, anywhere from several to thousands of networks can be created from brain imaging, gene co-expression data, or microbiome measurements. And these networks, in turn, are being looked to as potentially powerful features to be used in modeling. However, with networks being non-Euclidean in nature, how best to incorporate them into standard modeling tasks is not obvious. In this paper, we propose a Bayesian modeling framework that provides a unified approach to binary classification, anomaly detection, and survival analysis with network inputs. We encode the networks in the kernel of a Gaussian process prior via their pairwise differences and we discuss several choices of provably positive definite kernel that can be plugged into our models. Although our methods are widely applicable, we are motivated here in particular by microbiome research (where network analysis is emerging as the standard approach for capturing the interconnectedness of microbial taxa across both time and space) and its potential for reducing preterm delivery and improving personalization of prenatal care.

Submitted 13 January, 2021; v1 submitted 9 April, 2020; originally announced April 2020.

arXiv:2002.05763 [pdf, other]

Estimation of the Epidemic Branching Factor in Noisy Contact Networks

Authors: Wenrui Li, Daniel L. Sussman, Eric D. Kolaczyk

Abstract: Many fundamental concepts in network-based epidemic modeling depend on the branching factor, which captures a sense of dispersion in the network connectivity and quantifies the rate of spreading across the network. Moreover, contact network information generally is available only up to some level of error. We study the propagation of such errors to the estimation of the branching factor. Specifically, we characterize the impact of network noise on the bias and variance of the observed branching factor for arbitrary true networks, with examples in sparse, dense, homogeneous and inhomogeneous networks. In addition, we propose a method-of-moments estimator for the true branching factor. We illustrate the practical performance of our estimator through simulation studies and with contact networks observed in British secondary schools and a French hospital.

Submitted 12 October, 2020; v1 submitted 13 February, 2020; originally announced February 2020.

Comments: 44 pages, 4 figures

arXiv:1803.02488 [pdf, ps, other]

doi 10.1080/01621459.2020.1778482

Estimation of subgraph density in noisy networks

Authors: Jinyuan Chang, Eric D. Kolaczyk, Qiwei Yao

Abstract: While it is common practice in applied network analysis to report various standard network summary statistics, these numbers are rarely accompanied by uncertainty quantification. Yet any error inherent in the measurements underlying the construction of the network, or in the network construction procedure itself, necessarily must propagate to any summary statistics reported. Here we study the problem of estimating the density of an arbitrary subgraph, given a noisy version of some underlying network as data. Under a simple model of network error, we show that consistent estimation of such densities is impossible when the rates of error are unknown and only a single network is observed. Accordingly, we develop method-of-moments estimators of network subgraph densities and error rates for the case where a minimal number of network replicates are available. These estimators are shown to be asymptotically normal as the number of vertices increases to infinity. We also provide confidence intervals for quantifying the uncertainty in these estimates based on the asymptotic normality. To construct the confidence intervals, a new and non-standard bootstrap method is proposed to compute asymptotic variances, which is infeasible otherwise. We illustrate the proposed methods in the context of gene coexpression networks.

Submitted 30 June, 2020; v1 submitted 6 March, 2018; originally announced March 2018.

Journal ref: Journal of the American Statistical Association 2022, Vol. 117, No. 537, 361-374

arXiv:1712.08586 [pdf, ps, other]

Dynamic Networks with Multi-scale Temporal Structure

Authors: Xinyu Kang, Apratim Ganguly, Eric D. Kolaczyk

Abstract: We describe a novel method for modeling non-stationary multivariate time series, with time-varying conditional dependencies represented through dynamic networks. Our proposed approach combines traditional multi-scale modeling and network based neighborhood selection, aiming at capturing temporally local structure in the data while maintaining sparsity of the potential interactions. Our multi-scale framework is based on recursive dyadic partitioning, which recursively partitions the temporal axis into finer intervals and allows us to detect local network structural changes at varying temporal resolutions. The dynamic neighborhood selection is achieved through penalized likelihood estimation, where the penalty seeks to limit the number of neighbors used to model the data. We present theoretical and numerical results describing the performance of our method, which is motivated and illustrated using task-based magnetoencephalography (MEG) data in neuroscience.

Submitted 22 December, 2017; originally announced December 2017.
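
Illustrative note on the entry above: recursive dyadic partitioning of a time axis can be sketched in a few lines. This shows only the generic partitioning step, not the paper's penalized-likelihood neighborhood selection. The function name `dyadic_intervals` and the use of half-open integer index ranges are assumptions made for the example.

```python
def dyadic_intervals(start, end, max_depth):
    """Recursively halve the index range [start, end).

    Returns a list of levels; level d holds 2**d half-open intervals (a, b).
    Assumes end - start is divisible by 2**max_depth so that no interval
    at any level is empty.
    """
    levels = [[(start, end)]]
    for _ in range(max_depth):
        next_level = []
        for a, b in levels[-1]:
            mid = (a + b) // 2
            next_level.append((a, mid))
            next_level.append((mid, b))
        levels.append(next_level)
    return levels
```

For example, `dyadic_intervals(0, 8, 2)` yields the whole range at level 0, two halves at level 1, and four quarters at level 2, which is the nesting of coarse and fine temporal resolutions that a multi-scale model selects among.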

arXiv:1708.04018 [pdf, ps, other]

Approximation of the difference of two Poisson-like counts with Skellam

Authors: H. L. Gan, Eric D. Kolaczyk

Abstract: Poisson-like behavior for event count data is ubiquitous in nature. At the same time, differencing of such counts arises in the course of data processing in a variety of areas of application. As a result, the Skellam distribution -- defined as the distribution of the difference of two independent Poisson random variables -- is a natural candidate for approximating the difference of Poisson-like event counts. However, in many contexts strict independence, whether between counts or among events within counts, is not a tenable assumption. Here we characterize the accuracy in approximating the difference of Poisson-like counts by a Skellam random variable. Our results fully generalize existing, more limited results in this direction and, at the same time, our derivations are significantly more concise and elegant. We illustrate the potential impact of these results in the context of problems from network analysis and image processing, where various forms of weak dependence can be expected.

Submitted 2 April, 2018; v1 submitted 14 August, 2017; originally announced August 2017.

Comments: 17 pages; Minor revisions

MSC Class: 62E17; 60F05; 60J27
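
As a rough illustration of the Skellam distribution named in the abstract above (the difference of two independent Poisson counts), the following sketch evaluates its probability mass function by direct convolution of two Poisson pmfs. The function names, the truncation parameter `terms`, and the example rates are assumptions made for illustration; they do not come from the paper, which concerns the harder case of dependent counts.

```python
import math


def poisson_pmf(n: int, lam: float) -> float:
    """P(N = n) for N ~ Poisson(lam), lam > 0, computed in log space."""
    if n < 0:
        return 0.0
    return math.exp(-lam + n * math.log(lam) - math.lgamma(n + 1))


def skellam_pmf(k: int, lam1: float, lam2: float, terms: int = 200) -> float:
    """P(N1 - N2 = k) for independent N1 ~ Poisson(lam1), N2 ~ Poisson(lam2).

    Sums P(N1 = j + k) * P(N2 = j) over j. `terms` truncates the infinite
    sum (an illustrative choice, adequate for small rates).
    """
    start = max(0, -k)
    return sum(
        poisson_pmf(j + k, lam1) * poisson_pmf(j, lam2)
        for j in range(start, start + terms)
    )


if __name__ == "__main__":
    lam1, lam2 = 3.0, 2.0  # hypothetical rates, chosen only for the demo
    support = range(-60, 61)
    probs = [skellam_pmf(k, lam1, lam2) for k in support]
    mean = sum(k * p for k, p in zip(support, probs))
    var = sum((k - mean) ** 2 * p for k, p in zip(support, probs))
    print(f"total={sum(probs):.6f} mean={mean:.4f} var={var:.4f}")
```

For independent counts the printed mean and variance should be close to lam1 - lam2 and lam1 + lam2 respectively, which gives a quick sanity check on the truncation.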

arXiv:1510.03959 [pdf, other]

Detection of multiple perturbations in multi-omics biological networks

Authors: Paula J. Griffin, W. Evan Johnson, Eric D. Kolaczyk

Abstract: Cellular mechanism-of-action is of fundamental concern in many biological studies. It is of particular interest for identifying the cause of disease and learning the way in which treatments act against disease. However, pinpointing such mechanisms is difficult, due to the fact that small perturbations to the cell can have wide-ranging downstream effects. Given a snapshot of cellular activity, it can be challenging to tell where a disturbance originated. The presence of an ever-greater variety of high-throughput biological data offers an opportunity to examine cellular behavior from multiple angles, but also presents the statistical challenge of how to effectively analyze data from multiple sources. In this setting, we propose a method for mechanism-of-action inference by extending network filtering to multi-attribute data. We first estimate a joint Gaussian graphical model across multiple data types using penalized regression and filter for network effects. We then apply a set of likelihood ratio tests to identify the most likely site of the original perturbation. In addition, we propose a conditional testing procedure to allow for detection of multiple perturbations. We demonstrate this methodology on paired gene expression and methylation data from The Cancer Genome Atlas (TCGA).

Submitted 2 October, 2016; v1 submitted 13 October, 2015; originally announced October 2015.

Comments: Submitted to Biometrics

arXiv:1409.5640 [pdf, other]

On the Propagation of Low-Rate Measurement Error to Subgraph Counts in Large Networks

Authors: Prakash Balachandran, Eric D. Kolaczyk, Weston Viles

Abstract: Our work in this paper is inspired by a statistical observation that is both elementary and broadly relevant to network analysis in practice -- that the uncertainty in approximating some true network graph $G=(V,E)$ by some estimated graph $\hat{G}=(V,\hat{E})$ manifests as errors in the status of (non)edges that must necessarily propagate to any estimates of network summaries $η(G)$ we seek. Motivated by the common practice of using plug-in estimates $η(\hat{G})$ as proxies for $η(G)$, our focus is on the problem of characterizing the distribution of the discrepancy $D=η(\hat{G}) - η(G)$, in the case where $η(\cdot)$ is a subgraph count. Specifically, we study the fundamental case where the statistic of interest is $|E|$, the number of edges in $G$. Our primary contribution in this paper is to show that in the empirically relevant setting of large graphs with low-rate measurement errors, the distribution of $D_E=|\hat{E}| - |E|$ is well-characterized by a Skellam distribution, when the errors are independent or weakly dependent. Under an assumption of independent errors, we are able to further show conditions under which this characterization is strictly better than that of an appropriate normal distribution. These results derive from our formulation of a general result, quantifying the accuracy with which the difference of two sums of dependent Bernoulli random variables may be approximated by the difference of two independent Poisson random variables, i.e., by a Skellam distribution. This general result is developed through the use of Stein's method, and may be of some general interest. We finish with a discussion of possible extension of our work to subgraph counts $η(G)$ of higher order.

Submitted 7 October, 2016; v1 submitted 19 September, 2014; originally announced September 2014.

arXiv:1409.0503 [pdf, other]

Perturbation Detection Through Modeling of Gene Expression on a Latent Biological Pathway Network: A Bayesian hierarchical approach

Authors: Lisa M. Pham, Luis Carvalho, Scott Schaus, Eric D. Kolaczyk

Abstract: Cellular response to a perturbation is the result of a dynamic system of biological variables linked in a complex network. A major challenge in drug and disease studies is identifying the key factors of a biological network that are essential in determining the cell's fate. Here our goal is the identification of perturbed pathways from high-throughput gene expression data. We develop a three-level hierarchical model, where (i) the first level captures the relationship between gene expression and biological pathways using confirmatory factor analysis, (ii) the second level models the behavior within an underlying network of pathways induced by an unknown perturbation using a conditional autoregressive model, and (iii) the third level is a spike-and-slab prior on the perturbations. We then identify perturbations through posterior-based variable selection. We illustrate our approach using gene transcription drug perturbation profiles from the DREAM7 drug sensitivity prediction challenge data set. Our proposed method identified regulatory pathways that are known to play a causative role and that were not readily resolved using gene set enrichment analysis or exploratory factor models. Simulation results are presented assessing the performance of this model relative to a network-free variant and its robustness to inaccuracies in biological databases.

Submitted 1 September, 2014; originally announced September 2014.

arXiv:1407.5525 [pdf, other]

Hypothesis Testing For Network Data in Functional Neuroimaging

Authors: Cedric E. Ginestet, Jun Li, Prakash Balachandran, Steven Rosenberg, Eric D. Kolaczyk

Abstract: In recent years, it has become common practice in neuroscience to use networks to summarize relational information in a set of measurements, typically assumed to be reflective of either functional or structural relationships between regions of interest in the brain. One of the most basic tasks of interest in the analysis of such data is the testing of hypotheses, in answer to questions such as "Is there a difference between the networks of these two groups of subjects?" In the classical setting, where the unit of interest is a scalar or a vector, such questions are answered through the use of familiar two-sample testing strategies. Networks, however, are not Euclidean objects, and hence classical methods do not directly apply. We address this challenge by drawing on concepts and techniques from geometry and high-dimensional statistical inference. Our work is based on a precise geometric characterization of the space of graph Laplacian matrices and a nonparametric notion of averaging due to Fréchet. We motivate and illustrate our resulting methodologies for testing in the context of networks derived from functional neuroimaging data on human subjects from the 1000 Functional Connectomes Project. In particular, we show that this global test is more statistically powerful than a mass-univariate approach. In addition, we have also provided a method for visualizing the individual contribution of each edge to the overall test statistic.

Submitted 17 March, 2017; v1 submitted 21 July, 2014; originally announced July 2014.

Comments: 34 pages, 5 figures

arXiv:1401.3518 [pdf, other]

doi 10.1103/PhysRevE.93.052301

Percolation under Noise: Detecting Explosive Percolation Using the Second Largest Component

Authors: Wes Viles, Cedric E. Ginestet, Ariana Tang, Mark A. Kramer, Eric D. Kolaczyk

Abstract: We consider the problem of distinguishing classical (Erdős-Rényi) percolation from explosive (Achlioptas) percolation, under noise. A statistical model of percolation is constructed allowing for the birth and death of edges as well as the presence of noise in the observations. This graph-valued stochastic process is composed of a latent and an observed non-stationary process, where the observed graph process is corrupted by Type I and Type II errors. This produces a hidden Markov graph model. We show that for certain choices of parameters controlling the noise, the classical (ER) percolation is visually indistinguishable from the explosive (Achlioptas) percolation model. In this setting, we compare two different criteria for discriminating between these two percolation models, based on a quantile difference (QD) of the first component's size and on the maximal size of the second largest component. We show through data simulations that this second criterion outperforms the QD of the first component's size, in terms of discriminatory power. The maximal size of the second component therefore provides a useful statistic for distinguishing between the ER and Achlioptas models of percolation, under physically motivated conditions for the birth and death of edges, and under noise. The potential application of the proposed criteria for percolation detection in clinical neuroscience is also discussed.

Submitted 15 January, 2014; originally announced January 2014.

Comments: 9 pages and 8 figures. Submitted to Physics Review, Series E

Journal ref: Phys. Rev. E 93, 052301 (2016)

arXiv:1311.1450 [pdf, ps, other]

Exponential-type Inequalities Involving Ratios of the Modified Bessel Function of the First Kind and their Applications

Authors: Prakash Balachandran, Weston Viles, Eric D. Kolaczyk

Abstract: The modified Bessel function of the first kind, $I_ν(x)$, arises in numerous areas of study, such as physics, signal processing, probability, statistics, etc. As such, there has been much interest in recent years in deducing properties of functionals involving $I_ν(x)$, in particular, of the ratio ${I_{ν+1}(x)}/{I_ν(x)}$, when $ν,x\geq 0$. In this paper we establish sharp upper and lower bounds on $H(ν,x)=\sum_{k=1}^{\infty} {I_{ν+k}(x)}/{I_ν(x)}$ for $ν,x\geq 0$ that appears as the complementary cumulative hazard function for a Skellam$(λ,λ)$ probability distribution in the statistical analysis of networks. Our technique relies on bounding existing estimates of ${I_{ν+1}(x)}/{I_ν(x)}$ from above and below by quantities with nicer algebraic properties, namely exponentials, to better evaluate the sum, while optimizing their rates in the regime when $ν+1\leq x$ in order to maintain their precision. We demonstrate the relevance of our results through applications, providing an improvement for the well-known asymptotic $\exp(-x)I_ν(x)\sim {1}/{\sqrt{2πx}}$ as $x\rightarrow \infty$, upper and lower bounding $\mathbb{P}\left[W=ν\right]$ for $W\sim Skellam(λ_1,λ_2)$, and deriving a novel concentration inequality on the $Skellam(λ,λ)$ probability distribution from above and below.

Submitted 6 November, 2013; originally announced November 2013.

Comments: 18 pages, 3 figures
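
The asymptotic $\exp(-x)I_ν(x)\sim 1/\sqrt{2πx}$ quoted in the abstract above can be checked numerically. The sketch below is only an illustration, not the paper's method: it evaluates $I_ν(x)$ from its standard power series, and the function name `scaled_bessel_i`, the truncation `terms`, and the sample values of $x$ are assumptions chosen for the demo.

```python
import math


def scaled_bessel_i(nu: float, x: float, terms: int = 500) -> float:
    """Return exp(-x) * I_nu(x) for nu >= 0 and x > 0.

    Uses the power series I_nu(x) = sum_m (x/2)^(2m+nu) / (m! Gamma(m+nu+1)),
    with each term evaluated in log space to avoid overflow. `terms` is an
    illustrative truncation, adequate for moderate x (roughly x <= 200).
    """
    log_half_x = math.log(x / 2.0)
    return sum(
        math.exp(
            -x
            + (2 * m + nu) * log_half_x
            - math.lgamma(m + 1)
            - math.lgamma(m + nu + 1)
        )
        for m in range(terms)
    )


if __name__ == "__main__":
    nu = 0.0
    for x in (5.0, 20.0, 80.0):  # hypothetical sample points
        scaled = scaled_bessel_i(nu, x)
        # This product should approach 1 as x grows if the asymptotic holds.
        closeness = scaled * math.sqrt(2.0 * math.pi * x)
        # The ratio I_{nu+1}(x) / I_nu(x) studied in the paper.
        ratio = scaled_bessel_i(nu + 1.0, x) / scaled
        print(f"x={x:5.1f}  scaled*sqrt(2*pi*x)={closeness:.5f}  ratio={ratio:.5f}")
```

The first printed column drifting toward 1 reflects the leading-order asymptotic; the paper's bounds concern how sharply such quantities can be controlled, which this check does not reproduce.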

arXiv:1305.4977 [pdf, ps, other]

doi 10.1214/14-AOAS800

Estimating network degree distributions under sampling: An inverse problem, with applications to monitoring social media networks

Authors: Yaonan Zhang, Eric D. Kolaczyk, Bruce D. Spencer

Abstract: Networks are a popular tool for representing elements in a system and their interconnectedness. Many observed networks can be viewed as only samples of some true underlying network. Such is frequently the case, for example, in the monitoring and study of massive, online social networks. We study the problem of how to estimate the degree distribution - an object of fundamental interest - of a true underlying network from its sampled network. In particular, we show that this problem can be formulated as an inverse problem. Playing a key role in this formulation is a matrix relating the expectation of our sampled degree distribution to the true underlying degree distribution. Under many network sampling designs, this matrix can be defined entirely in terms of the design and is found to be ill-conditioned. As a result, our inverse problem frequently is ill-posed. Accordingly, we offer a constrained, penalized weighted least-squares approach to solving this problem. A Monte Carlo variant of Stein's unbiased risk estimation (SURE) is used to select the penalization parameter. We explore the behavior of our resulting estimator of network degree distribution in simulation, using a variety of combinations of network models and sampling regimes. In addition, we demonstrate the ability of our method to accurately reconstruct the degree distributions of various sub-communities within online social networks corresponding to Friendster, Orkut and LiveJournal. Overall, our results show that the true degree distributions from both homogeneous and inhomogeneous networks can be recovered with substantially greater accuracy than reflected in the empirical degree distribution resulting from the original sampling.

Submitted 28 May, 2015; v1 submitted 21 May, 2013; originally announced May 2013.

Comments: Published at http://dx.doi.org/10.1214/14-AOAS800 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org)

Report number: IMS-AOAS-AOAS800

Journal ref: Annals of Applied Statistics 2015, Vol. 9, No. 1, 166-199

arXiv:1204.2194 [pdf, ps, other]

Weighted Frechet Means as Convex Combinations in Metric Spaces: Properties and Generalized Median Inequalities

Authors: Cedric E. Ginestet, Andrew Simmons, Eric D. Kolaczyk

Abstract: In this short note, we study the properties of the weighted Frechet mean as a convex combination operator on an arbitrary metric space, (Y,d). We show that this binary operator is commutative, non-associative, idempotent, invariant to multiplication by a constant weight and possesses an identity element. We also treat the properties of the weighted cumulative Frechet mean. These tools allow us to derive several types of median inequalities for abstract metric spaces that hold for both negative and positive Alexandrov spaces. In particular, we show through an example that these bounds cannot be improved upon in general metric spaces. For weighted Frechet means, however, such inequalities can solely be derived for weights equal to or greater than one. This latter limitation highlights the inherent difficulties associated with working with abstract-valued random variables.

Submitted 12 June, 2012; v1 submitted 10 April, 2012; originally announced April 2012.

Comments: 7 pages, 1 figure. Submitted to Probability and Statistics Letters

arXiv:1112.0840 [pdf, ps, other]

doi 10.1214/14-STS502

On the Question of Effective Sample Size in Network Modeling: An Asymptotic Inquiry

Authors: Pavel N. Krivitsky, Eric D. Kolaczyk

Abstract: The modeling and analysis of networks and network data has seen an explosion of interest in recent years and represents an exciting direction for potential growth in statistics. Despite the already substantial amount of work done in this area to date by researchers from various disciplines, however, there remain many questions of a decidedly foundational nature - natural analogues of standard questions already posed and addressed in more classical areas of statistics - that have yet to even be posed, much less addressed. Here we raise and consider one such question in connection with network modeling. Specifically, we ask, "Given an observed network, what is the sample size?" Using simple, illustrative examples from the class of exponential random graph models, we show that the answer to this question can very much depend on basic properties of the networks expected under the model, as the number of vertices $n_V$ in the network grows. In particular, adopting the (asymptotic) scaling of the variance of the maximum likelihood parameter estimates as a notion of effective sample size ($n_{\mathrm{eff}}$), we show that when modeling the overall propensity to have ties and the propensity to reciprocate ties, whether the networks are sparse or not under the model (i.e., having a constant or an increasing number of ties per vertex, respectively) is sufficient to yield an order of magnitude difference in $n_{\mathrm{eff}}$, from $O(n_V)$ to $O(n^2_V)$. In addition, we report simulation study results that suggest similar properties for models for triadic (friend-of-a-friend) effects. We then explore some practical implications of this result, using both simulation and data on food-sharing from Lamalera, Indonesia.

Submitted 5 August, 2015; v1 submitted 5 December, 2011; originally announced December 2011.

Comments: Published at http://dx.doi.org/10.1214/14-STS502 in the Statistical Science (http://www.imstat.org/sts/) by the Institute of Mathematical Statistics (http://www.imstat.org)

Report number: IMS-STS-STS502

Journal ref: Statistical Science 2015, Vol. 30, No. 2, 184-198

arXiv:1109.4408 [pdf, other]

A Compressed PCA Subspace Method for Anomaly Detection in High-Dimensional Data

Authors: Qi Ding, Eric D. Kolaczyk

Abstract: Random projection is widely used as a method of dimension reduction. In recent years, its combination with standard techniques of regression and classification has been explored. Here we examine its use with principal component analysis (PCA) and subspace detection methods. Specifically, we show that, under appropriate conditions, with high probability the magnitude of the residuals of a PCA analysis of randomly projected data behaves comparably to that of the residuals of a similar PCA analysis of the original data. Our results indicate the feasibility of applying subspace-based anomaly detection algorithms to randomly projected data, when the data are high-dimensional but have a covariance of an appropriately compressed nature. We illustrate in the context of computer network traffic anomaly detection.

Submitted 11 April, 2012; v1 submitted 20 September, 2011; originally announced September 2011.

arXiv:1109.3160 [pdf, ps, other]

Inference and Characterization of Multi-Attribute Networks with Application to Computational Biology

Authors: Natallia Katenka, Eric D. Kolaczyk

Abstract: Our work is motivated by and illustrated with application of association networks in computational biology, specifically in the context of gene/protein regulatory networks. Association networks represent systems of interacting elements, where a link between two different elements indicates a sufficient level of similarity between element attributes. While in reality relational ties between elements can be expected to be based on similarity across multiple attributes, the vast majority of work to date on association networks involves ties defined with respect to only a single attribute. We propose an approach for the inference of multi-attribute association networks from measurements on continuous attribute variables, using canonical correlation and a hypothesis-testing strategy. Within this context, we then study the impact of partial information on multi-attribute network inference and characterization, when only a subset of attributes is available. We consider in detail the case of two attributes, wherein we examine through a combination of analytical and numerical techniques the implications of the choice and number of node attributes on the ability to detect network links and, more generally, to estimate higher-level network summary statistics, such as node degree, clustering coefficients, and measures of centrality. Illustration and applications throughout the paper are developed using gene and protein expression measurements on human cancer cell lines from the NCI-60 database.

Submitted 27 April, 2012; v1 submitted 14 September, 2011; originally announced September 2011.

Comments: Updated bibliography references

arXiv:0903.2210 [pdf, ps, other]

doi 10.1103/PhysRevE.79.061916

Network inference - with confidence - from multivariate time series

Authors: Mark A. Kramer, Uri T. Eden, Sydney S. Cash, Eric D. Kolaczyk

Abstract: Networks - collections of interacting elements or nodes - abound in the natural and manmade worlds. For many networks, complex spatiotemporal dynamics stem from patterns of physical interactions unknown to us. To infer these interactions, it is common to include edges between those nodes whose time series exhibit sufficient functional connectivity, typically defined as a measure of coupling exceeding a pre-determined threshold. However, when uncertainty exists in the original network measurements, uncertainty in the inferred network is likely, and hence a statistical propagation-of-error is needed. In this manuscript, we describe a principled and systematic procedure for the inference of functional connectivity networks from multivariate time series data. Our procedure yields as output both the inferred network and a quantification of uncertainty of the most fundamental interest: uncertainty in the number of edges. To illustrate this approach, we apply our procedure to simulated data and electrocorticogram data recorded from a human subject during an epileptic seizure. We demonstrate that the procedure is accurate and robust in both the determination of edges and the reporting of uncertainty associated with that determination.

Submitted 12 March, 2009; originally announced March 2009.

Comments: 12 pages, 7 figures (low resolution), submitted

arXiv:0902.3714 [pdf, ps, other]

Target Detection via Network Filtering

Authors: Shu Yang, Eric D. Kolaczyk

Abstract: A method of `network filtering' has been proposed recently to detect the effects of certain external perturbations on the interacting members in a network. However, with large networks, the goal of detection seems a priori difficult to achieve, especially since the number of observations available often is much smaller than the number of variables describing the effects of the underlying network. Under the assumption that the network possesses a certain sparsity property, we provide a formal characterization of the accuracy with which the external effects can be detected, using a network filtering system that combines Lasso regression in a sparse simultaneous equation model with simple residual analysis. We explore the implications of the technical conditions underlying our characterization, in the context of various network topologies, and we illustrate our method using simulated data.

Submitted 27 January, 2010; v1 submitted 20 February, 2009; originally announced February 2009.

arXiv:0709.3420 [pdf, ps, other]

Co-Betweenness: A Pairwise Notion of Centrality

Authors: Eric D. Kolaczyk, David B. Chua, Marc Barthelemy

Abstract: Betweenness centrality is a metric that seeks to quantify a sense of the importance of a vertex in a network graph in terms of its "control" on the distribution of information along geodesic paths throughout that network. This quantity however does not capture how different vertices participate together in such control. In order to allow for the uncovering of finer details in this regard, we introduce here an extension of betweenness centrality to pairs of vertices, which we term co-betweenness, that provides the basis for quantifying various analogous pairwise notions of importance and control. More specifically, we motivate and define a precise notion of co-betweenness, we present an efficient algorithm for its computation, extending the algorithm of Brandes in a natural manner, and we illustrate the utilization of this co-betweenness on a handful of different communication networks. From these real-world examples, we show that the co-betweenness allows one to identify certain vertices which are not the most central vertices but which, nevertheless, act as important actors in the relaying and dispatching of information in the network.

Submitted 21 September, 2007; originally announced September 2007.

Comments: 9 pages, 9 figures

Journal ref: Social Networks 31 (2009), pp. 190-203

arXiv:math/0510013 [pdf, ps, other]

Network Kriging

Authors: David B. Chua, Eric D. Kolaczyk, Mark Crovella

Abstract: Network service providers and customers are often concerned with aggregate performance measures that span multiple network paths. Unfortunately, forming such network-wide measures can be difficult, due to the issues of scale involved. In particular, the number of paths grows too rapidly with the number of endpoints to make exhaustive measurement practical. As a result, it is of interest to explore the feasibility of methods that dramatically reduce the number of paths measured in such situations while maintaining acceptable accuracy. We cast the problem as one of statistical prediction--in the spirit of the so-called `kriging' problem in spatial statistics--and show that end-to-end network properties may be accurately predicted in many cases using a surprisingly small set of carefully chosen paths. More precisely, we formulate a general framework for the prediction problem, propose a class of linear predictors for standard quantities of interest (e.g., averages, totals, differences) and show that linear algebraic methods of subset selection may be used to effectively choose which paths to measure. We characterize the performance of the resulting methods, both analytically and numerically. The success of our methods derives from the low effective rank of routing matrices as encountered in practice, which appears to be a new observation in its own right with potentially broad implications on network measurement generally.

Submitted 3 October, 2005; v1 submitted 1 October, 2005; originally announced October 2005.

Comments: 16 pages, 9 figures, single-spaced

arXiv:cs/0510007 [pdf, ps, other]

doi 10.1103/PhysRevE.75.056111

Network Inference from TraceRoute Measurements: Internet Topology `Species'

Authors: Fabien Viger, Alain Barrat, Luca Dall'Asta, Cun-Hui Zhang, Eric D. Kolaczyk

Abstract: Internet mapping projects generally consist in sampling the network from a limited set of sources by using traceroute probes. This methodology, akin to the merging of spanning trees from the different sources to a set of destinations, leads necessarily to a partial, incomplete map of the Internet. Accordingly, determination of Internet topology characteristics from such sampled maps is in part a problem of statistical inference. Our contribution begins with the observation that the inference of many of the most basic topological quantities -- including network size and degree characteristics -- from traceroute measurements is in fact a version of the so-called `species problem' in statistics. This observation has important implications, since species problems are often quite challenging. We focus here on the most fundamental example of a traceroute internet species: the number of nodes in a network. Specifically, we characterize the difficulty of estimating this quantity through a set of analytical arguments, we use statistical subsampling principles to derive two proposed estimators, and we illustrate the performance of these estimators on networks with various topological characteristics.

Submitted 3 October, 2005; originally announced October 2005.

Journal ref: Phys. Rev. E 75 (2007) 056111

arXiv:cs/0412037 [pdf, ps, other]

A Statistical Framework for Efficient Monitoring of End-to-End Network Properties

Authors: David B. Chua, Eric D. Kolaczyk, Mark Crovella

Abstract: Network service providers and customers are often concerned with aggregate performance measures that span multiple network paths. Unfortunately, forming such network-wide measures can be difficult, due to the issues of scale involved. In particular, the number of paths grows too rapidly with the number of endpoints to make exhaustive measurement practical. As a result, there is interest in the feasibility of methods that dramatically reduce the number of paths measured in such situations while maintaining acceptable accuracy. In previous work we proposed a statistical framework to efficiently address this problem, in the context of additive metrics such as delay and loss rate, for which the per-path metric is a sum of (possibly transformed) per-link measures. The key to our method lies in the observation and exploitation of significant redundancy in network paths (sharing of common links). In this paper we make three contributions: (1) we generalize the framework to make it more immediately applicable to network measurements encountered in practice; (2) we demonstrate that the observed path redundancy upon which our method is based is robust to variation in key network conditions and characteristics, including link failures; and (3) we show how the framework may be applied to address three practical problems of interest to network providers and customers, using data from an operating network. In particular, we show how appropriate selection of small sets of path measurements can be used to accurately estimate network-wide averages of path delays, to reliably detect network anomalies, and to effectively make a choice between alternative sub-networks, as a customer choosing between two providers or two ingress points into a provider network.

Submitted 8 December, 2004; v1 submitted 8 December, 2004; originally announced December 2004.

Comments: 20 pages, 18 figures

arXiv:math/0406424 [pdf, ps, other]

doi 10.1214/009053604000000076

Multiscale likelihood analysis and complexity penalized estimation

Authors: Eric D. Kolaczyk, Robert D. Nowak

Abstract: We describe here a framework for a certain class of multiscale likelihood factorizations wherein, in analogy to a wavelet decomposition of an L^2 function, a given likelihood function has an alternative representation as a product of conditional densities reflecting information in both the data and the parameter vector localized in position and scale. The framework is developed as a set of sufficient conditions for the existence of such factorizations, formulated in analogy to those underlying a standard multiresolution analysis for wavelets, and hence can be viewed as a multiresolution analysis for likelihoods. We then consider the use of these factorizations in the task of nonparametric, complexity penalized likelihood estimation. We study the risk properties of certain thresholding and partitioning estimators, and demonstrate their adaptivity and near-optimality, in a minimax sense over a broad range of function spaces, based on squared Hellinger distance as a loss function. In particular, our results provide an illustration of how properties of classical wavelet-based estimators can be obtained in a single, unified framework that includes models for continuous, count and categorical data types.

Submitted 22 June, 2004; originally announced June 2004.

Report number: IMS-AOS-AOS140 MSC Class: 62C20; 62G05 (Primary) 60E05 (Secondary)

Journal ref: Annals of Statistics 2004, Vol. 32, No. 2, 500-527

arXiv:astro-ph/9803237 [pdf, ps, other]

doi 10.1016/S1384-1076(98)00024-4

Evidence for a Galactic gamma ray halo

Authors: D. D. Dixon, D. H. Hartmann, E. D. Kolaczyk, J. Samimi, R. Diehl, G. Kanbach, H. Mayer-Hasselwander, A. W. Strong

Abstract: We present quantitative statistical evidence for a $γ$-ray emission halo surrounding the Galaxy. Maps of the emission are derived. EGRET data were analyzed in a wavelet-based non-parametric hypothesis testing framework, using a model of expected diffuse (Galactic + isotropic) emission as a null hypothesis. The results show a statistically significant large scale halo surrounding the center of the Milky Way as seen from Earth. The halo flux at high latitudes is somewhat smaller than the isotropic gamma-ray flux at the same energy, though of the same order (O(10^(-7)--10^(-6)) ph/cm^2/s/sr above 1 GeV).

Submitted 19 August, 1998; v1 submitted 19 March, 1998; originally announced March 1998.

Comments: Final version accepted for publication in New Astronomy. Some additional results/discussion included, along with entirely revised figures. 19 pages, 15 figures, AASTeX. Better quality figs (PS and JPEG) are available at http://tigre.ucr.edu/halo/paper.html

Report number: UCRHEA 0398-01

Journal ref: New Astron. 3 (1998) 539

arXiv:astro-ph/9709029 [pdf, ps, other]

doi 10.1063/1.54172

Evidence for GeV emission from the Galactic Center Fountain

Authors: D. H. Hartmann, D. D. Dixon, E. D. Kolaczyk, J. Samimi

Abstract: The region near the Galactic center may have experienced recurrent episodes of injection of energy in excess of $\sim$ 10$^{55}$ ergs due to repeated starbursts involving more than $\sim$ 10$^4$ supernovae. This hypothesis can be tested by measurements of $γ$-ray lines produced by the decay of radioactive isotopes and positron annihilation, or by searches for pulsars produced during starbursts. Recent OSSE observations of 511 keV emission extending above the Galactic center led to the suggestion of a starburst driven fountain from the Galactic center. We present EGRET observations that might support this picture.

Submitted 3 September, 1997; originally announced September 1997.

Comments: 5 pages, 1 embedded Postscript figure. To appear in the Proceedings of the Fourth Compton Symposium

Showing 1–37 of 37 results for author: Kolaczyk, E D