-
Autoregressive Networks with Dependent Edges
Authors:
Jinyuan Chang,
Qin Fang,
Eric D. Kolaczyk,
Peter W. MacDonald,
Qiwei Yao
Abstract:
We propose an autoregressive framework for modelling dynamic networks with dependent edges. It encompasses the models which accommodate, for example, transitivity, density-dependent and other stylized features often observed in real network data. By assuming the edges of network at each time are independent conditionally on their lagged values, the models, which exhibit a close connection with temporal ERGMs, facilitate both simulation and the maximum likelihood estimation in the straightforward manner. Due to the possible large number of parameters in the models, the initial MLEs may suffer from slow convergence rates. An improved estimator for each component parameter is proposed based on an iteration based on the projection which mitigates the impact of the other parameters (Chang et al., 2021, 2023). Based on a martingale difference structure, the asymptotic distribution of the improved estimator is derived without the stationarity assumption. The limiting distribution is not normal in general, and it reduces to normal when the underlying process satisfies some mixing conditions. Illustration with a transitivity model was carried out in both simulation and a real network data set.
Submitted 24 April, 2024;
originally announced April 2024.
-
Edge differentially private estimation in the $β$-model via jittering and method of moments
Authors:
Jinyuan Chang,
Qiao Hu,
Eric D. Kolaczyk,
Qiwei Yao,
Fengting Yi
Abstract:
A standing challenge in data privacy is the trade-off between the level of privacy and the efficiency of statistical inference. Here we conduct an in-depth study of this trade-off for parameter estimation in the $β$-model (Chatterjee, Diaconis and Sly, 2011) for edge differentially private network data released via jittering (Karwa, Krivitsky and Slavković, 2017). Unlike most previous approaches based on maximum likelihood estimation for this network model, we proceed via method-of-moments. This choice facilitates our exploration of a substantially broader range of privacy levels - corresponding to stricter privacy - than has been to date. Over this new range we discover our proposed estimator for the parameters exhibits an interesting phase transition, with both its convergence rate and asymptotic variance following one of three different regimes of behavior depending on the level of privacy. Because identification of the operable regime is difficult if not impossible in practice, we devise a novel adaptive bootstrap procedure to construct uniform inference across different phases. In fact, leveraging this bootstrap we are able to provide for simultaneous inference of all parameters in the $β$-model (i.e., equal to the number of nodes), which, to our best knowledge, is the first result of its kind. Numerical experiments confirm the competitive and reliable finite sample performance of the proposed inference methods, next to a comparable maximum likelihood method, as well as significant advantages in terms of computational speed and memory.
Submitted 2 April, 2024; v1 submitted 19 December, 2021;
originally announced December 2021.
-
A spectral-based framework for hypothesis testing in populations of networks
Authors:
Li Chen,
Nathaniel Josephs,
Lizhen Lin,
Jie Zhou,
Eric D. Kolaczyk
Abstract:
In this paper, we propose a new spectral-based approach to hypothesis testing for populations of networks. The primary goal is to develop a test to determine whether two given samples of networks come from the same random model or distribution. Our test statistic is based on the trace of the third order for a centered and scaled adjacency matrix, which we prove converges to the standard normal distribution as the number of nodes tends to infinity. The asymptotic power guarantee of the test is also provided. The proper interplay between the number of networks and the number of nodes for each network is explored in characterizing the theoretical properties of the proposed testing statistics. Our tests are applicable to both binary and weighted networks, operate under a very general framework where the networks are allowed to be large and sparse, and can be extended to multiple-sample testing. We provide an extensive simulation study to demonstrate the superior performance of our test over existing methods and apply our test to three real datasets.
Submitted 24 November, 2020;
originally announced November 2020.
-
Averages of Unlabeled Networks: Geometric Characterization and Asymptotic Behavior
Authors:
Eric Kolaczyk,
Lizhen Lin,
Steven Rosenberg,
Jie Xu,
Jackson Walters
Abstract:
It is becoming increasingly common to see large collections of network data objects -- that is, data sets in which a network is viewed as a fundamental unit of observation. As a result, there is a pressing need to develop network-based analogues of even many of the most basic tools already standard for scalar and vector data. In this paper, our focus is on averages of unlabeled, undirected networks with edge weights. Specifically, we (i) characterize a certain notion of the space of all such networks, (ii) describe key topological and geometric properties of this space relevant to doing probability and statistics thereupon, and (iii) use these properties to establish the asymptotic behavior of a generalized notion of an empirical mean under sampling from a distribution supported on this space. Our results rely on a combination of tools from geometry, probability theory, and statistical shape analysis. In particular, the lack of vertex labeling necessitates working with a quotient space modding out permutations of labels. This results in a nontrivial geometry for the space of unlabeled networks, which in turn is found to have important implications on the types of probabilistic and statistical results that may be obtained and the techniques needed to obtain them.
Submitted 7 February, 2019; v1 submitted 8 September, 2017;
originally announced September 2017.
-
Approximation of the difference of two Poisson-like counts with Skellam
Authors:
H. L. Gan,
Eric D. Kolaczyk
Abstract:
Poisson-like behavior for event count data is ubiquitous in nature. At the same time, differencing of such counts arises in the course of data processing in a variety of areas of application. As a result, the Skellam distribution -- defined as the distribution of the difference of two independent Poisson random variables -- is a natural candidate for approximating the difference of Poisson-like event counts. However, in many contexts strict independence, whether between counts or among events within counts, is not a tenable assumption. Here we characterize the accuracy in approximating the difference of Poisson-like counts by a Skellam random variable. Our results fully generalize existing, more limited results in this direction and, at the same time, our derivations are significantly more concise and elegant. We illustrate the potential impact of these results in the context of problems from network analysis and image processing, where various forms of weak dependence can be expected.
Submitted 2 April, 2018; v1 submitted 14 August, 2017;
originally announced August 2017.
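The abstract above defines the Skellam distribution as the distribution of the difference of two independent Poisson random variables. As a minimal sketch of that definition only (not of the paper's approximation bounds), the following computes the probability mass function by direct summation and checks that the mean and variance come out close to `lam1 - lam2` and `lam1 + lam2`. The rates, the truncation point `terms`, and the function names are illustrative assumptions.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """P[N = k] for N ~ Poisson(lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)


def skellam_pmf(d: int, lam1: float, lam2: float, terms: int = 100) -> float:
    """P[N1 - N2 = d] for independent N1 ~ Poisson(lam1), N2 ~ Poisson(lam2).

    Sums P[N1 = d + j] * P[N2 = j] over j, truncated at `terms`
    (an assumption that is adequate for small rates such as those below).
    """
    total = 0.0
    for j in range(max(0, -d), terms):
        total += poisson_pmf(d + j, lam1) * poisson_pmf(j, lam2)
    return total


# Hypothetical rates chosen for illustration.
lam1, lam2 = 3.0, 2.0
probs = {d: skellam_pmf(d, lam1, lam2) for d in range(-40, 41)}

total_mass = sum(probs.values())
mean = sum(d * p for d, p in probs.items())
var = sum((d - mean) ** 2 * p for d, p in probs.items())

# Expect values close to 1, lam1 - lam2, and lam1 + lam2 respectively.
print(total_mass, mean, var)
```

Direct summation is used here to keep the sketch dependency-free; for larger rates a library implementation would be the more sensible choice.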
-
On the Propagation of Low-Rate Measurement Error to Subgraph Counts in Large Networks
Authors:
Prakash Balachandran,
Eric D. Kolaczyk,
Weston Viles
Abstract:
Our work in this paper is inspired by a statistical observation that is both elementary and broadly relevant to network analysis in practice -- that the uncertainty in approximating some true network graph $G=(V,E)$ by some estimated graph $\hat{G}=(V,\hat{E})$ manifests as errors in the status of (non)edges that must necessarily propagate to any estimates of network summaries $η(G)$ we seek. Motivated by the common practice of using plug-in estimates $η(\hat{G})$ as proxies for $η(G)$, our focus is on the problem of characterizing the distribution of the discrepancy $D=η(\hat{G}) - η(G)$, in the case where $η(\cdot)$ is a subgraph count. Specifically, we study the fundamental case where the statistic of interest is $|E|$, the number of edges in $G$. Our primary contribution in this paper is to show that in the empirically relevant setting of large graphs with low-rate measurement errors, the distribution of $D_E=|\hat{E}| - |E|$ is well-characterized by a Skellam distribution, when the errors are independent or weakly dependent. Under an assumption of independent errors, we are able to further show conditions under which this characterization is strictly better than that of an appropriate normal distribution. These results derive from our formulation of a general result, quantifying the accuracy with which the difference of two sums of dependent Bernoulli random variables may be approximated by the difference of two independent Poisson random variables, i.e., by a Skellam distribution. This general result is developed through the use of Stein's method, and may be of some general interest. We finish with a discussion of possible extension of our work to subgraph counts $η(G)$ of higher order.
Submitted 7 October, 2016; v1 submitted 19 September, 2014;
originally announced September 2014.
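The edge-count discrepancy $D_E=|\hat{E}| - |E|$ from the abstract above equals the number of falsely recorded edges minus the number of missed edges. The simulation below is a rough sketch under the independent-error assumption only, comparing the empirical mean and variance of $D_E$ with those of a Skellam distribution with rates `alpha * n_nonedges` and `beta * n_edges`. The graph sizes, the error rates `alpha` and `beta`, and the function name are hypothetical, and the graph is kept small so the example runs quickly.

```python
import random


def simulate_edge_count_discrepancy(n_edges, n_nonedges, alpha, beta, rng):
    """One draw of D_E = |E_hat| - |E| under independent edge errors.

    alpha: probability a true non-edge is recorded as an edge (false positive).
    beta:  probability a true edge is missed (false negative).
    """
    false_pos = sum(rng.random() < alpha for _ in range(n_nonedges))
    false_neg = sum(rng.random() < beta for _ in range(n_edges))
    return false_pos - false_neg


rng = random.Random(0)

# Hypothetical sizes and low error rates, chosen for illustration only.
n_edges, n_nonedges = 500, 2000
alpha, beta = 0.001, 0.006

draws = [
    simulate_edge_count_discrepancy(n_edges, n_nonedges, alpha, beta, rng)
    for _ in range(1000)
]
emp_mean = sum(draws) / len(draws)
emp_var = sum((d - emp_mean) ** 2 for d in draws) / (len(draws) - 1)

# Moments of the approximating Skellam(lam1, lam2) distribution.
lam1, lam2 = alpha * n_nonedges, beta * n_edges
skellam_mean, skellam_var = lam1 - lam2, lam1 + lam2

print(emp_mean, skellam_mean)
print(emp_var, skellam_var)
```

Matching the first two moments is only a sanity check; the paper's results concern the accuracy of the full distributional approximation, which this sketch does not measure.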
-
Exponential-type Inequalities Involving Ratios of the Modified Bessel Function of the First Kind and their Applications
Authors:
Prakash Balachandran,
Weston Viles,
Eric D. Kolaczyk
Abstract:
The modified Bessel function of the first kind, $I_ν(x)$, arises in numerous areas of study, such as physics, signal processing, probability, statistics, etc. As such, there has been much interest in recent years in deducing properties of functionals involving $I_ν(x)$, in particular, of the ratio ${I_{ν+1}(x)}/{I_ν(x)}$, when $ν,x\geq 0$. In this paper we establish sharp upper and lower bounds on $H(ν,x)=\sum_{k=1}^{\infty} {I_{ν+k}(x)}/{I_ν(x)}$ for $ν,x\geq 0$ that appears as the complementary cumulative hazard function for a Skellam$(λ,λ)$ probability distribution in the statistical analysis of networks. Our technique relies on bounding existing estimates of ${I_{ν+1}(x)}/{I_ν(x)}$ from above and below by quantities with nicer algebraic properties, namely exponentials, to better evaluate the sum, while optimizing their rates in the regime when $ν+1\leq x$ in order to maintain their precision. We demonstrate the relevance of our results through applications, providing an improvement for the well-known asymptotic $\exp(-x)I_ν(x)\sim {1}/{\sqrt{2πx}}$ as $x\rightarrow \infty$, upper and lower bounding $\mathbb{P}\left[W=ν\right]$ for $W\sim Skellam(λ_1,λ_2)$, and deriving a novel concentration inequality on the $Skellam(λ,λ)$ probability distribution from above and below.
Submitted 6 November, 2013;
originally announced November 2013.
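The abstract above cites the asymptotic $\exp(-x)I_ν(x)\sim {1}/{\sqrt{2πx}}$ and the ratio ${I_{ν+1}(x)}/{I_ν(x)}$. As a small numerical illustration of those two quantities (not of the paper's bounds), the sketch below evaluates $I_ν(x)$ from its standard power series and prints the rescaled value $\exp(-x)I_ν(x)\sqrt{2πx}$, which should be close to 1 for large $x$. The function name `bessel_i`, the stopping tolerance, and the test point `x = 50` are assumptions made for the example.

```python
import math


def bessel_i(nu: float, x: float) -> float:
    """Modified Bessel function of the first kind, I_nu(x), for nu >= 0, x > 0.

    Uses the power series
        I_nu(x) = sum_{k>=0} (x/2)^(2k+nu) / (k! * Gamma(k+nu+1)),
    updating each term from the previous one to avoid large factorials.
    """
    half_x_sq = (x / 2.0) ** 2
    term = (x / 2.0) ** nu / math.gamma(nu + 1.0)
    total = term
    k = 0
    # All terms are positive; stop once they no longer change the sum.
    while term > 1e-17 * total:
        k += 1
        term *= half_x_sq / (k * (k + nu))
        total += term
    return total


x = 50.0  # hypothetical test point, large enough for the asymptotic to be visible
i0 = bessel_i(0.0, x)
i1 = bessel_i(1.0, x)

scaled = math.exp(-x) * i0 * math.sqrt(2.0 * math.pi * x)
ratio = i1 / i0

print(scaled)  # close to 1 for large x
print(ratio)   # lies strictly between 0 and 1
```

The series is adequate for moderate arguments such as this one; for very large $x$ the individual terms overflow, and an exponentially scaled library routine would be needed instead.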
-
Inference of Network Summary Statistics Through Network Denoising
Authors:
Prakash Balachandran,
Edoardo Airoldi,
Eric Kolaczyk
Abstract:
Consider observing an undirected network that is `noisy' in the sense that there are Type I and Type II errors in the observation of edges. Such errors can arise, for example, in the context of inferring gene regulatory networks in genomics or functional connectivity networks in neuroscience. Given a single observed network then, to what extent are summary statistics for that network representative of their analogues for the true underlying network? Can we infer such statistics more accurately by taking into account the noise in the observed network edges?
In this paper, we answer both of these questions. In particular, we develop a spectral-based methodology using the adjacency matrix to `denoise' the observed network data and produce more accurate inference of the summary statistics of the true network. We characterize performance of our methodology through bounds on appropriate notions of risk in the $L^2$ sense, and conclude by illustrating the practical impact of this work on synthetic and real-world data.
Submitted 31 December, 2013; v1 submitted 1 October, 2013;
originally announced October 2013.
-
Weighted Frechet Means as Convex Combinations in Metric Spaces: Properties and Generalized Median Inequalities
Authors:
Cedric E. Ginestet,
Andrew Simmons,
Eric D. Kolaczyk
Abstract:
In this short note, we study the properties of the weighted Frechet mean as a convex combination operator on an arbitrary metric space, (Y,d). We show that this binary operator is commutative, non-associative, idempotent, invariant to multiplication by a constant weight and possesses an identity element. We also treat the properties of the weighted cumulative Frechet mean. These tools allow us to derive several types of median inequalities for abstract metric spaces that hold for both negative and positive Alexandrov spaces. In particular, we show through an example that these bounds cannot be improved upon in general metric spaces. For weighted Frechet means, however, such inequalities can solely be derived for weights equal or greater than one. This latter limitation highlights the inherent difficulties associated with working with abstract-valued random variables.
Submitted 12 June, 2012; v1 submitted 10 April, 2012;
originally announced April 2012.
-
On the Question of Effective Sample Size in Network Modeling: An Asymptotic Inquiry
Authors:
Pavel N. Krivitsky,
Eric D. Kolaczyk
Abstract:
The modeling and analysis of networks and network data has seen an explosion of interest in recent years and represents an exciting direction for potential growth in statistics. Despite the already substantial amount of work done in this area to date by researchers from various disciplines, however, there remain many questions of a decidedly foundational nature - natural analogues of standard questions already posed and addressed in more classical areas of statistics - that have yet to even be posed, much less addressed. Here we raise and consider one such question in connection with network modeling. Specifically, we ask, "Given an observed network, what is the sample size?" Using simple, illustrative examples from the class of exponential random graph models, we show that the answer to this question can very much depend on basic properties of the networks expected under the model, as the number of vertices $n_V$ in the network grows. In particular, adopting the (asymptotic) scaling of the variance of the maximum likelihood parameter estimates as a notion of effective sample size ($n_{\mathrm{eff}}$), we show that when modeling the overall propensity to have ties and the propensity to reciprocate ties, whether the networks are sparse or not under the model (i.e., having a constant or an increasing number of ties per vertex, respectively) is sufficient to yield an order of magnitude difference in $n_{\mathrm{eff}}$, from $O(n_V)$ to $O(n^2_V)$. In addition, we report simulation study results that suggest similar properties for models for triadic (friend-of-a-friend) effects. We then explore some practical implications of this result, using both simulation and data on food-sharing from Lamalera, Indonesia.
Submitted 5 August, 2015; v1 submitted 5 December, 2011;
originally announced December 2011.
-
Network Kriging
Authors:
David B. Chua,
Eric D. Kolaczyk,
Mark Crovella
Abstract:
Network service providers and customers are often concerned with aggregate performance measures that span multiple network paths. Unfortunately, forming such network-wide measures can be difficult, due to the issues of scale involved. In particular, the number of paths grows too rapidly with the number of endpoints to make exhaustive measurement practical. As a result, it is of interest to explore the feasibility of methods that dramatically reduce the number of paths measured in such situations while maintaining acceptable accuracy.
We cast the problem as one of statistical prediction--in the spirit of the so-called `kriging' problem in spatial statistics--and show that end-to-end network properties may be accurately predicted in many cases using a surprisingly small set of carefully chosen paths. More precisely, we formulate a general framework for the prediction problem, propose a class of linear predictors for standard quantities of interest (e.g., averages, totals, differences) and show that linear algebraic methods of subset selection may be used to effectively choose which paths to measure. We characterize the performance of the resulting methods, both analytically and numerically. The success of our methods derives from the low effective rank of routing matrices as encountered in practice, which appears to be a new observation in its own right with potentially broad implications on network measurement generally.
Submitted 3 October, 2005; v1 submitted 1 October, 2005;
originally announced October 2005.
-
Network Inference from TraceRoute Measurements: Internet Topology `Species'
Authors:
Fabien Viger,
Alain Barrat,
Luca Dall'Asta,
Cun-Hui Zhang,
Eric D. Kolaczyk
Abstract:
Internet mapping projects generally consist in sampling the network from a limited set of sources by using traceroute probes. This methodology, akin to the merging of spanning trees from the different sources to a set of destinations, leads necessarily to a partial, incomplete map of the Internet. Accordingly, determination of Internet topology characteristics from such sampled maps is in part a problem of statistical inference. Our contribution begins with the observation that the inference of many of the most basic topological quantities -- including network size and degree characteristics -- from traceroute measurements is in fact a version of the so-called `species problem' in statistics. This observation has important implications, since species problems are often quite challenging. We focus here on the most fundamental example of a traceroute internet species: the number of nodes in a network. Specifically, we characterize the difficulty of estimating this quantity through a set of analytical arguments, we use statistical subsampling principles to derive two proposed estimators, and we illustrate the performance of these estimators on networks with various topological characteristics.
Submitted 3 October, 2005;
originally announced October 2005.
-
A Statistical Framework for Efficient Monitoring of End-to-End Network Properties
Authors:
David B. Chua,
Eric D. Kolaczyk,
Mark Crovella
Abstract:
Network service providers and customers are often concerned with aggregate performance measures that span multiple network paths. Unfortunately, forming such network-wide measures can be difficult, due to the issues of scale involved. In particular, the number of paths grows too rapidly with the number of endpoints to make exhaustive measurement practical. As a result, there is interest in the feasibility of methods that dramatically reduce the number of paths measured in such situations while maintaining acceptable accuracy.
In previous work we proposed a statistical framework to efficiently address this problem, in the context of additive metrics such as delay and loss rate, for which the per-path metric is a sum of (possibly transformed) per-link measures. The key to our method lies in the observation and exploitation of significant redundancy in network paths (sharing of common links).
In this paper we make three contributions: (1) we generalize the framework to make it more immediately applicable to network measurements encountered in practice; (2) we demonstrate that the observed path redundancy upon which our method is based is robust to variation in key network conditions and characteristics, including link failures; and (3) we show how the framework may be applied to address three practical problems of interest to network providers and customers, using data from an operating network. In particular, we show how appropriate selection of small sets of path measurements can be used to accurately estimate network-wide averages of path delays, to reliably detect network anomalies, and to effectively make a choice between alternative sub-networks, as a customer choosing between two providers or two ingress points into a provider network.
Submitted 8 December, 2004; v1 submitted 8 December, 2004;
originally announced December 2004.
-
Multiscale likelihood analysis and complexity penalized estimation
Authors:
Eric D. Kolaczyk,
Robert D. Nowak
Abstract:
We describe here a framework for a certain class of multiscale likelihood factorizations wherein, in analogy to a wavelet decomposition of an L^2 function, a given likelihood function has an alternative representation as a product of conditional densities reflecting information in both the data and the parameter vector localized in position and scale. The framework is developed as a set of sufficient conditions for the existence of such factorizations, formulated in analogy to those underlying a standard multiresolution analysis for wavelets, and hence can be viewed as a multiresolution analysis for likelihoods. We then consider the use of these factorizations in the task of nonparametric, complexity penalized likelihood estimation. We study the risk properties of certain thresholding and partitioning estimators, and demonstrate their adaptivity and near-optimality, in a minimax sense over a broad range of function spaces, based on squared Hellinger distance as a loss function. In particular, our results provide an illustration of how properties of classical wavelet-based estimators can be obtained in a single, unified framework that includes models for continuous, count and categorical data types.
Submitted 22 June, 2004;
originally announced June 2004.