-
Statistical Inference in a Directed Network Model with Covariates
Authors:
Ting Yan,
Binyan Jiang,
Stephen E. Fienberg,
Chenlei Leng
Abstract:
Networks are often characterized by node heterogeneity for which nodes exhibit different degrees of interaction and link homophily for which nodes sharing common features tend to associate with each other. In this paper, we propose a new directed network model to capture the former via node-specific parametrization and the latter by incorporating covariates. In particular, this model quantifies the extent of heterogeneity in terms of outgoingness and incomingness of each node by different parameters, thus allowing the number of heterogeneity parameters to be twice the number of nodes. We study the maximum likelihood estimation of the model and establish the uniform consistency and asymptotic normality of the resulting estimators. Numerical studies demonstrate our theoretical findings and a data analysis confirms the usefulness of our model.
Submitted 10 March, 2018; v1 submitted 15 September, 2016;
originally announced September 2016.
-
Dynamic Question Ordering in Online Surveys
Authors:
Kirstin Early,
Jennifer Mankoff,
Stephen E. Fienberg
Abstract:
Online surveys have the potential to support adaptive questions, where later questions depend on earlier responses. Past work has taken a rule-based approach, uniformly across all respondents. We envision a richer interpretation of adaptive questions, which we call dynamic question ordering (DQO), where question order is personalized. Such an approach could increase engagement, and therefore response rate, as well as imputation quality. We present a DQO framework to improve survey completion and imputation. In the general survey-taking setting, we want to maximize survey completion, and so we focus on ordering questions to engage the respondent and collect hopefully all information, or at least the information that most characterizes the respondent, for accurate imputations. In another scenario, our goal is to provide a personalized prediction. Since it is possible to give reasonable predictions with only a subset of questions, we are not concerned with motivating users to answer all questions. Instead, we want to order questions to get information that reduces prediction uncertainty, while not being too burdensome. We illustrate this framework with an example of providing energy estimates to prospective tenants. We also discuss DQO for national surveys and consider connections between our statistics-based question-ordering approach and cognitive survey methodology.
Submitted 14 July, 2016;
originally announced July 2016.
-
On-Average KL-Privacy and its equivalence to Generalization for Max-Entropy Mechanisms
Authors:
Yu-Xiang Wang,
Jing Lei,
Stephen E. Fienberg
Abstract:
We define On-Average KL-Privacy and present its properties and connections to differential privacy, generalization and information-theoretic quantities including max-information and mutual information. The new definition significantly weakens differential privacy, while preserving its minimalistic design features such as composition over small group and multiple queries as well as closeness to post-processing. Moreover, we show that On-Average KL-Privacy is **equivalent** to generalization for a large class of commonly-used tools in statistics and machine learning that sample from Gibbs distributions---a class of distributions that arises naturally from the maximum entropy principle. In addition, a byproduct of our analysis yields a lower bound for generalization error in terms of mutual information which reveals an interesting interplay with known upper bounds that use the same quantity.
Submitted 8 May, 2016;
originally announced May 2016.
-
A Minimax Theory for Adaptive Data Analysis
Authors:
Yu-Xiang Wang,
Jing Lei,
Stephen E. Fienberg
Abstract:
In adaptive data analysis, the user makes a sequence of queries on the data, where at each step the choice of query may depend on the results in previous steps. The releases are often randomized in order to reduce overfitting for such adaptively chosen queries. In this paper, we propose a minimax framework for adaptive data analysis. Assuming Gaussianity of queries, we establish the first sharp minimax lower bound on the squared error in the order of $O(\frac{\sqrt{k}σ^2}{n})$, where $k$ is the number of queries asked, and $σ^2/n$ is the ordinary signal-to-noise ratio for a single query. Our lower bound is based on the construction of an approximately least favorable adversary who picks a sequence of queries that are most likely to be affected by overfitting. This approximately least favorable adversary uses only one level of adaptivity, suggesting that the minimax risk for 1-step adaptivity with k-1 initial releases and that for $k$-step adaptivity are on the same order. The key technical component of the lower bound proof is a reduction to finding the convoluting distribution that optimally obfuscates the sign of a Gaussian signal. Our lower bound construction also reveals a transparent and elementary proof of the matching upper bound as an alternative approach to Russo and Zou (2015), who used information-theoretic tools to provide the same upper bound. We believe that the proposed framework opens up opportunities to obtain theoretical insights for many other settings of adaptive data analysis, which would extend the idea to more practical realms.
Submitted 12 February, 2016;
originally announced February 2016.
-
Privacy for Free: Posterior Sampling and Stochastic Gradient Monte Carlo
Authors:
Yu-Xiang Wang,
Stephen E. Fienberg,
Alex Smola
Abstract:
We consider the problem of Bayesian learning on sensitive datasets and present two simple but somewhat surprising results that connect Bayesian learning to "differential privacy", a cryptographic approach to protect individual-level privacy while permitting database-level utility. Specifically, we show that under standard assumptions, getting one single sample from a posterior distribution is differentially private "for free". We will see that this estimator is statistically consistent, near optimal and computationally tractable whenever the Bayesian model of interest is consistent, optimal and tractable. Similarly but separately, we show that a recent line of work that uses stochastic gradient for Hybrid Monte Carlo (HMC) sampling also preserves differential privacy with minor or no modifications of the algorithmic procedure at all. These observations lead to an "anytime" algorithm for Bayesian learning under privacy constraints. We demonstrate that it performs much better than the state-of-the-art differentially private methods on synthetic and real datasets.
Submitted 11 April, 2015; v1 submitted 26 February, 2015;
originally announced February 2015.
-
Learning with Differential Privacy: Stability, Learnability and the Sufficiency and Necessity of ERM Principle
Authors:
Yu-Xiang Wang,
Jing Lei,
Stephen E. Fienberg
Abstract:
While machine learning has proven to be a powerful data-driven solution to many real-life problems, its use in sensitive domains has been limited due to privacy concerns. A popular approach known as **differential privacy** offers provable privacy guarantees, but it is often observed in practice that it could substantially hamper learning accuracy. In this paper we study the learnability (whether a problem can be learned by any algorithm) under Vapnik's general learning setting with differential privacy constraint, and reveal some intricate relationships between privacy, stability and learnability.
In particular, we show that a problem is privately learnable **if and only if** there is a private algorithm that asymptotically minimizes the empirical risk (AERM). In contrast, for non-private learning AERM alone is not sufficient for learnability. This result suggests that when searching for private learning algorithms, we can restrict the search to algorithms that are AERM. In light of this, we propose a conceptual procedure that always finds a universally consistent algorithm whenever the problem is learnable under privacy constraint. We also propose a generic and practical algorithm and show that under very general conditions it privately learns a wide class of learning problems. Lastly, we extend some of the results to the more practical $(ε,δ)$-differential privacy and establish the existence of a phase-transition on the class of problems that are approximately privately learnable with respect to how small $δ$ needs to be.
Submitted 27 April, 2016; v1 submitted 22 February, 2015;
originally announced February 2015.
-
Differentially-Private Logistic Regression for Detecting Multiple-SNP Association in GWAS Databases
Authors:
Fei Yu,
Michal Rybar,
Caroline Uhler,
Stephen E. Fienberg
Abstract:
Following the publication of an attack on genome-wide association studies (GWAS) data proposed by Homer et al., considerable attention has been given to developing methods for releasing GWAS data in a privacy-preserving way. Here, we develop an end-to-end differentially private method for solving regression problems with convex penalty functions and selecting the penalty parameters by cross-validation. In particular, we focus on penalized logistic regression with elastic-net regularization, a method widely used in GWAS analyses to identify disease-causing genes. We show how a differentially private procedure for penalized logistic regression with elastic-net regularization can be applied to the analysis of GWAS data and evaluate our method's performance.
Submitted 30 July, 2014;
originally announced July 2014.
-
A Comparison of Blocking Methods for Record Linkage
Authors:
Rebecca C. Steorts,
Samuel L. Ventura,
Mauricio Sadinle,
Stephen E. Fienberg
Abstract:
Record linkage seeks to merge databases and to remove duplicates when unique identifiers are not available. Most approaches use blocking techniques to reduce the computational complexity associated with record linkage. We review traditional blocking techniques, which typically partition the records according to a set of field attributes, and consider two variants of a method known as locality sensitive hashing, sometimes referred to as "private blocking." We compare these approaches in terms of their recall, reduction ratio, and computational complexity. We evaluate these methods using different synthetic datafiles and conclude with a discussion of privacy-related issues.
Submitted 11 July, 2014;
originally announced July 2014.
-
Discussion of "Estimating the Distribution of Dietary Consumption Patterns"
Authors:
Stephen E. Fienberg,
Rebecca C. Steorts
Abstract:
Discussion of "Estimating the Distribution of Dietary Consumption Patterns" by Raymond J. Carroll [arXiv:1405.4667].
Submitted 20 May, 2014; v1 submitted 3 March, 2014;
originally announced March 2014.
-
SMERED: A Bayesian Approach to Graphical Record Linkage and De-duplication
Authors:
Rebecca C. Steorts,
Rob Hall,
Stephen E. Fienberg
Abstract:
We propose a novel unsupervised approach for linking records across arbitrarily many files, while simultaneously detecting duplicate records within files. Our key innovation is to represent the pattern of links between records as a {\em bipartite} graph, in which records are directly linked to latent true individuals, and only indirectly linked to other records. This flexible new representation of the linkage structure naturally allows us to estimate the attributes of the unique observable people in the population, calculate $k$-way posterior probabilities of matches across records, and propagate the uncertainty of record linkage into later analyses. Our linkage structure lends itself to an efficient, linear-time, hybrid Markov chain Monte Carlo algorithm, which overcomes many obstacles encountered by previously proposed methods of record linkage, despite the high dimensional parameter space. We assess our results on real and simulated data.
Submitted 2 March, 2014;
originally announced March 2014.
-
Scalable Privacy-Preserving Data Sharing Methodology for Genome-Wide Association Studies
Authors:
Fei Yu,
Stephen E. Fienberg,
Aleksandra Slavković,
Caroline Uhler
Abstract:
The protection of privacy of individual-level information in genome-wide association study (GWAS) databases has been a major concern of researchers following the publication of "an attack" on GWAS data by Homer et al. (2008). Traditional statistical methods for confidentiality and privacy protection of statistical databases do not scale well to deal with GWAS data, especially in terms of guarantees regarding protection from linkage to external information. The more recent concept of differential privacy, introduced by the cryptographic community, is an approach that provides a rigorous definition of privacy with meaningful privacy guarantees in the presence of arbitrary external information, although the guarantees may come at a serious price in terms of data utility. Building on such notions, Uhler et al. (2013) proposed new methods to release aggregate GWAS data without compromising an individual's privacy. We extend the methods developed in Uhler et al. (2013) for releasing differentially-private $χ^2$-statistics by allowing for an arbitrary number of cases and controls, and for releasing differentially-private allelic test statistics. We also provide a new interpretation by assuming the controls' data are known, which is a realistic assumption because some GWAS use publicly available data as controls. We assess the performance of the proposed methods through a risk-utility analysis on a real data set consisting of DNA samples collected by the Wellcome Trust Case Control Consortium and compare the methods with the differentially-private release mechanism proposed by Johnson and Shmatikov (2013).
Submitted 21 January, 2014;
originally announced January 2014.
-
A Bayesian Approach to Graphical Record Linkage and De-duplication
Authors:
Rebecca C. Steorts,
Rob Hall,
Stephen E. Fienberg
Abstract:
We propose an unsupervised approach for linking records across arbitrarily many files, while simultaneously detecting duplicate records within files. Our key innovation involves the representation of the pattern of links between records as a bipartite graph, in which records are directly linked to latent true individuals, and only indirectly linked to other records. This flexible representation of the linkage structure naturally allows us to estimate the attributes of the unique observable people in the population, calculate transitive linkage probabilities across records (and represent this visually), and propagate the uncertainty of record linkage into later analyses. Our method makes it particularly easy to integrate record linkage with post-processing procedures such as logistic regression, capture-recapture, etc. Our linkage structure lends itself to an efficient, linear-time, hybrid Markov chain Monte Carlo algorithm, which overcomes many obstacles encountered by previous record linkage approaches, despite the high-dimensional parameter space. We illustrate our method using longitudinal data from the National Long Term Care Survey and with data from the Italian Survey on Household and Wealth, where we assess the accuracy of our method and show it to be better in terms of error rates and empirical scalability than other approaches in the literature.
Submitted 30 October, 2015; v1 submitted 17 December, 2013;
originally announced December 2013.
-
From Statistical Evidence to Evidence of Causality
Authors:
Philip Dawid,
Monica Musio,
Stephen E. Fienberg
Abstract:
While statisticians and quantitative social scientists typically study the "effects of causes" (EoC), Lawyers and the Courts are more concerned with understanding the "causes of effects" (CoE). EoC can be addressed using experimental design and statistical analysis, but it is less clear how to incorporate statistical or epidemiological evidence into CoE reasoning, as might be required for a case at Law. Some form of counterfactual reasoning, such as the "potential outcomes" approach championed by Rubin, appears unavoidable, but this typically yields "answers" that are sensitive to arbitrary and untestable assumptions. We must therefore recognise that a CoE question simply might not have a well-determined answer. It is nevertheless possible to use statistical data to set bounds within which any answer must lie. With less than perfect data these bounds will themselves be uncertain, leading to a compounding of different kinds of uncertainty. Still further care is required in the presence of possible confounding factors. In addition, even identifying the relevant "counterfactual contrast" may be a matter of Policy as much as of Science. Defining the question is as non-trivial a task as finding a route towards an answer. This paper develops some technical elaborations of these philosophical points, and illustrates them with an analysis of a case study in child protection.
Keywords: benfluorex, causes of effects, counterfactual, child protection, effects of causes, Fréchet bound, potential outcome, probability of causation
Submitted 25 October, 2014; v1 submitted 29 November, 2013;
originally announced November 2013.
-
A Generalized Fellegi-Sunter Framework for Multiple Record Linkage With Application to Homicide Record Systems
Authors:
Mauricio Sadinle,
Stephen E. Fienberg
Abstract:
We present a probabilistic method for linking multiple datafiles. This task is not trivial in the absence of unique identifiers for the individuals recorded. This is a common scenario when linking census data to coverage measurement surveys for census coverage evaluation, and in general when multiple record-systems need to be integrated for posterior analysis. Our method generalizes the Fellegi-Sunter theory for linking records from two datafiles and its modern implementations. The multiple record linkage goal is to classify the record K-tuples coming from K datafiles according to the different matching patterns. Our method incorporates the transitivity of agreement in the computation of the data used to model matching probabilities. We use a mixture model to fit matching probabilities via maximum likelihood using the EM algorithm. We present a method to decide the record K-tuples membership to the subsets of matching patterns and we prove its optimality. We apply our method to the integration of three Colombian homicide record systems and we perform a simulation study in order to explore the performance of the method under measurement error and different scenarios. The proposed method works well and opens some directions for future research.
Submitted 6 February, 2013; v1 submitted 14 May, 2012;
originally announced May 2012.
-
Privacy-Preserving Data Sharing for Genome-Wide Association Studies
Authors:
Caroline Uhler,
Aleksandra B. Slavkovic,
Stephen E. Fienberg
Abstract:
Traditional statistical methods for confidentiality protection of statistical databases do not scale well to deal with GWAS (genome-wide association studies) databases especially in terms of guarantees regarding protection from linkage to external information. The more recent concept of differential privacy, introduced by the cryptographic community, is an approach which provides a rigorous definition of privacy with meaningful privacy guarantees in the presence of arbitrary external information, although the guarantees come at a serious price in terms of data utility. Building on such notions, we propose new methods to release aggregate GWAS data without compromising an individual's privacy. We present methods for releasing differentially private minor allele frequencies, chi-square statistics and p-values. We compare these approaches on simulated data and on a GWAS study of canine hair length involving 685 dogs. We also propose a privacy-preserving method for finding genome-wide associations based on a differentially-private approach to penalized logistic regression.
Submitted 3 May, 2012;
originally announced May 2012.
-
Rejoinder
Authors:
Stephen E. Fienberg
Abstract:
Rejoinder of "Bayesian Models and Methods in Public Policy and Government Settings" by S. E. Fienberg [arXiv:1108.2177]
Submitted 19 August, 2011;
originally announced August 2011.
-
Bayesian Models and Methods in Public Policy and Government Settings
Authors:
Stephen E. Fienberg
Abstract:
Starting with the neo-Bayesian revival of the 1950s, many statisticians argued that it was inappropriate to use Bayesian methods, and in particular subjective Bayesian methods in governmental and public policy settings because of their reliance upon prior distributions. But the Bayesian framework often provides the primary way to respond to questions raised in these settings and the numbers and diversity of Bayesian applications have grown dramatically in recent years. Through a series of examples, both historical and recent, we argue that Bayesian approaches with formal and informal assessments of priors AND likelihood functions are well accepted and should become the norm in public settings. Our examples include census-taking and small area estimation, US election night forecasting, studies reported to the US Food and Drug Administration, assessing global climate change, and measuring potential declines in disability among the elderly.
Submitted 10 August, 2011;
originally announced August 2011.
-
Discussion of "Network routing in a dynamic environment"
Authors:
Andrew C. Thomas,
Stephen E. Fienberg
Abstract:
Discussion of "Network routing in a dynamic environment" by N.D. Singpurwalla [arXiv:1107.4852]
Submitted 26 July, 2011;
originally announced July 2011.
-
Maximum likelihood estimation in the $β$-model
Authors:
Alessandro Rinaldo,
Sonja Petrović,
Stephen E. Fienberg
Abstract:
We study maximum likelihood estimation for the statistical model for undirected random graphs, known as the $β$-model, in which the degree sequences are minimal sufficient statistics. We derive necessary and sufficient conditions, based on the polytope of degree sequences, for the existence of the maximum likelihood estimator (MLE) of the model parameters. We characterize in a combinatorial fashion sample points leading to a nonexistent MLE, and nonestimability of the probability parameters under a nonexistent MLE. We formulate conditions that guarantee that the MLE exists with probability tending to one as the number of nodes increases.
Submitted 18 June, 2013; v1 submitted 30 May, 2011;
originally announced May 2011.
-
Exploring the Consequences of IED Deployment with a Generalized Linear Model Implementation of the Canadian Traveller Problem
Authors:
Andrew C. Thomas,
Stephen E. Fienberg
Abstract:
The deployment of improvised explosive devices (IEDs) along major roadways has been a favoured strategy of insurgents in recent war zones, both for the ability to cause damage to targets along roadways at minimal cost, but also as a means of controlling the flow of traffic and causing additional expense to opposing forces. Among other related approaches (which we discuss), the adversarial problem has an analogue in the Canadian Traveller Problem, wherein a stretch of road is blocked with some independent probability, and the state of the road is only discovered once the traveller reaches one of the intersections that bound this stretch of road. We discuss the implementation of ideas from social network analysis, namely the notion of "betweenness centrality", and how this can be adapted to the notion of deployment of IEDs with the aid of Generalized Linear Models (GLMs): namely, how we can model the probability of an IED deployment in terms of the increased effort due to Canadian betweenness, how we can include expert judgement on the probability of a deployment, and how we can extend the approach to estimation and updating over several time steps.
Submitted 19 December, 2010;
originally announced December 2010.
-
Introduction to papers on the modeling and analysis of network data---II
Authors:
Stephen E. Fienberg
Abstract:
Introduction to papers on the modeling and analysis of network data---II
Submitted 8 November, 2010;
originally announced November 2010.
-
Introduction to papers on the modeling and analysis of network data
Authors:
Stephen E. Fienberg
Abstract:
Introduction to papers on the modeling and analysis of network data
Submitted 19 October, 2010;
originally announced October 2010.
-
User Interest and Interaction Structure in Online Forums
Authors:
Di Liu,
Daniel Percival,
Stephen E. Fienberg
Abstract:
We present a new similarity measure tailored to posts in an online forum. Our measure takes into account all the available information about user interest and interaction --- the content of posts, the threads in the forum, and the author of the posts. We use this post similarity to build a similarity between users, based on principal coordinate analysis. This allows easy visualization of the user activity as well. Similarity between users has numerous applications, such as clustering or classification. We show that including the author of a post in the post similarity has a smoothing effect on principal coordinate projections. We demonstrate our method on real data drawn from an internal corporate forum, and compare our results to those given by a standard document classification method. We conclude our method gives a more detailed picture of both the local and global network structure.
Submitted 8 September, 2010;
originally announced September 2010.
-
A survey of statistical network models
Authors:
Anna Goldenberg,
Alice X Zheng,
Stephen E Fienberg,
Edoardo M Airoldi
Abstract:
Networks are ubiquitous in science and have become a focal point for discussion in everyday life. Formal statistical models for the analysis of network data have emerged as a major topic of interest in diverse areas of study, and most of these involve a form of graphical representation. Probability models on graphs date back to 1959. Along with empirical studies in social psychology and sociology from the 1960s, these early works generated an active network community and a substantial literature in the 1970s. This effort moved into the statistical literature in the late 1970s and 1980s, and the past decade has seen a burgeoning network literature in statistical physics and computer science. The growth of the World Wide Web and the emergence of online networking communities such as Facebook, MySpace, and LinkedIn, and a host of more specialized professional network communities have intensified interest in the study of networks and network data. Our goal in this review is to provide the reader with an entry point to this burgeoning literature. We begin with an overview of the historical development of statistical network modeling and then we introduce a number of examples that have been studied in the network literature. Our subsequent discussion focuses on a number of prominent static and dynamic network models and their interconnections. We emphasize formal model descriptions, and pay special attention to the interpretation of parameters and their estimation. We end with a description of some open problems and challenges for machine learning and statistics.
Submitted 29 December, 2009;
originally announced December 2009.
-
On the Geometry of Discrete Exponential Families with Application to Exponential Random Graph Models
Authors:
Stephen E. Fienberg,
Alessandro Rinaldo,
Yi Zhou
Abstract:
There has been an explosion of interest in statistical models for analyzing network data, and considerable interest in the class of exponential random graph (ERG) models, especially in connection with difficulties in computing maximum likelihood estimates. The issues associated with these difficulties relate to the broader structure of discrete exponential families. This paper re-examines the issues in two parts. First we consider the closure of $k$-dimensional exponential families of distributions with discrete base measure and polyhedral convex support $\mathrm{P}$. We show that the normal fan of $\mathrm{P}$ is a geometric object that plays a fundamental role in deriving the statistical and geometric properties of the corresponding extended exponential families. We discuss its relevance to maximum likelihood estimation, both from a theoretical and computational standpoint. Second, we apply our results to the analysis of ERG models. By means of a detailed example, we provide some characterization of the properties of ERG models and, in particular, of certain behaviors of ERG models known as degeneracy.
Submitted 30 December, 2008;
originally announced January 2009.
-
Sequential category aggregation and partitioning approaches for multi-way contingency tables based on survey and census data
Authors:
L. Fraser Jackson,
Alistair G. Gray,
Stephen E. Fienberg
Abstract:
Large contingency tables arise in many contexts but especially in the collection of survey and census data by government statistical agencies. Because the vast majority of the variables in this context have a large number of categories, agencies and users need a systematic way of constructing tables which are summaries of such contingency tables. We propose such an approach in this paper by finding members of a class of restricted log-linear models which maximize the likelihood of the data and using these to find a parsimonious means of representing the table. In contrast with more standard approaches for model search in hierarchical log-linear models (HLLM), our procedure systematically reduces the number of categories of the variables. Through a series of examples, we illustrate the extent to which it can preserve the interaction structure found with HLLMs and be used as a data simplification procedure prior to HLL modeling. A feature of the procedure is that it can easily be applied to many tables with millions of cells, providing a new way of summarizing large data sets in many disciplines. The focus is on information and description rather than statistical testing. The procedure may treat each variable in the table in different ways, preserving full detail, treating it as fully nominal, or preserving ordinality.
Submitted 11 November, 2008;
originally announced November 2008.
-
The Early Statistical Years: 1947--1967 A Conversation with Howard Raiffa
Authors:
Stephen E. Fienberg
Abstract:
Howard Raiffa earned his bachelor's degree in mathematics, his master's degree in statistics and his Ph.D. in mathematics at the University of Michigan. Since 1957, Raiffa has been a member of the faculty at Harvard University, where he is now the Frank P. Ramsey Chair in Managerial Economics (Emeritus) in the Graduate School of Business Administration and the Kennedy School of Government. A pioneer in the creation of the field known as decision analysis, his research interests span statistical decision theory, game theory, behavioral decision theory, risk analysis and negotiation analysis. Raiffa has supervised more than 90 doctoral dissertations and written 11 books. His new book is Negotiation Analysis: The Science and Art of Collaborative Decision Making. Another book, Smart Choices, co-authored with his former doctoral students John Hammond and Ralph Keeney, was the CPR (formerly known as the Center for Public Resources) Institute for Dispute Resolution Book of the Year in 1998. Raiffa helped to create the International Institute for Applied Systems Analysis and he later became its first Director, serving in that capacity from 1972 to 1975. His many honors and awards include the Distinguished Contribution Award from the Society of Risk Analysis; the Frank P. Ramsey Medal for outstanding contributions to the field of decision analysis from the Operations Research Society of America; and the Melamed Prize from the University of Chicago Business School for The Art and Science of Negotiation. He earned a Gold Medal from the International Association for Conflict Management and a Lifetime Achievement Award from the CPR Institute for Dispute Resolution. He holds honorary doctor's degrees from Carnegie Mellon University, the University of Michigan, Northwestern University, Ben Gurion University of the Negev and Harvard University. The latter was awarded in 2002.
Submitted 6 August, 2008;
originally announced August 2008.
-
Editorial: Statistics and "The lost tomb of Jesus"
Authors:
Stephen E. Fienberg
Abstract:
What makes a problem suitable for statistical analysis? Are historical and religious questions addressable using statistical calculations? Such issues have long been debated in the statistical community, and statisticians and others have used historical information and texts to analyze such questions as the economics of slavery, the authorship of the Federalist Papers and the question of the existence of God. But what about historical and religious attributions associated with information gathered from archeological finds? In 1980, a construction crew working in the Jerusalem neighborhood of East Talpiot stumbled upon a crypt. Archaeologists from the Israel Antiquities Authority came to the scene and found 10 limestone burial boxes, known as ossuaries, in the crypt. Six of these had inscriptions. The remains found in the ossuaries were reburied, as required by Jewish religious tradition, and the ossuaries were catalogued and stored in a warehouse. The inscriptions on the ossuaries were catalogued and published by Rahmani (1994) and by Kloner (1996) but their reports did not receive widespread public attention. Fast forward to March 2007, when a television "docudrama" entitled "The Lost Tomb of Jesus" aired on The Discovery Channel and touched off a public and religious controversy--one only need think about the title to see why there might be a controversy! The program, and a simultaneously published book [Jacobovici and Pellegrino (2007)], described the "rediscovery" of the East Talpiot archeological find and presented interpretations of the ossuary inscriptions from a number of perspectives. Among these was a statistical calculation attributed to the statistician Andrey Feuerverger: "that the odds that all six names would appear together in one tomb are 1 in 600, calculated conservatively--or possibly even as much as one in one million."
Submitted 26 March, 2008;
originally announced March 2008.
-
Describing disability through individual-level mixture models for multivariate binary data
Authors:
Elena A. Erosheva,
Stephen E. Fienberg,
Cyrille Joutard
Abstract:
Data on functional disability are of widespread policy interest in the United States, especially with respect to planning for Medicare and Social Security for a growing population of elderly adults. We consider an extract of functional disability data from the National Long Term Care Survey (NLTCS) and attempt to develop disability profiles using variations of the Grade of Membership (GoM) model. We first describe GoM as an individual-level mixture model that allows individuals to have partial membership in several mixture components simultaneously. We then prove the equivalence between individual-level and population-level mixture models, and use this property to develop a Markov Chain Monte Carlo algorithm for Bayesian estimation of the model. We use our approach to analyze functional disability data from the NLTCS.
Submitted 13 December, 2007;
originally announced December 2007.
-
Editorial: Statistics and forensic science
Authors:
Stephen E. Fienberg
Abstract:
Forensic science is usually taken to mean the application of a broad spectrum of scientific tools to answer questions of interest to the legal system. Despite such popular television series as CSI: Crime Scene Investigation and its spinoffs--CSI: Miami and CSI: New York--on which the forensic scientists use the latest high-tech scientific tools to identify the perpetrator of a crime, and always in under an hour, forensic science is under assault in the public media, in popular magazines [Talbot (2007), Toobin (2007)] and in the scientific literature [Kennedy (2003), Saks and Koehler (2005)]. Ironically, this growing controversy over forensic science has occurred precisely at the time that DNA evidence has become the "gold standard" in the courts, leading to the overturning of hundreds of convictions, many of which were based on clearly less credible forensic evidence, including eyewitness testimony [Berger (2006)].
Submitted 6 December, 2007;
originally announced December 2007.
-
William Kruskal: My Scholarly and Scientific Model
Authors:
Stephen E. Fienberg
Abstract:
Discussion of "The William Kruskal Legacy: 1919--2005" by Stephen E. Fienberg, Stephen M. Stigler and Judith M. Tanur [arXiv:0710.5063]
Submitted 26 October, 2007;
originally announced October 2007.
-
The William Kruskal Legacy: 1919--2005
Authors:
Stephen E. Fienberg,
Stephen M. Stigler,
Judith M. Tanur
Abstract:
William Kruskal (Bill) was a distinguished statistician who spent virtually his entire professional career at the University of Chicago, and who had a lasting impact on the Institute of Mathematical Statistics and on the field of statistics more broadly, as well as on many who came in contact with him. Bill passed away last April following an extended illness, and on May 19, 2005, the University of Chicago held a memorial service at which several of Bill's colleagues and collaborators spoke along with members of his family and other friends. This biography and the accompanying commentaries derive in part from brief presentations on that occasion, along with recollections and input from several others. Bill was known personally to most of an older generation of statisticians as an editor and as an intellectual and professional leader. In 1994, Statistical Science published an interview by Sandy Zabell (Vol. 9, 285--303) in which Bill looked back on selected events in his professional life. One of the purposes of the present biography and accompanying commentaries is to reintroduce him to old friends and to introduce him for the first time to new generations of statisticians who never had an opportunity to interact with him and to fall under his influence.
Submitted 26 October, 2007;
originally announced October 2007.
-
Maximum Likelihood Estimation in Latent Class Models For Contingency Table Data
Authors:
S. E. Fienberg,
P. Hersh,
A. Rinaldo,
Y. Zhou
Abstract:
Statistical models with latent structure have a history going back to the 1950s and have seen widespread use in the social sciences and, more recently, in computational biology and in machine learning. Here we study the basic latent class model proposed originally by the sociologist Paul F. Lazarsfeld for categorical variables, and we explain its geometric structure. We draw parallels between the statistical and geometric properties of latent class models and we illustrate geometrically the causes of many problems associated with maximum likelihood estimation and related statistical inference. In particular, we focus on issues of non-identifiability and determination of the model dimension, of maximization of the likelihood function and on the effect of symmetric data. We illustrate these phenomena with a variety of synthetic and real-life tables, of different dimension and complexity. Much of the motivation for this work stems from the "100 Swiss Francs" problem, which we introduce and describe in detail.
Submitted 21 September, 2007;
originally announced September 2007.
-
A statistical approach to simultaneous mapping and localization for mobile robots
Authors:
Anita Araneda,
Stephen E. Fienberg,
Alvaro Soto
Abstract:
Mobile robots require basic information to navigate through an environment: they need to know where they are (localization) and they need to know where they are going. For the latter, robots need a map of the environment. Using sensors of a variety of forms, robots gather information as they move through an environment in order to build a map. In this paper we present a novel sampling algorithm for solving the simultaneous mapping and localization (SLAM) problem in indoor environments. We approach the problem from a Bayesian statistics perspective. The data correspond to a set of range finder and odometer measurements, obtained at discrete time instants. We focus on the estimation of the posterior distribution over the space of possible maps given the data. By exploiting different factorizations of this distribution, we derive three sampling algorithms based on importance sampling. We illustrate the results of our approach by testing the algorithms with two real data sets obtained through robot navigation inside office buildings at Carnegie Mellon University and the Pontificia Universidad Catolica de Chile.
Submitted 31 August, 2007;
originally announced August 2007.
-
Mixed membership stochastic blockmodels
Authors:
Edoardo M Airoldi,
David M Blei,
Stephen E Fienberg,
Eric P Xing
Abstract:
Observations consisting of measurements on relationships for pairs of objects arise in many settings, such as protein interaction and gene regulatory networks, collections of author-recipient email, and social networks. Analyzing such data with probabilistic models can be delicate because the simple exchangeability assumptions underlying many boilerplate models no longer hold. In this paper, we describe a latent variable model of such data called the mixed membership stochastic blockmodel. This model extends blockmodels for relational data to ones which capture mixed membership latent relational structure, thus providing an object-specific low-dimensional representation. We develop a general variational inference algorithm for fast approximate posterior inference. We explore applications to social and protein interaction networks.
Submitted 30 May, 2007;
originally announced May 2007.