-
Two-Sample Hypothesis Testing for Large Random Graphs of Unequal Size
Authors:
Xin **,
Kit Chan,
Ian Barnett,
Riddhi Pratim Ghosh
Abstract:
Two-sample hypothesis testing for large graphs is popular in cognitive science, probabilistic machine learning and artificial intelligence. While numerous methods have been proposed in the literature to address this problem, less attention has been devoted to scenarios involving graphs of unequal size or situations where there are only one or a few samples of graphs. In this article, we propose a Frobenius test statistic tailored for small sample sizes and unequal-sized random graphs to test whether they are generated from the same model or not. Our approach involves an algorithm for generating bootstrapped adjacency matrices from estimated community-wise edge probability matrices, forming the basis of the Frobenius test statistic. We derive the asymptotic distribution of the proposed test statistic and validate its stability and efficiency in detecting minor differences in underlying models through simulations. Furthermore, we explore its application to fMRI data where we are able to distinguish brain activity patterns when subjects are exposed to sentences and pictures for two different stimuli and the control group.
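To make the idea of comparing graphs of unequal size through community-wise edge probabilities concrete, here is a minimal sketch. It is not the authors' test statistic or bootstrap algorithm: it assumes community labels are already known, and the helper names `block_edge_probabilities` and `frobenius_distance` are hypothetical. It only shows that two graphs with different numbers of nodes can be reduced to same-shaped block probability matrices whose difference has a Frobenius norm.

```python
from math import sqrt


def block_edge_probabilities(adj, labels, k):
    """Estimate a k x k matrix of community-wise edge probabilities.

    adj    : symmetric 0/1 adjacency matrix (list of lists), no self-loops.
    labels : labels[i] in range(k) is the community of node i
             (assumed known here; in practice it would be estimated).
    """
    edges = [[0] * k for _ in range(k)]
    pairs = [[0] * k for _ in range(k)]
    n = len(adj)
    for i in range(n):
        for j in range(i + 1, n):
            a, b = labels[i], labels[j]
            # The set collapses (a, b) and (b, a) when a == b, so
            # within-community pairs are counted exactly once.
            for r, s in {(a, b), (b, a)}:
                edges[r][s] += adj[i][j]
                pairs[r][s] += 1
    # Blocks with no node pairs are set to 0.0 (an arbitrary convention).
    return [
        [edges[r][s] / pairs[r][s] if pairs[r][s] else 0.0 for s in range(k)]
        for r in range(k)
    ]


def frobenius_distance(p, q):
    """Frobenius norm of the difference between two same-shaped matrices."""
    return sqrt(sum((x - y) ** 2 for rp, rq in zip(p, q) for x, y in zip(rp, rq)))


# A 4-node graph and a 6-node graph, each with two fully connected
# communities and no edges between communities.
small = [[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]]
large = [[1 if i != j and (i < 3) == (j < 3) else 0 for j in range(6)] for i in range(6)]
p_small = block_edge_probabilities(small, [0, 0, 1, 1], 2)
p_large = block_edge_probabilities(large, [0, 0, 0, 1, 1, 1], 2)
distance = frobenius_distance(p_small, p_large)
```

Because both graphs are summarised by a 2 x 2 matrix, the comparison does not depend on the graphs having the same number of nodes.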
Submitted 16 February, 2024;
originally announced February 2024.
-
A Generalized Estimating Equation Approach to Network Regression
Authors:
Riddhi Pratim Ghosh,
Jukka-Pekka Onnela,
Ian Barnett
Abstract:
Regression models applied to network data where node attributes are the dependent variables pose a methodological challenge. As has been well studied, naive regression neither properly accounts for community structure nor accounts for the dependent variable acting as both model outcome and covariate. To address this methodological gap, we propose a network regression model motivated by the important observation that controlling for community structure can, when a network is modular, significantly account for meaningful correlation between observations induced by network connections. We propose a generalized estimating equation (GEE) approach to learn model parameters based on clusters defined through any single-membership community detection algorithm applied to the observed network. We provide a necessary condition on the network size and edge formation probabilities to establish the asymptotic normality of the model parameters under the assumption that the graph structure is a stochastic block model. We evaluate the performance of our approach through simulations and apply it to estimate the joint impact of baseline covariates and network effects on COVID-19 incidence rate among countries connected by a network of commercial airline traffic. We find that the network effect has some influence at the beginning of the pandemic, whereas after the travel ban was in effect the percentage of urban population has more influence on the incidence rate than the network effect.
Submitted 14 February, 2024; v1 submitted 11 January, 2023;
originally announced January 2023.
-
Selecting a significance level in sequential testing procedures for community detection
Authors:
Riddhi Pratim Ghosh,
Ian Barnett
Abstract:
While there have been numerous sequential algorithms developed to estimate community structure in networks, there is little available guidance and study of what significance level or stopping parameter to use in these sequential testing procedures. Most algorithms rely on prespecifying the number of communities or use an arbitrary stopping rule. We provide a principled approach to selecting a nominal significance level for sequential community detection procedures by controlling the tolerance ratio, defined as the ratio of underfitting and overfitting probability of estimating the number of clusters in fitting a network. We introduce an algorithm for specifying this significance level from a user-specified tolerance ratio, and demonstrate its utility with a sequential modularity maximization approach in a stochastic block model framework. We evaluate the performance of the proposed algorithm through extensive simulations and demonstrate its utility in controlling the tolerance ratio in single-cell RNA sequencing clustering by cell type and by clustering a congressional voting network.
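The tolerance ratio defined above can be illustrated with a small empirical estimate. This is a sketch under stated assumptions, not the authors' algorithm for choosing a significance level: it assumes the ratio is taken as underfitting probability divided by overfitting probability, that the true number of communities is known (as in a simulation), and the function name `tolerance_ratio` is hypothetical.

```python
def tolerance_ratio(k_estimates, k_true):
    """Empirical ratio P(K_hat < K) / P(K_hat > K) over repeated fits.

    k_estimates : estimated numbers of communities, one per simulated network.
    k_true      : true number of communities (known only in simulation).
    """
    under = sum(1 for k in k_estimates if k < k_true)  # underfitting count
    over = sum(1 for k in k_estimates if k > k_true)   # overfitting count
    if over == 0:
        # No overfitting observed: the ratio is unbounded, or undefined
        # when there is no underfitting either.
        return float("inf") if under else float("nan")
    # Dividing counts equals dividing proportions; the sample size cancels.
    return under / over


# Hypothetical estimates from seven simulated networks with 3 true communities.
ratio = tolerance_ratio([2, 3, 3, 4, 4, 4, 3], k_true=3)
```

A ratio below 1 means the procedure overfits more often than it underfits at the chosen significance level; a ratio above 1 means the reverse.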
Submitted 15 September, 2022;
originally announced September 2022.
-
Adaptive Bayesian Variable Clustering via Structural Learning of Breast Cancer Data
Authors:
Riddhi Pratim Ghosh,
Arnab Kumar Maity,
Mohsen Pourahmadi,
Bani K. Mallick
Abstract:
Clustering of proteins is of interest in cancer cell biology. This article proposes a hierarchical Bayesian model for protein (variable) clustering hinging on correlation structure. Starting from a multivariate normal likelihood, we enforce the clustering through prior modeling using an angle-based unconstrained reparameterization of correlations and assume a truncated Poisson distribution (to penalize a large number of clusters) as the prior on the number of clusters. The posterior distributions of the parameters are not in explicit form, and a reversible jump Markov chain Monte Carlo (RJMCMC) based technique is used to simulate the parameters from the posteriors. The end products of the proposed method are the estimated cluster configuration of the proteins (variables) along with the number of clusters. The Bayesian method is flexible enough to cluster the proteins as well as estimate the number of clusters. The performance of the proposed method has been substantiated with extensive simulation studies and a protein expression data set with a hereditary disposition in breast cancer where the proteins come from different pathways.
Submitted 8 February, 2022;
originally announced February 2022.