Hypothesis testing for homogeneity of nodes in $\beta$-models
Abstract
The $\beta$-model has been extensively utilized to model degree heterogeneity in networks, wherein each node is assigned a unique parameter. In this article, we consider the hypothesis testing problem that two nodes $i$ and $j$ of a $\beta$-model have the same node parameter. We prove that, under the null hypothesis, the proposed statistic converges in distribution to the standard normal distribution. Further, we investigate the homogeneous test for the $\beta$-model by combining individual $p$-values to aggregate small effects of multiple tests. Both simulation studies and real-world data examples indicate that the proposed method works well.
Keywords: $\beta$-model; Combination of $p$-values; Hypothesis testing; Network data
1 Introduction
Network models are widely used to characterize the interactions between different entities (Scott, 2000). Studies of network data have attracted considerable attention in many fields, such as computer science, social science, and biology. For example, in a social network, an interaction between two individuals may represent a friendship (Hunter et al., 2012). In general, an undirected and unweighted network with $n$ nodes can be represented by an adjacency matrix $A = (A_{ij})_{n \times n}$, where the $(i,j)$-th entry $A_{ij}$ indicates whether there is a connection between node $i$ and node $j$, i.e., $A_{ij} = 1$ if there is a connection between node $i$ and node $j$ and $A_{ij} = 0$ otherwise. In network data analysis, the $\beta$-model, proposed by Chatterjee et al. (2011), is a special case of a class of models known as node-parameter models, where each node degree is associated with a corresponding parameter. Specifically, the $\beta$-model assumes that the edge between node $i$ and node $j$ exists with probability
$$P(A_{ij} = 1) = \frac{e^{\beta_i + \beta_j}}{1 + e^{\beta_i + \beta_j}}, \quad 1 \le i < j \le n,$$
independently of all other edges, where $\beta_i$ is the node parameter (also known as the “attractiveness”) of node $i$. The $\beta$-model is an exponential random graph model and can be seen as an undirected version of the $p_1$-model (Holland and Leinhardt, 1981). An advantage of the $\beta$-model is that the degree sequence is the unique sufficient statistic. Hence, the $\beta$-model is widely used to model networks with degree heterogeneity. The probability of connecting node $i$ and node $j$ depends only on the parameters $\beta_i$ and $\beta_j$. When all $\beta_i$'s are equal, the $\beta$-model degenerates to the Erdős-Rényi (E-R) model. To fit a sparse network, Mukherjee et al. (2018) proposed the adjusted $\beta$-model
$$P(A_{ij} = 1) = \frac{\lambda}{n} \, \frac{e^{\beta_i + \beta_j}}{1 + e^{\beta_i + \beta_j}},$$
where $\lambda$ is used to measure the sparsity of the graph. Since the $\beta$-model can capture important features of real-world networks, the $\beta$-model and its variations have been studied widely in recent years (Chatterjee et al., 2011; Yan and Xu, 2013; Rinaldo et al., 2013; Ogawa et al., 2013; Yan et al., 2015, 2016).
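To make the model concrete, the following minimal Python sketch draws an adjacency matrix from the $\beta$-model, with each edge present independently with probability $e^{\beta_i+\beta_j}/(1+e^{\beta_i+\beta_j})$. The function names `edge_prob` and `sample_beta_model` and the parameter values are illustrative assumptions and are not taken from the original study.

```python
import math
import random


def edge_prob(beta_i, beta_j):
    """Edge probability e^{b_i + b_j} / (1 + e^{b_i + b_j}) of the beta-model."""
    return 1.0 / (1.0 + math.exp(-(beta_i + beta_j)))


def sample_beta_model(beta, seed=0):
    """Draw a symmetric 0/1 adjacency matrix with no self-loops.

    Each pair i < j is connected independently with probability
    edge_prob(beta[i], beta[j]).
    """
    rng = random.Random(seed)
    n = len(beta)
    A = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < edge_prob(beta[i], beta[j]):
                A[i][j] = A[j][i] = 1
    return A


# Hypothetical parameters: five low-degree and five high-degree nodes.
beta = [-1.0] * 5 + [1.0] * 5
A = sample_beta_model(beta, seed=1)
degrees = [sum(row) for row in A]  # degree sequence d_1, ..., d_n
```

When all entries of `beta` are equal, every pair has the same edge probability and the sketch reduces to sampling an E-R graph.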
Hypothesis testing plays a critical role in the study of network data (Fu et al., 2022, 2023). One significant application is to recover the community structure of a network. Bickel and Sarkar (2016) and Dong et al. (2020) used a spectral statistic of the normalized adjacency matrix to test whether the network has a community structure, i.e., whether the number of communities $K$ equals one or exceeds one in stochastic block models. Then, Cammarata and Ke (2023) considered the global testing problem under the framework of degree-corrected mixed membership models. Further, a number of goodness-of-fit tests for stochastic block models have also been proposed; see, e.g., Lei (2016); Hu et al. (2021); Jin et al. (2023). Under the setting of degree-corrected mixed membership models, Fan et al. (2022) studied hypothesis testing for the equality of membership vectors between two nodes, up to a possible scaling. Similarly, Du and Tang (2023) investigated the equality of latent positions between two nodes. Their method is based on the Mahalanobis distance between two vectors and generalizes the corresponding results in Fan et al. (2022).
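Hypothesis testing for $\beta$-models is a nascent research area. Motivated by the problem of testing the equality of two nodes, we consider the hypothesis that two nodes $i$ and $j$ of a $\beta$-model have the same node parameter. Specifically, we consider the following test:
$$H_0: \beta_i = \beta_j \quad \text{versus} \quad H_1: \beta_i \neq \beta_j \tag{1.1}$$
for any $i, j \in [n]$ with $i \neq j$, where $[n] = \{1, \ldots, n\}$. Another significant problem is the homogeneous test, i.e.,
$$H_0: \beta_1 = \beta_2 = \cdots = \beta_n \quad \text{versus} \quad H_1: \beta_i \neq \beta_j \ \text{for some}\ i \neq j. \tag{1.2}$$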
For test (1.2), the null hypothesis implies that there is no heterogeneity in the network, so the network can be seen as an E-R graph. For the adjusted $\beta$-model, Mukherjee et al. (2018) considered a homogeneous null hypothesis with all $\beta_i$ equal to 0 against an alternative hypothesis with a subset of the $\beta_i$ strictly greater than 0. They proposed three explicit degree-based test statistics, the third being a criticism test, and established their asymptotic null distributions under some mild conditions. Similarly, under the $\beta$-model, Yan et al. (2022) investigated two testing problems: for a fixed $r$, the specified null $H_0: \beta_i = \beta_i^0$ for $i = 1, \ldots, r$ and the homogeneous null $H_0: \beta_1 = \cdots = \beta_r$, where $\beta_i^0$ is known. For the two nulls, they established the Wilks' theorem for $\beta$-models, i.e., the log-likelihood ratio statistic $2\{\ell(\hat\beta) - \ell(\hat\beta^0)\}$ converges in distribution to a chi-square distribution with $r$ and $r - 1$ degrees of freedom, respectively, where $\hat\beta$ and $\hat\beta^0$ are the unrestricted and restricted maximum likelihood estimators of $\beta$, and $\ell(\cdot)$ is the log-likelihood function. Compared with their settings, the advantages of our setting are as follows. First, our null hypothesis covers a wider range of parameters than that in Mukherjee et al. (2018), since we do not require all parameters to be equal to zero. Second, we only need the unrestricted maximum likelihood estimate, which reduces the computational cost.
The rest of this article is organized as follows. In Section 2, we present our main method and theorems about the test for equality of node parameters. The homogeneous test for the $\beta$-model is investigated in Section 3. Simulation studies and a real-world data example are given in Sections 4 and 5, respectively. Section 6 concludes the article. Technical proofs are given in the Appendix.
2 Hypothesis testing for equality of node parameters
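Formally, suppose that $A$ is the adjacency matrix of an undirected graph generated from the $\beta$-model with parameter $\beta = (\beta_1, \ldots, \beta_n)^\top$, where $\beta$ is unknown. Throughout this article, we assume that self-loops are not allowed, i.e., $A_{ii} = 0$ for $i = 1, \ldots, n$. Let $d_i = \sum_{j \neq i} A_{ij}$ be the degree of node $i$. Then, the logarithm of the likelihood function can be written as:
$$\ell(\beta) = \sum_{i=1}^{n} \beta_i d_i - \sum_{1 \le i < j \le n} \log\left(1 + e^{\beta_i + \beta_j}\right).$$
Denote by $\hat\beta = (\hat\beta_1, \ldots, \hat\beta_n)^\top$ the maximum likelihood estimator (MLE). The MLE can be obtained by solving the following equations:
$$d_i = \sum_{j \neq i} \frac{e^{\hat\beta_i + \hat\beta_j}}{1 + e^{\hat\beta_i + \hat\beta_j}}, \quad i = 1, \ldots, n. \tag{2.1}$$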
Chatterjee et al. (2011) showed that a fixed-point iterative algorithm can be used to compute $\hat\beta$. Under the framework of the $\beta$-model, Chatterjee et al. (2011) established the consistency of $\hat\beta$. Specifically, let $L = \max_{1 \le i \le n} |\beta_i|$; then there is a constant $C(L)$ depending only on $L$ such that $\max_{1 \le i \le n} |\hat\beta_i - \beta_i| \le C(L)\sqrt{n^{-1}\log n}$ with probability at least $1 - C(L) n^{-2}$. Further, by approximating the inverse of the Fisher information matrix, Yan and Xu (2013) proved the asymptotic normality of $\hat\beta$. Then, Rinaldo et al. (2013) gave necessary and sufficient conditions for the existence and uniqueness of $\hat\beta$.
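The fixed-point iteration can be sketched as follows. Rewriting the likelihood equations (2.1) as $\beta_i = \log d_i - \log \sum_{j \neq i} 1/(e^{-\beta_j} + e^{\beta_i})$ suggests iterating this map until it stabilizes. The code below is a minimal sketch in the spirit of Chatterjee et al. (2011); the function name `fit_beta_mle`, the zero starting point, the tolerance, and the stopping rule are our assumptions.

```python
import math


def fit_beta_mle(degrees, tol=1e-10, max_iter=10000):
    """Fixed-point iteration for the beta-model likelihood equations.

    Repeats beta_i <- log d_i - log sum_{j != i} 1 / (exp(-beta_j) + exp(beta_i)),
    updating all coordinates simultaneously, until the largest change is below tol.
    """
    n = len(degrees)
    # Necessary (not sufficient) condition for a finite MLE.
    if any(d <= 0 or d >= n - 1 for d in degrees):
        raise ValueError("a finite MLE requires 0 < d_i < n - 1 for every node")
    beta = [0.0] * n
    for _ in range(max_iter):
        new_beta = []
        for i in range(n):
            s = sum(1.0 / (math.exp(-beta[j]) + math.exp(beta[i]))
                    for j in range(n) if j != i)
            new_beta.append(math.log(degrees[i]) - math.log(s))
        change = max(abs(a - b) for a, b in zip(new_beta, beta))
        beta = new_beta
        if change < tol:
            return beta
    raise RuntimeError("fixed-point iteration did not converge")


# Hypothetical check: a 3-regular degree sequence on 10 nodes.
beta_hat = fit_beta_mle([3] * 10)
```

For a regular degree sequence all fitted parameters coincide. With $d_i = 3$ and $n = 10$, the fitted edge probability is $3/9$, so each $\hat\beta_i$ equals $\frac{1}{2}\log\{(1/3)/(2/3)\} = \frac{1}{2}\log(1/2)$.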
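Denote the Fisher information matrix for $\beta$ as $V = (v_{ij})_{n \times n}$, where
$$v_{ij} = \frac{e^{\beta_i + \beta_j}}{(1 + e^{\beta_i + \beta_j})^2} \ \ (i \neq j), \qquad v_{ii} = \sum_{j \neq i} v_{ij}.$$
Note that $V$ is also the covariance matrix of the degree sequence $d = (d_1, \ldots, d_n)^\top$. Then, Yan and Xu (2013) established the following central limit theorem: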
Lemma 1.
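If $L = o(\log\log n)$, then for any fixed $r \ge 1$, the vector consisting of the first $r$ elements of $S^{-1/2}(\hat\beta - \beta)$ is asymptotically standard multivariate normal as $n \to \infty$, where $S = \operatorname{diag}(1/v_{11}, \ldots, 1/v_{nn})$ and $L = \max_{1 \le i \le n} |\beta_i|$.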
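Lemma 1 implies that, for any $i$, the following result holds:
$$\sqrt{v_{ii}}\,(\hat\beta_i - \beta_i) \xrightarrow{d} N(0, 1),$$
and $\hat\beta_i$ and $\hat\beta_j$ are asymptotically independent for any $i \neq j$. Then, for a pair of nodes $(i, j)$, we have
$$\frac{(\hat\beta_i - \hat\beta_j) - (\beta_i - \beta_j)}{\sqrt{1/v_{ii} + 1/v_{jj}}} \xrightarrow{d} N(0, 1).$$
Under the null hypothesis of test (1.1), we have
$$\frac{\hat\beta_i - \hat\beta_j}{\sqrt{1/v_{ii} + 1/v_{jj}}} \xrightarrow{d} N(0, 1). \tag{2.2}$$
Consider the statistic $T_{ij} = (\hat\beta_i - \hat\beta_j)/\sqrt{1/v_{ii} + 1/v_{jj}}$. Then, under $H_0$, we have $T_{ij} \xrightarrow{d} N(0, 1)$. Notice that the statistic $T_{ij}$ involves the unknown parameters $v_{ii}$ and $v_{jj}$. Hence, we consider a natural estimate of $T_{ij}$ obtained by plugging in the estimated parameters $\hat v_{ii}$ and $\hat v_{jj}$, where
$$\hat v_{ii} = \sum_{k \neq i} \frac{e^{\hat\beta_i + \hat\beta_k}}{(1 + e^{\hat\beta_i + \hat\beta_k})^2}.$$
Denote the empirical estimate of $T_{ij}$ by $\hat T_{ij} = (\hat\beta_i - \hat\beta_j)/\sqrt{1/\hat v_{ii} + 1/\hat v_{jj}}$. It is natural to conjecture that when the estimates $\hat v_{ii}$ and $\hat v_{jj}$ are accurate enough, the convergence in (2.2) will still hold for $\hat T_{ij}$.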
Formally, we have the following theorem:
Theorem 1.
Let be an adjacency matrix generated from a -model with parameter . Under , when , we have the following result:
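$$\hat T_{ij} = \frac{\hat\beta_i - \hat\beta_j}{\sqrt{1/\hat v_{ii} + 1/\hat v_{jj}}} \xrightarrow{d} N(0, 1). \tag{2.3}$$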
Under , we assume that . Then, we have
(2.4) |
We postpone the proof to the Appendix. Theorem 1 is an intuitive result: the method is similar to the two-sample test of means when the variance is unknown. The statistic has different means under the null and the alternative, which allows us to carry out the hypothesis test. Specifically, given a nominal level $\alpha$, we have the rejection rule:
$$\text{Reject } H_0 \text{ if } |\hat T_{ij}| > z_{\alpha/2}, \tag{2.5}$$
where $\hat T_{ij}$ denotes the plug-in statistic of Theorem 1 and $z_{\alpha/2}$ is the upper $\alpha/2$-th quantile of the standard normal distribution.
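The pairwise test can be sketched in a few lines of Python once the fitted parameters are available. The sketch below assumes the plug-in form $(\hat\beta_i - \hat\beta_j)/\sqrt{1/\hat v_{ii} + 1/\hat v_{jj}}$ with $\hat v_{ii} = \sum_{k \neq i} \hat p_{ik}(1 - \hat p_{ik})$, and it assumes a two-sided $p$-value $2\{1 - \Phi(|\cdot|)\}$. The function names and the example parameters are hypothetical.

```python
import math
from statistics import NormalDist


def v_hat(beta_hat, i):
    """Plug-in estimate of v_ii: sum over j != i of p_ij * (1 - p_ij)."""
    total = 0.0
    for j, b_j in enumerate(beta_hat):
        if j != i:
            p = 1.0 / (1.0 + math.exp(-(beta_hat[i] + b_j)))
            total += p * (1.0 - p)
    return total


def pairwise_test(beta_hat, i, j, alpha=0.05):
    """Return (statistic, two-sided p-value, reject?) for H0: beta_i = beta_j."""
    se = math.sqrt(1.0 / v_hat(beta_hat, i) + 1.0 / v_hat(beta_hat, j))
    stat = (beta_hat[i] - beta_hat[j]) / se
    # 2 * (1 - Phi(|stat|)) written with the complementary error function.
    p_value = math.erfc(abs(stat) / math.sqrt(2.0))
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # upper alpha/2 quantile
    return stat, p_value, abs(stat) > z


# Hypothetical fitted parameters (not taken from the paper).
beta_hat = [1.0] + [0.0] * 9
stat, p_value, reject = pairwise_test(beta_hat, 0, 1)
```

In this small example the parameter difference of 1 is roughly one standard error, so the test does not reject at level 0.05. This illustrates why moderate differences are hard to detect when the nodes have few potential neighbours.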
3 Hypothesis testing for homogeneity
In this section, we consider homogeneity testing for the $\beta$-model. Under the null hypothesis of test (1.2), the $\beta$-model reduces to the E-R model, so the homogeneous test evaluates whether there is heterogeneity among the nodes of the network. For test (1.2), the alternative hypothesis implies that there is at least one pair of nodes with unequal node parameters. Hence, applying the pairwise test (1.1) to such a pair $(i, j)$ should lead to rejection of the null hypothesis. Intuitively, we can apply test (1.1) to all pairs of nodes $(i, j)$ with $1 \le i < j \le n$, which leads to $n(n-1)/2$ test results. The difficulty is that the resulting statistics are correlated, so we need a way to combine the information from these dependent results.
In meta-analysis, methods for combining multiple test statistics are widely used in massive data analysis. Specifically, suppose we independently test the same hypothesis using $m$ different statistical tests and obtain $p$-values $p_1, \ldots, p_m$. An important issue is how to combine them into a single $p$-value. Notice that, under the null hypothesis, all $p_k$'s should follow the uniform distribution on the interval $[0, 1]$. Hence, the null hypothesis can be rewritten as
$$H_0: p_k \sim U[0, 1], \quad k = 1, \ldots, m.$$
The six simplest and most commonly used statistics for combining $p$-values are Fisher's log-product statistic (Fisher, 1932), Pearson's statistic (Pearson, 1933), the logit statistic (Mudholkar and George, 1979), Edgington's additive statistic (Edgington, 1972), Stouffer's $Z$-score statistic (Stouffer et al., 1949), and Tippett's minimum $p$-value statistic (Tippett, 1931). However, an obvious deficiency is that, when there is a dependence structure among the $p_k$'s, none of these six methods is guaranteed to work, because their null distributions are derived under independence. Liu and Xie (2020) therefore proposed a Cauchy combination method that takes advantage of the Cauchy distribution. A nonasymptotic result was established to demonstrate that the tail of the null distribution can be effectively approximated by a Cauchy distribution under arbitrary dependency structures. Specifically, the Cauchy test statistic has the form $T = \sum_{k=1}^{m} w_k \tan\{(0.5 - p_k)\pi\}$, where the weights $w_k$'s are nonnegative and $\sum_{k=1}^{m} w_k = 1$.
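Recall the homogeneous test (1.2). For any pair of nodes $(i, j)$, we can calculate the statistic $\hat T_{ij}$ and the $p$-value $p_{ij}$. Under the null $H_0$, all $p_{ij}$'s should follow the uniform distribution on the interval $[0, 1]$, but they are not independent. Hence, we consider the Cauchy combination statistic:
$$T_C = \frac{2}{n(n-1)} \sum_{1 \le i < j \le n} \tan\{(0.5 - p_{ij})\pi\}.$$
According to the results in Liu and Xie (2020), the test statistic $T_C$ has approximately a Cauchy tail even when the $p_{ij}$'s are dependent, i.e.,
$$\lim_{t \to \infty} \frac{P(T_C > t)}{P(W > t)} = 1,$$
where $W$ denotes a standard Cauchy random variable. Then, for a given nominal level $\alpha$, we have the rejection rule:
$$\text{Reject } H_0 \text{ if } T_C > c_\alpha,$$
where $c_\alpha$ is the upper $\alpha$-th quantile of the standard Cauchy distribution.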
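The combination step can be sketched as follows. The code computes the Cauchy combination statistic of Liu and Xie (2020) and converts it to a combined $p$-value through the upper tail of the standard Cauchy distribution, $P(W > t) = 1/2 - \arctan(t)/\pi$. Equal weights, the function names, and the example $p$-values are assumptions made for illustration.

```python
import math


def cauchy_combination(p_values, weights=None):
    """Cauchy combination of p-values (Liu and Xie, 2020).

    Returns the statistic sum_k w_k * tan((0.5 - p_k) * pi) and the combined
    p-value given by the upper tail of the standard Cauchy distribution.
    """
    m = len(p_values)
    if weights is None:
        weights = [1.0 / m] * m  # equal weights, an assumption of this sketch
    stat = sum(w * math.tan((0.5 - p) * math.pi)
               for w, p in zip(weights, p_values))
    combined_p = 0.5 - math.atan(stat) / math.pi  # P(W > stat), W standard Cauchy
    return stat, combined_p


def homogeneity_test(pairwise_p_values, alpha=0.05):
    """Reject the homogeneous null when the combined p-value is below alpha."""
    stat, combined_p = cauchy_combination(pairwise_p_values)
    return stat, combined_p, combined_p < alpha


# Hypothetical pairwise p-values for illustration only.
stat, combined_p, reject = homogeneity_test([0.40, 0.03, 0.75, 0.20, 0.001, 0.60])
```

Rejecting when the combined $p$-value is below $\alpha$ is equivalent to rejecting when the statistic exceeds the upper $\alpha$-th Cauchy quantile. A single very small $p$-value dominates the sum, which is how the combination aggregates a few strong pairwise signals. The sketch does not guard against $p$-values exactly equal to 0 or 1.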
Remark. Compared with the results in Yan et al. (2022), the proposed method can test the homogeneity of all $n$ parameters. By Lemma 1, when $r$ diverges to infinity, the first $r$ elements of $\hat\beta$ may not be asymptotically independent. However, our test procedure only considers two estimators $\hat\beta_i$ and $\hat\beta_j$ at a time, which can be treated as asymptotically independent, and then combines the information from the $n(n-1)/2$ pairwise tests.
4 Simulation
In this section, we carry out extensive simulation studies to evaluate the performance of the proposed method. All simulations were performed on a PC with a single processor of 2.3 GHz 8‐Core Intel Core i9.
4.1 The empirical distribution for statistic
In this simulation, we examine the finite sample empirical distribution of the test statistic under the null and alternative hypothesis and verify the result in Theorem 1. We set and where , and . When , all ’s are equal, which corresponds to . And, when and , there is heterogeneous between nodes, which corresponds to .
In Figures 1-3, we plot the empirical density of the statistic from 1000 data replications. When , and , the plots show that the simulation result very well matches the prediction of Theorem 1. Under the null () and the alternative ( and ), the test statistic has different mean.
![Refer to caption](x1.png)
![Refer to caption](x2.png)
![Refer to caption](x3.png)
4.2 The empirical size and power for test (1.1)
In this subsection, we investigate the empirical size and power of test (1.1); the settings are similar to those in Section 4.1. The proportions of rejection at the nominal level 0.05 are summarized in Table 1. The type I error is kept close to the nominal level. Under the alternative hypothesis, the power can be well below 1. In fact, when the difference between $\beta_i$ and $\beta_j$ is small, the distribution of the test statistic is close to the standard normal distribution, so the power may be much less than 1. When the difference between $\beta_i$ and $\beta_j$ is large, however, the empirical power is close to 1. These results are consistent with those in Section 4.1. In addition, we observe that the power of the test decreases as the sample size increases. The main reason is that, under our parameter generation scheme, the difference between the two node parameters becomes smaller as the sample size increases.
0.05 | 0.34 | 0.47 | 0.05 | 0.25 | 0.35 | |
0.05 | 0.86 | 0.96 | 0.05 | 0.73 | 0.87 | |
0.06 | 0.31 | 0.41 | 0.06 | 0.27 | 0.31 | |
0.07 | 0.95 | 0.99 | 0.05 | 0.96 | 0.99 |
4.3 The empirical size and power for test (1.2)
In this subsection, we investigate the homogeneous test for the $\beta$-model. We also set . However, we set , where has five cases: , and . It is easy to see that corresponds to the null $H_0$, and the other four cases correspond to the alternative $H_1$. For , we consider two classes of settings: (i) , and ; (ii) , where , and . The results are given in Tables 2 and 3. Under the null $H_0$, the type I errors are close to the nominal level. Under the alternative $H_1$, the empirical power is less than 1, and the proposed method is superior to the method in Yan et al. (2022) when approximates . All simulation results show that the proposed method is effective and efficient.
0.05 (0.07) | 0.03 (0.08) | 0.05 (0.05) | ||
0.97 (0.54) | 1 (0.81) | 1 (0.99) | ||
1 (0.96) | 1 (1) | 1 (1) | ||
1 (1) | 1 (1) | 1 (1) | ||
1 (1) | 1 (1) | 1 (1) | ||
0.04 (0.08) | 0.03 (0.05) | 0.04 (0.09) | ||
0.94 (0.52) | 1 (0.82) | 1 (0.99) | ||
1 (0.94) | 1 (0.99) | 1 (1) | ||
1 (1) | 1 (1) | 1 (1) | ||
1 (1) | 1 (1) | 1 (1) | ||
0.01 (0.05) | 0.03 (0.07) | 0.03 (0.05) | ||
0.98 (0.60) | 1 (0.82) | 1 (0.98) | ||
0.99 (0.96) | 1 (1) | 1 (1) | ||
1 (1) | 1 (1) | 1 (1) | ||
1 (1) | 1 (1) | 1 (1) |
0.05 (0.07) | 0.03 (0.08) | 0.05 (0.05) | ||
0.09 (0.06) | 0.73 (0.23) | 1 (1) | ||
0.17 (0.10) | 0.92 (0.67) | 1 (1) | ||
0.34 (0.35) | 1 (1) | 1 (1) | ||
0.57 (0.78) | 1 (1) | 1 (1) | ||
0.04 (0.05) | 0.04 (0.06) | 0.05 (0.07) | ||
0.46 (0.11) | 1 (0.67) | 1 (1) | ||
0.67 (0.28) | 1 (0.98) | 1 (1) | ||
0.95 (0.83) | 1 (1) | 1 (1) | ||
1 (1) | 1 (1) | 1 (1) | ||
0.04 (0.05) | 0.02 (0.06) | 0.03 (0.06) | ||
0.99 (0.33) | 1 (0.99) | 1 (1) | ||
1 (0.84) | 1 (1) | 1 (1) | ||
1 (1) | 1 (1) | 1 (1) | ||
1 (1) | 1 (1) | 1 (1) |
5 Real example analysis
In this section, we apply the proposed method to a real network dataset. The food web dataset is from Baird and Ulanowicz (1989) and is available in Blitzstein and Diaconis (2011); it contains data on 33 organisms (such as bacteria, oysters, and catfish) in the Chesapeake Bay during the summer. The degree sequence of this network is $(7, 8, 5, 1, 1, 2, 8, 10, 4, 2, 4, 5, 3, 6, 7, 3, 2, 7, 6, 1, 2, 9, 6, 1, 3, 4, 6, 3, 3, 3, 2, 4, 4)$. We observe that some nodes have identical degrees, and the heterogeneity of the network does not appear to be pronounced. To investigate the equality of node parameters, we consider nodes 4, 6, 13, 11, 12, 14, 15, 2, 22, and 8, which have degrees 1, 2, 3, 4, 5, 6, 7, 8, 9, and 10, respectively. Table 4 shows the $p$-values for test problem (1.1). The results indicate that the $p$-value decreases as the degree difference between two nodes increases, so the test tends to reject the null hypothesis for nodes with very different degrees. Finally, we consider the homogeneous test (1.2). The $p$-values obtained by the proposed method and the likelihood-ratio test are 0.698 and 0.998, respectively. Neither test rejects the null hypothesis, so the data provide no evidence against homogeneity of the network.
Table 4: $p$-values of test (1.1) for pairs of the selected nodes.

| Node | 4 | 6 | 13 | 11 | 12 | 14 | 15 | 2 | 22 | 8 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 4 | - | 0.277 | 0.156 | 0.090 | 0.053 | 0.031 | 0.019 | 0.012 | 0.007 | 0.004 |
| 6 | 0.277 | - | 0.316 | 0.189 | 0.110 | 0.063 | 0.035 | 0.019 | 0.011 | 0.006 |
| 13 | 0.156 | 0.316 | - | 0.337 | 0.213 | 0.128 | 0.074 | 0.042 | 0.023 | 0.012 |
| 11 | 0.090 | 0.189 | 0.337 | - | 0.350 | 0.230 | 0.143 | 0.085 | 0.049 | 0.027 |
| 12 | 0.053 | 0.110 | 0.213 | 0.350 | - | 0.360 | 0.243 | 0.156 | 0.095 | 0.055 |
| 14 | 0.031 | 0.063 | 0.128 | 0.230 | 0.360 | - | 0.367 | 0.254 | 0.167 | 0.104 |
| 15 | 0.019 | 0.035 | 0.074 | 0.143 | 0.243 | 0.367 | - | 0.373 | 0.263 | 0.176 |
| 2 | 0.012 | 0.019 | 0.042 | 0.085 | 0.156 | 0.254 | 0.373 | - | 0.378 | 0.271 |
| 22 | 0.007 | 0.011 | 0.023 | 0.049 | 0.095 | 0.167 | 0.263 | 0.378 | - | 0.382 |
| 8 | 0.004 | 0.006 | 0.012 | 0.027 | 0.055 | 0.104 | 0.176 | 0.271 | 0.382 | - |
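Because the degree sequence is sufficient for the $\beta$-model, the MLE and the plug-in statistic can be computed from the degrees alone. The sketch below illustrates this for the food web. The degree sequence is the one we understand to be reported in Blitzstein and Diaconis (2011); it is reproduced here as an assumption and agrees with the ten node-degree pairs listed above. The helper names, tolerance, and plug-in form of the statistic are likewise assumptions, and the sketch stops at the statistic rather than attempting to reproduce the reported $p$-values.

```python
import math

# Degree sequence assumed from Blitzstein and Diaconis (2011); it matches the
# ten node-degree pairs quoted in the text (e.g. node 4 has degree 1).
DEGREES = [7, 8, 5, 1, 1, 2, 8, 10, 4, 2, 4, 5, 3, 6, 7, 3, 2, 7, 6, 1, 2,
           9, 6, 1, 3, 4, 6, 3, 3, 3, 2, 4, 4]


def prob(b_i, b_j):
    """Fitted edge probability under the beta-model."""
    return 1.0 / (1.0 + math.exp(-(b_i + b_j)))


def fit_beta_mle(degrees, tol=1e-12, max_iter=10000):
    """Fixed-point iteration for the likelihood equations (simultaneous updates)."""
    n = len(degrees)
    beta = [0.0] * n
    for _ in range(max_iter):
        new = [math.log(degrees[i])
               - math.log(sum(1.0 / (math.exp(-beta[j]) + math.exp(beta[i]))
                              for j in range(n) if j != i))
               for i in range(n)]
        change = max(abs(a - b) for a, b in zip(new, beta))
        beta = new
        if change < tol:
            break
    return beta


def plug_in_statistic(beta_hat, i, j):
    """(beta_i - beta_j) / sqrt(1 / v_ii + 1 / v_jj) with plug-in variances."""
    def v(k):
        return sum(prob(beta_hat[k], b) * (1.0 - prob(beta_hat[k], b))
                   for m, b in enumerate(beta_hat) if m != k)
    return (beta_hat[i] - beta_hat[j]) / math.sqrt(1.0 / v(i) + 1.0 / v(j))


beta_hat = fit_beta_mle(DEGREES)
n = len(DEGREES)
# Fitted expected degrees; they should reproduce the observed degrees.
fitted = [sum(prob(beta_hat[i], beta_hat[j]) for j in range(n) if j != i)
          for i in range(n)]
# Nodes are numbered from 1 in the text, hence the index shift.
t_4_6 = plug_in_statistic(beta_hat, 4 - 1, 6 - 1)
t_4_8 = plug_in_statistic(beta_hat, 4 - 1, 8 - 1)
```

Nodes with equal degrees receive equal fitted parameters, so the statistic for such a pair is zero. The statistic grows in absolute value as the degree gap widens, in line with the pattern in Table 4.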
6 Conclusion
In this article, we have proposed a novel statistic to test the equality of the parameters of two nodes in the $\beta$-model. Based on the central limit theorem, we have proved that the limiting distribution of the proposed statistic is the standard normal distribution. Then, plugging in the MLE of the parameters, we have proved that the limiting distribution of the empirical counterpart of the test statistic is also the standard normal distribution under some mild conditions. Under the alternative hypothesis, the limiting distribution of the test statistic has also been proven to be a normal distribution with a mean different from that of the null distribution. Further, based on the method of combining $p$-values, we have investigated the homogeneous test for the $\beta$-model. Empirically, through extensive simulation studies, we have demonstrated that the size and the power of the test are valid.
It is worth noting that the proposed test method works well when the difference between the parameters of two nodes is large. However, the power will decrease when the difference between the parameters of two nodes is small. Hence, we need to consider how to improve the power of the proposed test for hypothesis test (1.1) under the case of for a small constant . Next, we can also consider extending the single sample to the multi-sample, such as for two -models with parameters and . We will continue to study this issue in future work.
7 Appendix
7.1 Proof of Theorem 1
First, we consider the case of . According to the Taylor expansion, we have, for any ,
Following the definition of , it is easy to see that
(7.1) |
Next, we consider to bound the terms . Define , then . For any ,
Notice that and the convergence rate of is between and . Hence, we have . Combining with (7.1), we have, for any ,
Thus, we have, for any ,
By Slutsky's theorem, the plug-in statistic converges in distribution to the standard normal distribution, which proves (2.3).
The proof under the alternative $H_1$ is similar to that under the null $H_0$, so we omit the details.
Acknowledgments
Hu is partially supported by the National Natural Science Foundation of China (nos. 12171187, 12371261).
References
- Baird and Ulanowicz (1989) Baird, D., and R. E. Ulanowicz (1989), The seasonal dynamics of the Chesapeake bay ecosystem, Ecological Monographs, 59(4), 329–364, doi:10.2307/1943071.
- Bickel and Sarkar (2016) Bickel, P., and P. Sarkar (2016), Hypothesis testing for automated community detection in networks, Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78(1), 253–273, doi:10.1111/rssb.12117.
- Blitzstein and Diaconis (2011) Blitzstein, J., and P. Diaconis (2011), A sequential importance sampling algorithm for generating random graphs with prescribed degrees, Internet Mathematics, 6(4), 489–522, doi:10.1080/15427951.2010.557277.
- Cammarata and Ke (2023) Cammarata, L. V., and Z. T. Ke (2023), Power enhancement and phase transitions for global testing of the mixed membership stochastic block model, Bernoulli, 29(3), 1741–1763, doi:10.3150/22-BEJ1519.
- Chatterjee et al. (2011) Chatterjee, S., P. Diaconis, and A. Sly (2011), Random graphs with a given degree sequence, The Annals of Applied Probability, 21(4), 1400–1435, doi:10.1214/10-AAP728.
- Dong et al. (2020) Dong, Z., S. Wang, and Q. Liu (2020), Spectral based hypothesis testing for community detection in complex networks, Information Sciences, 512, 1360–1371, doi:10.1016/j.ins.2019.10.056.
- Du and Tang (2023) Du, X., and M. Tang (2023), Hypothesis testing for equality of latent positions in random graphs, Bernoulli, 29(4), 3221–3254, doi:10.3150/22-BEJ1581.
- Edgington (1972) Edgington, E. S. (1972), An additive method for combining probability values from independent experiments, The Journal of Psychology, 80(2), 351–363, doi:10.1080/00223980.1972.9924813.
- Fan et al. (2022) Fan, J., Y. Fan, X. Han, and J. Lv (2022), SIMPLE: Statistical inference on membership profiles in large networks, Journal of the Royal Statistical Society Series B: Statistical Methodology, 84(2), 630–653, doi:10.1111/rssb.12505.
- Fisher (1932) Fisher, R. A. (1932), Statistical Methods for Research Workers, 4th ed., Oliver and Boyd, London.
- Fu et al. (2022) Fu, K., J. Hu, S. Keita, and H. Liu (2022), Two-sample test for stochastic block models via the largest singular value, arXiv:2211.09123.
- Fu et al. (2023) Fu, K., J. Hu, S. Keita, and H. Liu (2023), Two-sample test for stochastic block models via maximum entry-wise deviation, Statistics and Its Interface (Accepted).
- Holland and Leinhardt (1981) Holland, P. W., and S. Leinhardt (1981), An exponential family of probability distributions for directed graphs, Journal of the American Statistical Association, 76(373), 33–50, doi:10.1080/01621459.1981.10477598.
- Hu et al. (2021) Hu, J., J. Zhang, H. Qin, T. Yan, and J. Zhu (2021), Using maximum entry-wise deviation to test the goodness of fit for stochastic block models, Journal of the American Statistical Association, 116(535), 1373–1382, doi:10.1080/01621459.2020.1722676.
- Hunter et al. (2012) Hunter, D. R., S. M. Goodreau, and M. S. Handcock (2012), Goodness of fit of social network models, Journal of the American Statistical Association, 103(481), 248–258, doi:10.1198/016214507000000446.
- Jin et al. (2023) Jin, J., Z. T. Ke, S. Luo, and M. Wang (2023), Optimal estimation of the number of network communities, Journal of the American Statistical Association, 118(543), 2101–2116, doi:10.1080/01621459.2022.2035736.
- Lei (2016) Lei, J. (2016), A goodness-of-fit test for stochastic block models, The Annals of Statistics, 44(1), 401–424, doi:10.1214/15-AOS1370.
- Liu and Xie (2020) Liu, Y., and J. Xie (2020), Cauchy combination test: A powerful test with analytic $p$-value calculation under arbitrary dependency structures, Journal of the American Statistical Association, 115(529), 393–402, doi:10.1080/01621459.2018.1554485.
- Mudholkar and George (1979) Mudholkar, G. S., and E. O. George (1979), The logit statistic for combining probabilities, in Symposium on Optimizing Methods in Statistics, edited by J. Rustagi, p. 345–366, Academic Press, New York.
- Mukherjee et al. (2018) Mukherjee, R., S. Mukherjee, and S. Sen (2018), Detection thresholds for the $\beta$-model on sparse graphs, The Annals of Statistics, 46(3), 1288–1317, doi:10.1214/17-AOS1585.
- Ogawa et al. (2013) Ogawa, M., H. Hara, and A. Takemura (2013), Graver basis for an undirected graph and its application to testing the beta model of random graphs, Annals of the Institute of Statistical Mathematics, 65(1), 191–212, doi:10.1007/s10463-012-0367-8.
- Pearson (1933) Pearson, K. (1933), On a method of determining whether a sample of size n supposed to have been drawn from a parent population having a known probability integral has probably been drawn at random, Biometrika, 25(3-4), 379–410, doi:10.1093/biomet/25.3-4.379.
- Rinaldo et al. (2013) Rinaldo, A., S. Petrović, and S. E. Fienberg (2013), Maximum likelihood estimation in the $\beta$-model, The Annals of Statistics, 41(3), 1085–1110, doi:10.1214/12-AOS1078.
- Scott (2000) Scott, J. (2000), Social network analysis: A handbook, 2nd ed., SAGE, London.
- Stouffer et al. (1949) Stouffer, S. A., E. A. Suchman, L. C. Devinney, S. A. Star, and R. M. Williams (1949), The American Soldier. Adjustment During Army Life, Princeton University Press, Princeton.
- Tippett (1931) Tippett, L. H. C. (1931), The Methods of Statistics, Williams and Norgate, London.
- Yan and Xu (2013) Yan, T., and J. Xu (2013), A central limit theorem in the $\beta$-model for undirected random graphs with a diverging number of vertices, Biometrika, 100(2), 519–524, doi:10.1093/biomet/ass084.
- Yan et al. (2015) Yan, T., Y. Zhao, and H. Qin (2015), Asymptotic normality in the maximum entropy models on graphs with an increasing number of parameters, Journal of Multivariate Analysis, 133, 61–76, doi:10.1016/j.jmva.2014.08.013.
- Yan et al. (2016) Yan, T., H. Qin, and H. Wang (2016), Asymptotics in undirected random graph models parameterized by the strengths of vertices, Statistica Sinica, 26(3), 273–293, doi:10.5705/ss.2014.180.
- Yan et al. (2022) Yan, T., Y. Li, J. Xu, Y. Yang, and J. Zhu (2022), Wilks’ theorems in the $\beta$-model, arXiv:2211.10055.