-
Network reconstruction via the minimum description length principle
Authors:
Tiago P. Peixoto
Abstract:
A fundamental problem associated with the task of network reconstruction from dynamical or behavioral data consists in determining the most appropriate model complexity in a manner that prevents overfitting, and produces an inferred network with a statistically justifiable number of edges. The status quo in this context is based on $L_{1}$ regularization combined with cross-validation. However, be…
▽ More
A fundamental problem associated with the task of network reconstruction from dynamical or behavioral data consists in determining the most appropriate model complexity in a manner that prevents overfitting, and produces an inferred network with a statistically justifiable number of edges. The status quo in this context is based on $L_{1}$ regularization combined with cross-validation. However, besides its high computational cost, this commonplace approach unnecessarily ties the promotion of sparsity with weight "shrinkage". This combination forces a trade-off between the bias introduced by shrinkage and the network sparsity, which often results in substantial overfitting even after cross-validation. In this work, we propose an alternative nonparametric regularization scheme based on hierarchical Bayesian inference and weight quantization, which does not rely on weight shrinkage to promote sparsity. Our approach follows the minimum description length (MDL) principle, and uncovers the weight distribution that allows for the most compression of the data, thus avoiding overfitting without requiring cross-validation. The latter property renders our approach substantially faster to employ, as it requires a single fit to the complete data. As a result, we have a principled and efficient inference scheme that can be used with a large variety of generative models, without requiring the number of edges to be known in advance. We also demonstrate that our scheme yields systematically increased accuracy in the reconstruction of both artificial and empirical networks. We highlight the use of our method with the reconstruction of interaction networks between microbial communities from large-scale abundance samples involving in the order of $10^{4}$ to $10^{5}$ species, and demonstrate how the inferred model can be used to predict the outcome of interventions in the system.
△ Less
Submitted 7 May, 2024; v1 submitted 2 May, 2024;
originally announced May 2024.
-
Scalable network reconstruction in subquadratic time
Authors:
Tiago P. Peixoto
Abstract:
Network reconstruction consists in determining the unobserved pairwise couplings between $N$ nodes given only observational data on the resulting behavior that is conditioned on those couplings -- typically a time-series or independent samples from a graphical model. A major obstacle to the scalability of algorithms proposed for this problem is a seemingly unavoidable quadratic complexity of…
▽ More
Network reconstruction consists in determining the unobserved pairwise couplings between $N$ nodes given only observational data on the resulting behavior that is conditioned on those couplings -- typically a time-series or independent samples from a graphical model. A major obstacle to the scalability of algorithms proposed for this problem is a seemingly unavoidable quadratic complexity of $Ω(N^2)$, corresponding to the requirement of each possible pairwise coupling being contemplated at least once, despite the fact that most networks of interest are sparse, with a number of non-zero couplings that is only $O(N)$. Here we present a general algorithm applicable to a broad range of reconstruction problems that significantly outperforms this quadratic baseline. Our algorithm relies on a stochastic second neighbor search (Dong et al., 2011) that produces the best edge candidates with high probability, thus bypassing an exhaustive quadratic search. If we rely on the conjecture that the second-neighbor search finishes in log-linear time (Baron & Darling, 2020; 2022), we demonstrate theoretically that our algorithm finishes in subquadratic time, with a data-dependent complexity loosely upper bounded by $O(N^{3/2}\log N)$, but with a more typical log-linear complexity of $O(N\log^2N)$. In practice, we show that our algorithm achieves a performance that is many orders of magnitude faster than the quadratic baseline -- in a manner consistent with our theoretical analysis -- allows for easy parallelization, and thus enables the reconstruction of networks with hundreds of thousands and even millions of nodes and edges.
△ Less
Submitted 7 May, 2024; v1 submitted 2 January, 2024;
originally announced January 2024.
-
Social network modeling and applications, a tutorial
Authors:
Lisette Espín-Noboa,
Tiago Peixoto,
Fariba Karimi
Abstract:
Social networks have been widely studied over the last century from multiple disciplines to understand societal issues such as inequality in employment rates, managerial performance, and epidemic spread. Today, these and many more issues can be studied at global scale thanks to the digital footprints that we generate when browsing the Web or using social media platforms. Unfortunately, scientists…
▽ More
Social networks have been widely studied over the last century from multiple disciplines to understand societal issues such as inequality in employment rates, managerial performance, and epidemic spread. Today, these and many more issues can be studied at global scale thanks to the digital footprints that we generate when browsing the Web or using social media platforms. Unfortunately, scientists often struggle to access to such data primarily because it is proprietary, and even when it is shared with privacy guarantees, such data is either no representative or too big. In this tutorial, we will discuss recent advances and future directions in network modeling. In particular, we focus on how to exploit synthetic networks to study real-world problems such as data privacy, spreading dynamics, algorithmic bias, and ranking inequalities. We start by reviewing different types of generative models for social networks including node-attributed and scale-free networks. Then, we showcase how to perform a network selection analysis to characterize the mechanisms of edge formation of any given real-world network.
△ Less
Submitted 19 June, 2023;
originally announced June 2023.
-
Implicit models, latent compression, intrinsic biases, and cheap lunches in community detection
Authors:
Tiago P. Peixoto,
Alec Kirkley
Abstract:
The task of community detection, which aims to partition a network into clusters of nodes to summarize its large-scale structure, has spawned the development of many competing algorithms with varying objectives. Some community detection methods are inferential, explicitly deriving the clustering objective through a probabilistic generative model, while other methods are descriptive, dividing a net…
▽ More
The task of community detection, which aims to partition a network into clusters of nodes to summarize its large-scale structure, has spawned the development of many competing algorithms with varying objectives. Some community detection methods are inferential, explicitly deriving the clustering objective through a probabilistic generative model, while other methods are descriptive, dividing a network according to an objective motivated by a particular application, making it challenging to compare these methods on the same scale. Here we present a solution to this problem that associates any community detection objective, inferential or descriptive, with its corresponding implicit network generative model. This allows us to compute the description length of a network and its partition under arbitrary objectives, providing a principled measure to compare the performance of different algorithms without the need for "ground truth" labels. Our approach also gives access to instances of the community detection problem that are optimal to any given algorithm, and in this way reveals intrinsic biases in popular descriptive methods, explaining their tendency to overfit. Using our framework, we compare a number of community detection methods on artificial networks, and on a corpus of over 500 structurally diverse empirical networks. We find that more expressive community detection methods exhibit consistently superior compression performance on structured data instances, without having degraded performance on a minority of situations where more specialized algorithms perform optimally. Our results undermine the implications of the "no free lunch" theorem for community detection, both conceptually and in practice, since it is confined to unstructured data instances, unlike relevant community detection problems which are structured by requirement.
△ Less
Submitted 7 November, 2023; v1 submitted 17 October, 2022;
originally announced October 2022.
-
Ordered community detection in directed networks
Authors:
Tiago P. Peixoto
Abstract:
We develop a method to infer community structure in directed networks where the groups are ordered in a latent one-dimensional hierarchy that determines the preferred edge direction. Our nonparametric Bayesian approach is based on a modification of the stochastic block model (SBM), which can take advantage of rank alignment and coherence to produce parsimonious descriptions of networks that combin…
▽ More
We develop a method to infer community structure in directed networks where the groups are ordered in a latent one-dimensional hierarchy that determines the preferred edge direction. Our nonparametric Bayesian approach is based on a modification of the stochastic block model (SBM), which can take advantage of rank alignment and coherence to produce parsimonious descriptions of networks that combine ordered hierarchies with arbitrary mixing patterns between groups. Since our model also includes directed degree correction, we can use it to distinguish non-local hierarchical structure from local in- and out-degree imbalance -- thus removing a source of conflation present in most ranking methods. We also demonstrate how we can reliably compare with the results obtained with the unordered SBM variant to determine whether a hierarchical ordering is statistically warranted in the first place. We illustrate the application of our method on a wide variety of empirical networks across several domains.
△ Less
Submitted 31 August, 2022; v1 submitted 30 March, 2022;
originally announced March 2022.
-
Descriptive vs. inferential community detection in networks: pitfalls, myths, and half-truths
Authors:
Tiago P. Peixoto
Abstract:
Community detection is one of the most important methodological fields of network science, and one which has attracted a significant amount of attention over the past decades. This area deals with the automated division of a network into fundamental building blocks, with the objective of providing a summary of its large-scale structure. Despite its importance and widespread adoption, there is a no…
▽ More
Community detection is one of the most important methodological fields of network science, and one which has attracted a significant amount of attention over the past decades. This area deals with the automated division of a network into fundamental building blocks, with the objective of providing a summary of its large-scale structure. Despite its importance and widespread adoption, there is a noticeable gap between what is arguably the state-of-the-art and the methods that are actually used in practice in a variety of fields. Here we attempt to address this discrepancy by dividing existing methods according to whether they have a "descriptive" or an "inferential" goal. While descriptive methods find patterns in networks based on context-dependent notions of community structure, inferential methods articulate generative models, and attempt to fit them to data. In this way, they are able to provide insights into the mechanisms of network formation, and separate structure from randomness in a manner supported by statistical evidence. We review how employing descriptive methods with inferential aims is riddled with pitfalls and misleading answers, and thus should be in general avoided. We argue that inferential methods are more typically aligned with clearer scientific questions, yield more robust results, and should be in many cases preferred. We attempt to dispel some myths and half-truths often believed when community detection is employed in practice, in an effort to improve both the use of such methods as well as the interpretation of their results.
△ Less
Submitted 6 July, 2023; v1 submitted 30 November, 2021;
originally announced December 2021.
-
The physics of higher-order interactions in complex systems
Authors:
Federico Battiston,
Enrico Amico,
Alain Barrat,
Ginestra Bianconi,
Guilherme Ferraz de Arruda,
Benedetta Franceschiello,
Iacopo Iacopini,
Sonia Kéfi,
Vito Latora,
Yamir Moreno,
Micah M. Murray,
Tiago P. Peixoto,
Francesco Vaccarino,
Giovanni Petri
Abstract:
Complex networks have become the main paradigm for modelling the dynamics of interacting systems. However, networks are intrinsically limited to describing pairwise interactions, whereas real-world systems are often characterized by higher-order interactions involving groups of three or more units. Higher-order structures, such as hypergraphs and simplicial complexes, are therefore a better tool t…
▽ More
Complex networks have become the main paradigm for modelling the dynamics of interacting systems. However, networks are intrinsically limited to describing pairwise interactions, whereas real-world systems are often characterized by higher-order interactions involving groups of three or more units. Higher-order structures, such as hypergraphs and simplicial complexes, are therefore a better tool to map the real organization of many social, biological and man-made systems. Here, we highlight recent evidence of collective behaviours induced by higher-order interactions, and we outline three key challenges for the physics of higher-order systems.
△ Less
Submitted 12 October, 2021;
originally announced October 2021.
-
Multilayer Networks for Text Analysis with Multiple Data Types
Authors:
Charles C. Hyland,
Yuanming Tao,
Lamiae Azizi,
Martin Gerlach,
Tiago P. Peixoto,
Eduardo G. Altmann
Abstract:
We are interested in the widespread problem of clustering documents and finding topics in large collections of written documents in the presence of metadata and hyperlinks. To tackle the challenge of accounting for these different types of datasets, we propose a novel framework based on Multilayer Networks and Stochastic Block Models. The main innovation of our approach over other techniques is th…
▽ More
We are interested in the widespread problem of clustering documents and finding topics in large collections of written documents in the presence of metadata and hyperlinks. To tackle the challenge of accounting for these different types of datasets, we propose a novel framework based on Multilayer Networks and Stochastic Block Models. The main innovation of our approach over other techniques is that it applies the same non-parametric probabilistic framework to the different sources of datasets simultaneously. The key difference to other multilayer complex networks is the strong unbalance between the layers, with the average degree of different node types scaling differently with system size. We show that the latter observation is due to generic properties of text, such as Heaps' law, and strongly affects the inference of communities. We present and discuss the performance of our method in different datasets (hundreds of Wikipedia documents, thousands of scientific papers, and thousands of E-mails) showing that taking into account multiple types of information provides a more nuanced view on topic- and document-clusters and increases the ability to predict missing links.
△ Less
Submitted 30 June, 2021;
originally announced June 2021.
-
Disentangling homophily, community structure and triadic closure in networks
Authors:
Tiago P. Peixoto
Abstract:
Network homophily, the tendency of similar nodes to be connected, and transitivity, the tendency of two nodes being connected if they share a common neighbor, are conflated properties in network analysis, since one mechanism can drive the other. Here we present a generative model and corresponding inference procedure that are capable of distinguishing between both mechanisms. Our approach is based…
▽ More
Network homophily, the tendency of similar nodes to be connected, and transitivity, the tendency of two nodes being connected if they share a common neighbor, are conflated properties in network analysis, since one mechanism can drive the other. Here we present a generative model and corresponding inference procedure that are capable of distinguishing between both mechanisms. Our approach is based on a variation of the stochastic block model (SBM) with the addition of triadic closure edges, and its inference can identify the most plausible mechanism responsible for the existence of every edge in the network, in addition to the underlying community structure itself. We show how the method can evade the detection of spurious communities caused solely by the formation of triangles in the network, and how it can improve the performance of edge prediction when compared to the pure version of the SBM without triadic closure.
△ Less
Submitted 6 January, 2022; v1 submitted 7 January, 2021;
originally announced January 2021.
-
Hypergraph reconstruction from network data
Authors:
Jean-Gabriel Young,
Giovanni Petri,
Tiago P. Peixoto
Abstract:
Networks can describe the structure of a wide variety of complex systems by specifying which pairs of entities in the system are connected. While such pairwise representations are flexible, they are not necessarily appropriate when the fundamental interactions involve more than two entities at the same time. Pairwise representations nonetheless remain ubiquitous, because higher-order interactions…
▽ More
Networks can describe the structure of a wide variety of complex systems by specifying which pairs of entities in the system are connected. While such pairwise representations are flexible, they are not necessarily appropriate when the fundamental interactions involve more than two entities at the same time. Pairwise representations nonetheless remain ubiquitous, because higher-order interactions are often not recorded explicitly in network data. Here, we introduce a Bayesian approach to reconstruct latent higher-order interactions from ordinary pairwise network data. Our method is based on the principle of parsimony and only includes higher-order structures when there is sufficient statistical evidence for them. We demonstrate its applicability to a wide range of datasets, both synthetic and empirical.
△ Less
Submitted 13 January, 2022; v1 submitted 11 August, 2020;
originally announced August 2020.
-
Statistical inference of assortative community structures
Authors:
Lizhi Zhang,
Tiago P. Peixoto
Abstract:
We develop a principled methodology to infer assortative communities in networks based on a nonparametric Bayesian formulation of the planted partition model. We show that this approach succeeds in finding statistically significant assortative modules in networks, unlike alternatives such as modularity maximization, which systematically overfits both in artificial as well as in empirical examples.…
▽ More
We develop a principled methodology to infer assortative communities in networks based on a nonparametric Bayesian formulation of the planted partition model. We show that this approach succeeds in finding statistically significant assortative modules in networks, unlike alternatives such as modularity maximization, which systematically overfits both in artificial as well as in empirical examples. In addition, we show that our method is not subject to a resolution limit, and can uncover an arbitrarily large number of communities, as long as there is statistical evidence for them. Our formulation is amenable to model selection procedures, which allow us to compare it to more general approaches based on the stochastic block model, and in this way reveal whether assortativity is in fact the dominating large-scale mixing pattern. We perform this comparison with several empirical networks, and identify numerous cases where the network's assortativity is exaggerated by traditional community detection methods, and we show how a more faithful degree of assortativity can be identified.
△ Less
Submitted 29 June, 2020; v1 submitted 25 June, 2020;
originally announced June 2020.
-
Revealing consensus and dissensus between network partitions
Authors:
Tiago P. Peixoto
Abstract:
Community detection methods attempt to divide a network into groups of nodes that share similar properties, thus revealing its large-scale structure. A major challenge when employing such methods is that they are often degenerate, typically yielding a complex landscape of competing answers. As an attempt to extract understanding from a population of alternative solutions, many methods exist to est…
▽ More
Community detection methods attempt to divide a network into groups of nodes that share similar properties, thus revealing its large-scale structure. A major challenge when employing such methods is that they are often degenerate, typically yielding a complex landscape of competing answers. As an attempt to extract understanding from a population of alternative solutions, many methods exist to establish a consensus among them in the form of a single partition "point estimate" that summarizes the whole distribution. Here we show that it is in general not possible to obtain a consistent answer from such point estimates when the underlying distribution is too heterogeneous. As an alternative, we provide a comprehensive set of methods designed to characterize and summarize complex populations of partitions in a manner that captures not only the existing consensus, but also the dissensus between elements of the population. Our approach is able to model mixed populations of partitions where multiple consensuses can coexist, representing different competing hypotheses for the network structure. We also show how our methods can be used to compare pairs of partitions, how they can be generalized to hierarchical divisions, and be used to perform statistical model selection between competing hypotheses.
△ Less
Submitted 21 April, 2021; v1 submitted 28 May, 2020;
originally announced May 2020.
-
Merge-split Markov chain Monte Carlo for community detection
Authors:
Tiago P. Peixoto
Abstract:
We present a Markov chain Monte Carlo scheme based on merges and splits of groups that is capable of efficiently sampling from the posterior distribution of network partitions, defined according to the stochastic block model (SBM). We demonstrate how schemes based on the move of single nodes between groups systematically fail at correctly sampling from the posterior distribution even on small netw…
▽ More
We present a Markov chain Monte Carlo scheme based on merges and splits of groups that is capable of efficiently sampling from the posterior distribution of network partitions, defined according to the stochastic block model (SBM). We demonstrate how schemes based on the move of single nodes between groups systematically fail at correctly sampling from the posterior distribution even on small networks, and how our merge-split approach behaves significantly better, and improves the mixing time of the Markov chain by several orders of magnitude in typical cases. We also show how the scheme can be straightforwardly extended to nested versions of the SBM, yielding asymptotically exact samples of hierarchical network partitions.
△ Less
Submitted 13 July, 2020; v1 submitted 16 March, 2020;
originally announced March 2020.
-
Latent Poisson models for networks with heterogeneous density
Authors:
Tiago P. Peixoto
Abstract:
Empirical networks are often globally sparse, with a small average number of connections per node, when compared to the total size of the network. However, this sparsity tends not to be homogeneous, and networks can also be locally dense, for example with a few nodes connecting to a large fraction of the rest of the network, or with small groups of nodes with a large probability of connections bet…
▽ More
Empirical networks are often globally sparse, with a small average number of connections per node, when compared to the total size of the network. However, this sparsity tends not to be homogeneous, and networks can also be locally dense, for example with a few nodes connecting to a large fraction of the rest of the network, or with small groups of nodes with a large probability of connections between them. Here we show how latent Poisson models which generate hidden multigraphs can be effective at capturing this density heterogeneity, while being more tractable mathematically than some of the alternatives that model simple graphs directly. We show how these latent multigraphs can be reconstructed from data on simple graphs, and how this allows us to disentangle disassortative degree-degree correlations from the constraints of imposed degree sequences, and to improve the identification of community structure in empirically relevant scenarios.
△ Less
Submitted 17 July, 2020; v1 submitted 18 February, 2020;
originally announced February 2020.
-
Limits and trade-offs of topological network robustness
Authors:
Christopher Priester,
Sebastian Schmitt,
Tiago P. Peixoto
Abstract:
We investigate the trade-off between the robustness against random and targeted removal of nodes from a network. To this end we utilize the stochastic block model to study ensembles of infinitely large networks with arbitrary large-scale structures. We present results from numerical two-objective optimization simulations for networks with various fixed mean degree and number of blocks. The results…
▽ More
We investigate the trade-off between the robustness against random and targeted removal of nodes from a network. To this end we utilize the stochastic block model to study ensembles of infinitely large networks with arbitrary large-scale structures. We present results from numerical two-objective optimization simulations for networks with various fixed mean degree and number of blocks. The results provide strong evidence that three different blocks are sufficient to realize the best trade-off between the two measures of robustness, i.e.\ to obtain the complete front of Pareto-optimal networks. For all values of the mean degree, a characteristic three block structure emerges over large parts of the Pareto-optimal front. This structure can be often characterized as a core-periphery structure, composed of a group of core nodes with high degree connected among themselves and to a periphery of low-degree nodes, in addition to a third group of nodes which is disconnected from the periphery, and weakly connected to the core. Only at both extremes of the Pareto-optimal front, corresponding to maximal robustness against random and targeted node removal, a two-block core-periphery structure or a one-block fully random network are found, respectively.
△ Less
Submitted 2 October, 2019;
originally announced October 2019.
-
Network reconstruction and community detection from dynamics
Authors:
Tiago P. Peixoto
Abstract:
We present a scalable nonparametric Bayesian method to perform network reconstruction from observed functional behavior that at the same time infers the communities present in the network. We show that the joint reconstruction with community detection has a synergistic effect, where the edge correlations used to inform the existence of communities are also inherently used to improve the accuracy o…
▽ More
We present a scalable nonparametric Bayesian method to perform network reconstruction from observed functional behavior that at the same time infers the communities present in the network. We show that the joint reconstruction with community detection has a synergistic effect, where the edge correlations used to inform the existence of communities are also inherently used to improve the accuracy of the reconstruction which, in turn, can better inform the uncovering of communities. We illustrate the use of our method with observations arising from epidemic models and the Ising model, both on synthetic and empirical networks, as well as on data containing only functional information.
△ Less
Submitted 20 September, 2019; v1 submitted 26 March, 2019;
originally announced March 2019.
-
Reconstructing networks with unknown and heterogeneous errors
Authors:
Tiago P. Peixoto
Abstract:
The vast majority of network datasets contains errors and omissions, although this is rarely incorporated in traditional network analysis. Recently, an increasing effort has been made to fill this methodological gap by develo** network reconstruction approaches based on Bayesian inference. These approaches, however, rely on assumptions of uniform error rates and on direct estimations of the exis…
▽ More
The vast majority of network datasets contains errors and omissions, although this is rarely incorporated in traditional network analysis. Recently, an increasing effort has been made to fill this methodological gap by develo** network reconstruction approaches based on Bayesian inference. These approaches, however, rely on assumptions of uniform error rates and on direct estimations of the existence of each edge via repeated measurements, something that is currently unavailable for the majority of network data. Here we develop a Bayesian reconstruction approach that lifts these limitations by not only allowing for heterogeneous errors, but also for single edge measurements without direct error estimates. Our approach works by coupling the inference approach with structured generative network models, which enable the correlations between edges to be used as reliable uncertainty estimates. Although our approach is general, we focus on the stochastic block model as the basic generative process, from which efficient nonparametric inference can be performed, and yields a principled method to infer hierarchical community structure from noisy data. We demonstrate the efficacy of our approach with a variety of empirical and artificial networks.
△ Less
Submitted 18 October, 2018; v1 submitted 9 June, 2018;
originally announced June 2018.
-
Change points, memory and epidemic spreading in temporal networks
Authors:
Tiago P. Peixoto,
Laetitia Gauvin
Abstract:
Dynamic networks exhibit temporal patterns that vary across different time scales, all of which can potentially affect processes that take place on the network. However, most data-driven approaches used to model time-varying networks attempt to capture only a single characteristic time scale in isolation --- typically associated with the short-time memory of a Markov chain or with long-time abrupt…
▽ More
Dynamic networks exhibit temporal patterns that vary across different time scales, all of which can potentially affect processes that take place on the network. However, most data-driven approaches used to model time-varying networks attempt to capture only a single characteristic time scale in isolation --- typically associated with the short-time memory of a Markov chain or with long-time abrupt changes caused by external or systemic events. Here we propose a unified approach to model both aspects simultaneously, detecting short and long-time behaviors of temporal networks. We do so by develo** an arbitrary-order mixed Markov model with change points, and using a nonparametric Bayesian formulation that allows the Markov order and the position of change points to be determined from data without overfitting. In addition, we evaluate the quality of the multiscale model in its capacity to reproduce the spreading of epidemics on the temporal network, and we show that describing multiple time scales simultaneously has a synergistic effect, where statistically significant features are uncovered that otherwise would remain hidden by treating each time scale independently.
△ Less
Submitted 24 December, 2017;
originally announced December 2017.
-
A network approach to topic models
Authors:
Martin Gerlach,
Tiago P. Peixoto,
Eduardo G. Altmann
Abstract:
One of the main computational and scientific challenges in the modern age is to extract useful information from unstructured texts. Topic models are one popular machine-learning approach which infers the latent topical structure of a collection of documents. Despite their success --- in particular of its most widely used variant called Latent Dirichlet Allocation (LDA) --- and numerous application…
▽ More
One of the main computational and scientific challenges in the modern age is to extract useful information from unstructured texts. Topic models are one popular machine-learning approach which infers the latent topical structure of a collection of documents. Despite their success --- in particular of its most widely used variant called Latent Dirichlet Allocation (LDA) --- and numerous applications in sociology, history, and linguistics, topic models are known to suffer from severe conceptual and practical problems, e.g. a lack of justification for the Bayesian priors, discrepancies with statistical properties of real texts, and the inability to properly choose the number of topics. Here we obtain a fresh view on the problem of identifying topical structures by relating it to the problem of finding communities in complex networks. This is achieved by representing text corpora as bipartite networks of documents and words. By adapting existing community-detection methods -- using a stochastic block model (SBM) with non-parametric priors -- we obtain a more versatile and principled framework for topic modeling (e.g., it automatically detects the number of topics and hierarchically clusters both the words and documents). The analysis of artificial and real corpora demonstrates that our SBM approach leads to better topic models than LDA in terms of statistical model selection. More importantly, our work shows how to formally relate methods from community detection and topic modeling, opening the possibility of cross-fertilization between these two fields.
△ Less
Submitted 19 July, 2018; v1 submitted 4 August, 2017;
originally announced August 2017.
-
Network structure, metadata and the prediction of missing nodes and annotations
Authors:
Darko Hric,
Tiago P. Peixoto,
Santo Fortunato
Abstract:
The empirical validation of community detection methods is often based on available annotations on the nodes that serve as putative indicators of the large-scale network structure. Most often, the suitability of the annotations as topological descriptors itself is not assessed, and without this it is not possible to ultimately distinguish between actual shortcomings of the community detection algo…
▽ More
The empirical validation of community detection methods is often based on available annotations on the nodes that serve as putative indicators of the large-scale network structure. Most often, the suitability of the annotations as topological descriptors itself is not assessed, and without this it is not possible to ultimately distinguish between actual shortcomings of the community detection algorithms on one hand, and the incompleteness, inaccuracy or structured nature of the data annotations themselves on the other. In this work we present a principled method to access both aspects simultaneously. We construct a joint generative model for the data and metadata, and a nonparametric Bayesian framework to infer its parameters from annotated datasets. We assess the quality of the metadata not according to its direct alignment with the network communities, but rather in its capacity to predict the placement of edges in the network. We also show how this feature can be used to predict the connections to missing nodes when only the metadata is available, as well as missing metadata. By investigating a wide range of datasets, we show that while there are seldom exact agreements between metadata tokens and the inferred data groups, the metadata is often informative of the network structure nevertheless, and can improve the prediction of missing nodes. This shows that the method uncovers meaningful patterns in both the data and metadata, without requiring or expecting a perfect agreement between the two.
△ Less
Submitted 29 September, 2016; v1 submitted 1 April, 2016;
originally announced April 2016.
-
Modeling sequences and temporal networks with dynamic community structures
Authors:
Tiago P. Peixoto,
Martin Rosvall
Abstract:
In evolving complex systems such as air traffic and social organizations, collective effects emerge from their many components' dynamic interactions. While the dynamic interactions can be represented by temporal networks with nodes and links that change over time, they remain highly complex. It is therefore often necessary to use methods that extract the temporal networks' large-scale dynamic comm…
▽ More
In evolving complex systems such as air traffic and social organizations, collective effects emerge from their many components' dynamic interactions. While the dynamic interactions can be represented by temporal networks with nodes and links that change over time, they remain highly complex. It is therefore often necessary to use methods that extract the temporal networks' large-scale dynamic community structure. However, such methods are subject to overfitting or suffer from effects of arbitrary, a priori imposed timescales, which should instead be extracted from data. Here we simultaneously address both problems and develop a principled data-driven method that determines relevant timescales and identifies patterns of dynamics that take place on networks as well as shape the networks themselves. We base our method on an arbitrary-order Markov chain model with community structure, and develop a nonparametric Bayesian inference framework that identifies the simplest such model that can explain temporal interaction data.
△ Less
Submitted 20 September, 2017; v1 submitted 15 September, 2015;
originally announced September 2015.
-
Sampling motif-constrained ensembles of networks
Authors:
Rico Fischer,
Jorge C. Leitao,
Tiago P. Peixoto,
Eduardo G. Altmann
Abstract:
The statistical significance of network properties is conditioned on null models which satisfy spec- ified properties but that are otherwise random. Exponential random graph models are a principled theoretical framework to generate such constrained ensembles, but which often fail in practice, either due to model inconsistency, or due to the impossibility to sample networks from them. These problem…
▽ More
The statistical significance of network properties is conditioned on null models which satisfy spec- ified properties but that are otherwise random. Exponential random graph models are a principled theoretical framework to generate such constrained ensembles, but which often fail in practice, either due to model inconsistency, or due to the impossibility to sample networks from them. These problems affect the important case of networks with prescribed clustering coefficient or number of small connected subgraphs (motifs). In this paper we use the Wang-Landau method to obtain a multicanonical sampling that overcomes both these problems. We sample, in polynomial time, net- works with arbitrary degree sequences from ensembles with imposed motifs counts. Applying this method to social networks, we investigate the relation between transitivity and homophily, and we quantify the correlation between different types of motifs, finding that single motifs can explain up to 60% of the variation of motif profiles.
△ Less
Submitted 17 November, 2015; v1 submitted 30 July, 2015;
originally announced July 2015.
-
Generalized communities in networks
Authors:
M. E. J. Newman,
Tiago P. Peixoto
Abstract:
A substantial volume of research has been devoted to studies of community structure in networks, but communities are not the only possible form of large-scale network structure. Here we describe a broad extension of community structure that encompasses traditional communities but includes a wide range of generalized structural patterns as well. We describe a principled method for detecting this ge…
▽ More
A substantial volume of research has been devoted to studies of community structure in networks, but communities are not the only possible form of large-scale network structure. Here we describe a broad extension of community structure that encompasses traditional communities but includes a wide range of generalized structural patterns as well. We describe a principled method for detecting this generalized structure in empirical network data and demonstrate with real-world examples how it can be used to learn new things about the shape and meaning of networks.
△ Less
Submitted 27 May, 2015;
originally announced May 2015.
-
Model selection and hypothesis testing for large-scale network models with overlap** groups
Authors:
Tiago P. Peixoto
Abstract:
The effort to understand network systems in increasing detail has resulted in a diversity of methods designed to extract their large-scale structure from data. Unfortunately, many of these methods yield diverging descriptions of the same network, making both the comparison and understanding of their results a difficult challenge. A possible solution to this outstanding issue is to shift the focus…
▽ More
The effort to understand network systems in increasing detail has resulted in a diversity of methods designed to extract their large-scale structure from data. Unfortunately, many of these methods yield diverging descriptions of the same network, making both the comparison and understanding of their results a difficult challenge. A possible solution to this outstanding issue is to shift the focus away from ad hoc methods and move towards more principled approaches based on statistical inference of generative models. As a result, we face instead the more well-defined task of selecting between competing generative processes, which can be done under a unified probabilistic framework. Here, we consider the comparison between a variety of generative models including features such as degree correction, where nodes with arbitrary degrees can belong to the same group, and community overlap, where nodes are allowed to belong to more than one group. Because such model variants possess an increasing number of parameters, they become prone to overfitting. In this work, we present a method of model selection based on the minimum description length criterion and posterior odds ratios that is capable of fully accounting for the increased degrees of freedom of the larger models, and selects the best one according to the statistical evidence available in the data. In applying this method to many empirical unweighted networks from different fields, we observe that community overlap is very often not supported by statistical evidence and is selected as a better model only for a minority of them. On the other hand, we find that degree correction tends to be almost universally favored by the available data, implying that intrinsic node proprieties (as opposed to group properties) are often an essential ingredient of network formation.
△ Less
Submitted 26 March, 2015; v1 submitted 10 September, 2014;
originally announced September 2014.
-
Efficient Monte Carlo and greedy heuristic for the inference of stochastic block models
Authors:
Tiago P. Peixoto
Abstract:
We present an efficient algorithm for the inference of stochastic block models in large networks. The algorithm can be used as an optimized Markov chain Monte Carlo (MCMC) method, with a fast mixing time and a much reduced susceptibility to getting trapped in metastable states, or as a greedy agglomerative heuristic, with an almost linear $O(N\ln^2N)$ complexity, where $N$ is the number of nodes i…
▽ More
We present an efficient algorithm for the inference of stochastic block models in large networks. The algorithm can be used as an optimized Markov chain Monte Carlo (MCMC) method, with a fast mixing time and a much reduced susceptibility to getting trapped in metastable states, or as a greedy agglomerative heuristic, with an almost linear $O(N\ln^2N)$ complexity, where $N$ is the number of nodes in the network, independent on the number of blocks being inferred. We show that the heuristic is capable of delivering results which are indistinguishable from the more exact and numerically expensive MCMC method in many artificial and empirical networks, despite being much faster. The method is entirely unbiased towards any specific mixing pattern, and in particular it does not favor assortative community structures.
△ Less
Submitted 13 January, 2014; v1 submitted 16 October, 2013;
originally announced October 2013.
-
Hierarchical Block Structures and High-resolution Model Selection in Large Networks
Authors:
Tiago P. Peixoto
Abstract:
Discovering and characterizing the large-scale topological features in empirical networks are crucial steps in understanding how complex systems function. However, most existing methods used to obtain the modular structure of networks suffer from serious problems, such as being oblivious to the statistical evidence supporting the discovered patterns, which results in the inability to separate actu…
▽ More
Discovering and characterizing the large-scale topological features in empirical networks are crucial steps in understanding how complex systems function. However, most existing methods used to obtain the modular structure of networks suffer from serious problems, such as being oblivious to the statistical evidence supporting the discovered patterns, which results in the inability to separate actual structure from noise. In addition to this, one also observes a resolution limit on the size of communities, where smaller but well-defined clusters are not detectable when the network becomes large. This phenomenon occurs not only for the very popular approach of modularity optimization, which lacks built-in statistical validation, but also for more principled methods based on statistical inference and model selection, which do incorporate statistical validation in a formally correct way. Here we construct a nested generative model that, through a complete description of the entire network hierarchy at multiple scales, is capable of avoiding this limitation, and enables the detection of modular structure at levels far beyond those possible with current approaches. Even with this increased resolution, the method is based on the principle of parsimony, and is capable of separating signal from noise, and thus will not lead to the identification of spurious modules even on sparse networks. Furthermore, it fully generalizes other approaches in that it is not restricted to purely assortative mixing patterns, directed or undirected graphs, and ad hoc hierarchical structures such as binary trees. Despite its general character, the approach is tractable, and can be combined with advanced techniques of community detection to yield an efficient algorithm that scales well for very large networks.
△ Less
Submitted 25 March, 2014; v1 submitted 16 October, 2013;
originally announced October 2013.
-
Spontaneous centralization of control in a network of company ownerships
Authors:
Sebastian M. Krause,
Tiago P. Peixoto,
Stefan Bornholdt
Abstract:
We introduce a model for the adaptive evolution of a network of company ownerships. In a recent work it has been shown that the empirical global network of corporate control is marked by a central, tightly connected "core" made of a small number of large companies which control a significant part of the global economy. Here we show how a simple, adaptive "rich get richer" dynamics can account for…
▽ More
We introduce a model for the adaptive evolution of a network of company ownerships. In a recent work it has been shown that the empirical global network of corporate control is marked by a central, tightly connected "core" made of a small number of large companies which control a significant part of the global economy. Here we show how a simple, adaptive "rich get richer" dynamics can account for this characteristic, which incorporates the increased buying power of more influential companies, and in turn results in even higher control. We conclude that this kind of centralized structure can emerge without it being an explicit goal of these companies, or as a result of a well-organized strategy.
△ Less
Submitted 14 June, 2013;
originally announced June 2013.
-
Evolution of robust network topologies: Emergence of central backbones
Authors:
Tiago P. Peixoto,
Stefan Bornholdt
Abstract:
We model the robustness against random failure or intentional attack of networks with arbitrary large-scale structure. We construct a block-based model which incorporates --- in a general fashion --- both connectivity and interdependence links, as well as arbitrary degree distributions and block correlations. By optimizing the percolation properties of this general class of networks, we identify a…
▽ More
We model the robustness against random failure or intentional attack of networks with arbitrary large-scale structure. We construct a block-based model which incorporates --- in a general fashion --- both connectivity and interdependence links, as well as arbitrary degree distributions and block correlations. By optimizing the percolation properties of this general class of networks, we identify a simple core-periphery structure as the topology most robust against random failure. In such networks, a distinct and small "core" of nodes with higher degree is responsible for most of the connectivity, functioning as a central "backbone" of the system. This centralized topology remains the optimal structure when other constraints are imposed, such as a given fraction of interdependence links and fixed degree distributions. This distinguishes simple centralized topologies as the most likely to emerge, when robustness against failure is the dominant evolutionary force.
△ Less
Submitted 24 September, 2012; v1 submitted 13 May, 2012;
originally announced May 2012.
-
No need for conspiracy: Self-organized cartel formation in a modified trust game
Authors:
Tiago P. Peixoto,
Stefan Bornholdt
Abstract:
We investigate the dynamics of a trust game on a mixed population where individuals with the role of buyers are forced to play against a predetermined number of sellers, whom they choose dynamically. Agents with the role of sellers are also allowed to adapt the level of value for money of their products, based on payoff. The dynamics undergoes a transition at a specific value of the strategy updat…
▽ More
We investigate the dynamics of a trust game on a mixed population where individuals with the role of buyers are forced to play against a predetermined number of sellers, whom they choose dynamically. Agents with the role of sellers are also allowed to adapt the level of value for money of their products, based on payoff. The dynamics undergoes a transition at a specific value of the strategy update rate, above which an emergent cartel organization is observed, where sellers have similar values of below optimal value for money. This cartel organization is not due to an explicit collusion among agents; instead it arises spontaneously from the maximization of the individual payoffs. This dynamics is marked by large fluctuations and a high degree of unpredictability for most of the parameter space, and serves as a plausible qualitative explanation for observed elevated levels and fluctuations of certain commodity prices.
△ Less
Submitted 10 January, 2013; v1 submitted 18 January, 2012;
originally announced January 2012.
-
Entropy of stochastic blockmodel ensembles
Authors:
Tiago P. Peixoto
Abstract:
Stochastic blockmodels are generative network models where the vertices are separated into discrete groups, and the probability of an edge existing between two vertices is determined solely by their group membership. In this paper, we derive expressions for the entropy of stochastic blockmodel ensembles. We consider several ensemble variants, including the traditional model as well as the newly in…
▽ More
Stochastic blockmodels are generative network models where the vertices are separated into discrete groups, and the probability of an edge existing between two vertices is determined solely by their group membership. In this paper, we derive expressions for the entropy of stochastic blockmodel ensembles. We consider several ensemble variants, including the traditional model as well as the newly introduced degree-corrected version [Karrer et al. Phys. Rev. E 83, 016107 (2011)], which imposes a degree sequence on the vertices, in addition to the block structure. The imposed degree sequence is implemented both as "soft" constraints, where only the expected degrees are imposed, and as "hard" constraints, where they are required to be the same on all samples of the ensemble. We also consider generalizations to multigraphs and directed graphs. We illustrate one of many applications of this measure by directly deriving a log-likelihood function from the entropy expression, and using it to infer latent block structure in observed data. Due to the general nature of the ensembles considered, the method works well for ensembles with intrinsic degree correlations (i.e. with entropic origin) as well as extrinsic degree correlations, which go beyond the block structure.
△ Less
Submitted 11 November, 2013; v1 submitted 27 December, 2011;
originally announced December 2011.
-
Trust transitivity in social networks
Authors:
Oliver Richters,
Tiago P. Peixoto
Abstract:
Non-centralized recommendation-based decision making is a central feature of several social and technological processes, such as market dynamics, peer-to-peer file-sharing and the web of trust of digital certification. We investigate the properties of trust propagation on networks, based on a simple metric of trust transitivity. We investigate analytically the percolation properties of trust trans…
▽ More
Non-centralized recommendation-based decision making is a central feature of several social and technological processes, such as market dynamics, peer-to-peer file-sharing and the web of trust of digital certification. We investigate the properties of trust propagation on networks, based on a simple metric of trust transitivity. We investigate analytically the percolation properties of trust transitivity in random networks with arbitrary degree distribution, and compare with numerical realizations. We find that the existence of a non-zero fraction of absolute trust (i.e. entirely confident trust) is a requirement for the viability of global trust propagation in large systems: The average pair-wise trust is marked by a discontinuous transition at a specific fraction of absolute trust, below which it vanishes. Furthermore, we perform an extensive analysis of the Pretty Good Privacy (PGP) web of trust, in view of the concepts introduced. We compare different scenarios of trust distribution: community- and authority-centered. We find that these scenarios lead to sharply different patterns of trust propagation, due to the segregation of authority hubs and densely-connected communities. While the authority-centered scenario is more efficient, and leads to higher average trust values, it favours weakly-connected "fringe" nodes, which are directly trusted by authorities. The community-centered scheme, on the other hand, favours nodes with intermediate degrees, in detriment of the authorities and its "fringe" peers.
△ Less
Submitted 10 December, 2010; v1 submitted 6 December, 2010;
originally announced December 2010.