-
The Deep Latent Position Topic Model for Clustering and Representation of Networks with Textual Edges
Authors:
Rémi Boutin,
Pierre Latouche,
Charles Bouveyron
Abstract:
Numerical interactions leading to users sharing textual content published by others are naturally represented by a network where the individuals are associated with the nodes and the exchanged texts with the edges. To understand those heterogeneous and complex data structures, clustering nodes into homogeneous groups as well as rendering a comprehensible visualisation of the data is mandatory. To…
▽ More
Numerical interactions leading to users sharing textual content published by others are naturally represented by a network where the individuals are associated with the nodes and the exchanged texts with the edges. To understand those heterogeneous and complex data structures, clustering nodes into homogeneous groups as well as rendering a comprehensible visualisation of the data is mandatory. To address both issues, we introduce Deep-LPTM, a model-based clustering strategy relying on a variational graph auto-encoder approach as well as a probabilistic model to characterise the topics of discussion. Deep-LPTM allows to build a joint representation of the nodes and of the edges in two embeddings spaces. The parameters are inferred using a variational inference algorithm. We also introduce IC2L, a model selection criterion specifically designed to choose models with relevant clustering and visualisation properties. An extensive benchmark study on synthetic data is provided. In particular, we find that Deep-LPTM better recovers the partitions of the nodes than the state-of-the art ETSBM and STBM. Eventually, the emails of the Enron company are analysed and visualisations of the results are presented, with meaningful highlights of the graph structure.
△ Less
Submitted 13 February, 2024; v1 submitted 14 April, 2023;
originally announced April 2023.
-
Embedded Topics in the Stochastic Block Model
Authors:
Rémi Boutin,
Charles Bouveyron,
Pierre Latouche
Abstract:
Communication networks such as emails or social networks are now ubiquitous and their analysis has become a strategic field. In many applications, the goal is to automatically extract relevant information by looking at the nodes and their connections. Unfortunately, most of the existing methods focus on analysing the presence or absence of edges and textual data is often discarded. However, all co…
▽ More
Communication networks such as emails or social networks are now ubiquitous and their analysis has become a strategic field. In many applications, the goal is to automatically extract relevant information by looking at the nodes and their connections. Unfortunately, most of the existing methods focus on analysing the presence or absence of edges and textual data is often discarded. However, all communication networks actually come with textual data on the edges. In order to take into account this specificity, we consider in this paper networks for which two nodes are linked if and only if they share textual data. We introduce a deep latent variable model allowing embedded topics to be handled called ETSBM to simultaneously perform clustering on the nodes while modelling the topics used between the different clusters. ETSBM extends both the stochastic block model (SBM) and the embedded topic model (ETM) which are core models for studying networks and corpora, respectively. The inference is done using a variational-Bayes expectation-maximisation algorithm combined with a stochastic gradient descent. The methodology is evaluated on synthetic data and on a real world dataset.
△ Less
Submitted 25 July, 2023; v1 submitted 19 September, 2022;
originally announced September 2022.
-
Motif-based tests for bipartite networks
Authors:
Sarah Ouadah,
Pierre Latouche,
Stéphane Robin
Abstract:
Bipartite networks are a natural representation of the interactions between entities from two different types. The organization (or topology) of such networks gives insight to understand the systems they describe as a whole. Here, we rely on motifs which provide a meso-scale description of the topology. Moreover, we consider the bipartite expected degree distribution (B-EDD) model which accounts f…
▽ More
Bipartite networks are a natural representation of the interactions between entities from two different types. The organization (or topology) of such networks gives insight to understand the systems they describe as a whole. Here, we rely on motifs which provide a meso-scale description of the topology. Moreover, we consider the bipartite expected degree distribution (B-EDD) model which accounts for both the density of the network and possible imbalances between the degrees of the nodes. Under the B-EDD model, we prove the asymptotic normality of the count of any given motif, considering sparsity conditions. We also provide close-form expressions for the mean and the variance of this count. This allows to avoid computationally prohibitive resampling procedures. Based on these results, we define a goodness-of-fit test for the B-EDD model and propose a family of tests for network comparisons. We assess the asymptotic normality of the test statistics and the power of the proposed tests on synthetic experiments and illustrate their use on ecological data sets.
△ Less
Submitted 16 October, 2023; v1 submitted 27 January, 2021;
originally announced January 2021.
-
A Bayesian Fisher-EM algorithm for discriminative Gaussian subspace clustering
Authors:
Nicolas Jouvin,
Charles Bouveyron,
Pierre Latouche
Abstract:
High-dimensional data clustering has become and remains a challenging task for modern statistics and machine learning, with a wide range of applications. We consider in this work the powerful discriminative latent mixture model, and we extend it to the Bayesian framework. Modeling data as a mixture of Gaussians in a low-dimensional discriminative subspace, a Gaussian prior distribution is introduc…
▽ More
High-dimensional data clustering has become and remains a challenging task for modern statistics and machine learning, with a wide range of applications. We consider in this work the powerful discriminative latent mixture model, and we extend it to the Bayesian framework. Modeling data as a mixture of Gaussians in a low-dimensional discriminative subspace, a Gaussian prior distribution is introduced over the latent group means and a family of twelve submodels are derived considering different covariance structures. Model inference is done with a variational EM algorithm, while the discriminative subspace is estimated via a Fisher-step maximizing an unsupervised Fisher criterion. An empirical Bayes procedure is proposed for the estimation of the prior hyper-parameters, and an integrated classification likelihood criterion is derived for selecting both the number of clusters and the submodel. The performances of the resulting Bayesian Fisher-EM algorithm are investigated in two thorough simulated scenarios, regarding both dimensionality as well as noise and assessing its superiority with respect to state-of-the-art Gaussian subspace clustering models. In addition to standard real data benchmarks, an application to single image denoising is proposed, displaying relevant results. This work comes with a reference implementation for the R software in the FisherEM package accompanying the paper.
△ Less
Submitted 8 December, 2020;
originally announced December 2020.
-
Cluster-Specific Predictions with Multi-Task Gaussian Processes
Authors:
Arthur Leroy,
Pierre Latouche,
Benjamin Guedj,
Servane Gey
Abstract:
A model involving Gaussian processes (GPs) is introduced to simultaneously handle multi-task learning, clustering, and prediction for multiple functional data. This procedure acts as a model-based clustering method for functional data as well as a learning step for subsequent predictions for new tasks. The model is instantiated as a mixture of multi-task GPs with common mean processes. A variation…
▽ More
A model involving Gaussian processes (GPs) is introduced to simultaneously handle multi-task learning, clustering, and prediction for multiple functional data. This procedure acts as a model-based clustering method for functional data as well as a learning step for subsequent predictions for new tasks. The model is instantiated as a mixture of multi-task GPs with common mean processes. A variational EM algorithm is derived for dealing with the optimisation of the hyper-parameters along with the hyper-posteriors' estimation of latent variables and processes. We establish explicit formulas for integrating the mean processes and the latent clustering variables within a predictive distribution, accounting for uncertainty on both aspects. This distribution is defined as a mixture of cluster-specific GP predictions, which enhances the performances when dealing with group-structured data. The model handles irregular grid of observations and offers different hypotheses on the covariance structure for sharing additional information across tasks. The performances on both clustering and prediction tasks are assessed through various simulated scenarios and real datasets. The overall algorithm, called MagmaClust, is publicly available as an R package.
△ Less
Submitted 30 November, 2022; v1 submitted 16 November, 2020;
originally announced November 2020.
-
MAGMA: Inference and Prediction with Multi-Task Gaussian Processes
Authors:
Arthur Leroy,
Pierre Latouche,
Benjamin Guedj,
Servane Gey
Abstract:
A novel multi-task Gaussian process (GP) framework is proposed, by using a common mean process for sharing information across tasks. In particular, we investigate the problem of time series forecasting, with the objective to improve multiple-step-ahead predictions. The common mean process is defined as a GP for which the hyper-posterior distribution is tractable. Therefore an EM algorithm is deriv…
▽ More
A novel multi-task Gaussian process (GP) framework is proposed, by using a common mean process for sharing information across tasks. In particular, we investigate the problem of time series forecasting, with the objective to improve multiple-step-ahead predictions. The common mean process is defined as a GP for which the hyper-posterior distribution is tractable. Therefore an EM algorithm is derived for handling both hyper-parameters optimisation and hyper-posterior computation. Unlike previous approaches in the literature, the model fully accounts for uncertainty and can handle irregular grids of observations while maintaining explicit formulations, by modelling the mean process in a unified GP framework. Predictive analytical equations are provided, integrating information shared across tasks through a relevant prior mean. This approach greatly improves the predictive performances, even far from observations, and may reduce significantly the computational complexity compared to traditional multi-task GP models. Our overall algorithm is called \textsc{Magma} (standing for Multi tAsk Gaussian processes with common MeAn). The quality of the mean process estimation, predictive performances, and comparisons to alternatives are assessed in various simulated scenarios and on real datasets.
△ Less
Submitted 24 May, 2022; v1 submitted 21 July, 2020;
originally announced July 2020.
-
Hierarchical clustering with discrete latent variable models and the integrated classification likelihood
Authors:
Etienne Côme,
Nicolas Jouvin,
Pierre Latouche,
Charles Bouveyron
Abstract:
Finding a set of nested partitions of a dataset is useful to uncover relevant structure at different scales, and is often dealt with a data-dependent methodology. In this paper, we introduce a general two-step methodology for model-based hierarchical clustering. Considering the integrated classification likelihood criterion as an objective function, this work applies to every discrete latent varia…
▽ More
Finding a set of nested partitions of a dataset is useful to uncover relevant structure at different scales, and is often dealt with a data-dependent methodology. In this paper, we introduce a general two-step methodology for model-based hierarchical clustering. Considering the integrated classification likelihood criterion as an objective function, this work applies to every discrete latent variable models (DLVMs) where this quantity is tractable. The first step of the methodology involves maximizing the criterion with respect to the partition. Addressing the known problem of sub-optimal local maxima found by greedy hill climbing heuristics, we introduce a new hybrid algorithm based on a genetic algorithm efficiently exploring the space of solutions. The resulting algorithm carefully combines and merges different solutions, and allows the joint inference of the number $K$ of clusters as well as the clusters themselves. Starting from this natural partition, the second step of the methodology is based on a bottom-up greedy procedure to extract a hierarchy of clusters. In a Bayesian context, this is achieved by considering the Dirichlet cluster proportion prior parameter $α$ as a regularization term controlling the granularity of the clustering. A new approximation of the criterion is derived as a log-linear function of $α$, enabling a simple functional form of the merge decision criterion. This second step allows the exploration of the clustering at coarser scales. The proposed approach is compared with existing strategies on simulated as well as real settings, and its results are shown to be particularly relevant. A reference implementation of this work is available in the R package greed accompanying the paper.
△ Less
Submitted 21 April, 2021; v1 submitted 26 February, 2020;
originally announced February 2020.
-
Greedy clustering of count data through a mixture of multinomial PCA
Authors:
Nicolas Jouvin,
Pierre Latouche,
Charles Bouveyron,
Guillaume Bataillon,
Alain Livartowski
Abstract:
Count data is becoming more and more ubiquitous in a wide range of applications, with datasets growing both in size and in dimension. In this context, an increasing amount of work is dedicated to the construction of statistical models directly accounting for the discrete nature of the data. Moreover, it has been shown that integrating dimension reduction to clustering can drastically improve perfo…
▽ More
Count data is becoming more and more ubiquitous in a wide range of applications, with datasets growing both in size and in dimension. In this context, an increasing amount of work is dedicated to the construction of statistical models directly accounting for the discrete nature of the data. Moreover, it has been shown that integrating dimension reduction to clustering can drastically improve performance and stability. In this paper, we rely on the mixture of multinomial PCA, a mixture model for the clustering of count data, also known as the probabilistic clustering-projection model in the literature. Related to the latent Dirichlet allocation model, it offers the flexibility of topic modeling while being able to assign each observation to a unique cluster. We introduce a greedy clustering algorithm, where inference and clustering are jointly done by mixing a classification variational expectation maximization algorithm, with a branch & bound like strategy on a variational lower bound. An integrated classification likelihood criterion is derived for model selection, and a thorough study with numerical experiments is proposed to assess both the performance and robustness of the method. Finally, we illustrate the qualitative interest of the latter in a real-world application, for the clustering of anatomopathological medical reports, in partnership with expert practitioners from the Institut Curie hospital.
△ Less
Submitted 10 July, 2020; v1 submitted 2 September, 2019;
originally announced September 2019.
-
Block modelling in dynamic networks with non-homogeneous Poisson processes and exact ICL
Authors:
Marco Corneli,
Pierre Latouche,
Fabrice Rossi
Abstract:
We develop a model in which interactions between nodes of a dynamic network are counted by non homogeneous Poisson processes. In a block modelling perspective, nodes belong to hidden clusters (whose number is unknown) and the intensity functions of the counting processes only depend on the clusters of nodes. In order to make inference tractable we move to discrete time by partitioning the entire t…
▽ More
We develop a model in which interactions between nodes of a dynamic network are counted by non homogeneous Poisson processes. In a block modelling perspective, nodes belong to hidden clusters (whose number is unknown) and the intensity functions of the counting processes only depend on the clusters of nodes. In order to make inference tractable we move to discrete time by partitioning the entire time horizon in which interactions are observed in fixed-length time sub-intervals. First, we derive an exact integrated classification likelihood criterion and maximize it relying on a greedy search approach. This allows to estimate the memberships to clusters and the number of clusters simultaneously. Then a maximum-likelihood estimator is developed to estimate non parametrically the integrated intensities. We discuss the over-fitting problems of the model and propose a regularized version solving these issues. Experiments on real and simulated data are carried out in order to assess the proposed methodology.
△ Less
Submitted 10 July, 2017;
originally announced July 2017.
-
Exact Dimensionality Selection for Bayesian PCA
Authors:
Charles Bouveyron,
Pierre Latouche,
Pierre-Alexandre Mattei
Abstract:
We present a Bayesian model selection approach to estimate the intrinsic dimensionality of a high-dimensional dataset. To this end, we introduce a novel formulation of the probabilisitic principal component analysis model based on a normal-gamma prior distribution. In this context, we exhibit a closed-form expression of the marginal likelihood which allows to infer an optimal number of components.…
▽ More
We present a Bayesian model selection approach to estimate the intrinsic dimensionality of a high-dimensional dataset. To this end, we introduce a novel formulation of the probabilisitic principal component analysis model based on a normal-gamma prior distribution. In this context, we exhibit a closed-form expression of the marginal likelihood which allows to infer an optimal number of components. We also propose a heuristic based on the expected shape of the marginal likelihood curve in order to choose the hyperparameters. In non-asymptotic frameworks, we show on simulated data that this exact dimensionality selection approach is competitive with both Bayesian and frequentist state-of-the-art methods.
△ Less
Submitted 21 May, 2019; v1 submitted 8 March, 2017;
originally announced March 2017.
-
Choosing the number of groups in a latent stochastic block model for dynamic networks
Authors:
Riccardo Rastelli,
Pierre Latouche,
Nial Friel
Abstract:
Latent stochastic block models are flexible statistical models that are widely used in social network analysis. In recent years, efforts have been made to extend these models to temporal dynamic networks, whereby the connections between nodes are observed at a number of different times. In this paper we extend the original stochastic block model by using a Markovian property to describe the evolut…
▽ More
Latent stochastic block models are flexible statistical models that are widely used in social network analysis. In recent years, efforts have been made to extend these models to temporal dynamic networks, whereby the connections between nodes are observed at a number of different times. In this paper we extend the original stochastic block model by using a Markovian property to describe the evolution of nodes' cluster memberships over time. We recast the problem of clustering the nodes of the network into a model-based context, and show that the integrated completed likelihood can be evaluated analytically for a number of likelihood models. Then, we propose a scalable greedy algorithm to maximise this quantity, thereby estimating both the optimal partition and the ideal number of groups in a single inferential framework. Finally we propose applications of our methodology to both real and artificial datasets.
△ Less
Submitted 22 March, 2017; v1 submitted 5 February, 2017;
originally announced February 2017.
-
Bayesian Variable Selection for Globally Sparse Probabilistic PCA
Authors:
Charles Bouveyron,
Pierre Latouche,
Pierre-Alexandre Mattei
Abstract:
Sparse versions of principal component analysis (PCA) have imposed themselves as simple, yet powerful ways of selecting relevant features of high-dimensional data in an unsupervised manner. However, when several sparse principal components are computed, the interpretation of the selected variables is difficult since each axis has its own sparsity pattern and has to be interpreted separately. To ov…
▽ More
Sparse versions of principal component analysis (PCA) have imposed themselves as simple, yet powerful ways of selecting relevant features of high-dimensional data in an unsupervised manner. However, when several sparse principal components are computed, the interpretation of the selected variables is difficult since each axis has its own sparsity pattern and has to be interpreted separately. To overcome this drawback, we propose a Bayesian procedure called globally sparse probabilistic PCA (GSPPCA) that allows to obtain several sparse components with the same sparsity pattern. This allows the practitioner to identify the original variables which are relevant to describe the data. To this end, using Roweis' probabilistic interpretation of PCA and a Gaussian prior on the loading matrix, we provide the first exact computation of the marginal likelihood of a Bayesian PCA model. To avoid the drawbacks of discrete model selection, a simple relaxation of this framework is presented. It allows to find a path of models using a variational expectation-maximization algorithm. The exact marginal likelihood is then maximized over this path. This approach is illustrated on real and synthetic data sets. In particular, using unlabeled microarray data, GSPPCA infers much more relevant gene subsets than traditional sparse PCA algorithms.
△ Less
Submitted 20 September, 2016; v1 submitted 19 May, 2016;
originally announced May 2016.
-
Exact ICL maximization in a non-stationary temporal extension of the stochastic block model for dynamic networks
Authors:
Marco Corneli,
Pierre Latouche,
Fabrice Rossi
Abstract:
The stochastic block model (SBM) is a flexible probabilistic tool that can be used to model interactions between clusters of nodes in a network. However, it does not account for interactions of time varying intensity between clusters. The extension of the SBM developed in this paper addresses this shortcoming through a temporal partition: assuming interactions between nodes are recorded on fixed-l…
▽ More
The stochastic block model (SBM) is a flexible probabilistic tool that can be used to model interactions between clusters of nodes in a network. However, it does not account for interactions of time varying intensity between clusters. The extension of the SBM developed in this paper addresses this shortcoming through a temporal partition: assuming interactions between nodes are recorded on fixed-length time intervals, the inference procedure associated with the model we propose allows to cluster simultaneously the nodes of the network and the time intervals. The number of clusters of nodes and of time intervals, as well as the memberships to clusters, are obtained by maximizing an exact integrated complete-data likelihood, relying on a greedy search approach. Experiments on simulated and real data are carried out in order to assess the proposed methodology.
△ Less
Submitted 10 July, 2017; v1 submitted 9 May, 2016;
originally announced May 2016.
-
Is the corporate elite disintegrating? Interlock boards and the Mizruchi hypothesis
Authors:
Kevin Mentzer,
Francois-Xavier Dudouet,
Dominique Haughton,
Pierre Latouche,
Fabrice Rossi
Abstract:
This paper proposes an approach for comparing interlocked board networks over time to test for statistically significant change. In addition to contributing to the conversation about whether the Mizruchi hypothesis (that a disintegration of power is occurring within the corporate elite) holds or not, we propose novel methods to handle a longitudinal investigation of a series of social networks whe…
▽ More
This paper proposes an approach for comparing interlocked board networks over time to test for statistically significant change. In addition to contributing to the conversation about whether the Mizruchi hypothesis (that a disintegration of power is occurring within the corporate elite) holds or not, we propose novel methods to handle a longitudinal investigation of a series of social networks where the nodes undergo a few modifications at each time point. Methodologically, our contribution is twofold: we extend a Bayesian model hereto applied to compare two time periods to a longer time period, and we define and employ the concept of a hull of a sequence of social networks, which makes it possible to circumvent the problem of changing nodes over time.
△ Less
Submitted 8 February, 2016;
originally announced February 2016.
-
Modelling time evolving interactions in networks through a non stationary extension of stochastic block models
Authors:
Marco Corneli,
Pierre Latouche,
Fabrice Rossi
Abstract:
In this paper, we focus on the stochastic block model (SBM),a probabilistic tool describing interactions between nodes of a network using latent clusters. The SBM assumes that the networkhas a stationary structure, in which connections of time varying intensity are not taken into account. In other words, interactions between two groups are forced to have the same features during the whole observat…
▽ More
In this paper, we focus on the stochastic block model (SBM),a probabilistic tool describing interactions between nodes of a network using latent clusters. The SBM assumes that the networkhas a stationary structure, in which connections of time varying intensity are not taken into account. In other words, interactions between two groups are forced to have the same features during the whole observation time. To overcome this limitation,we propose a partition of the whole time horizon, in which interactions are observed, and develop a non stationary extension of the SBM,allowing to simultaneously cluster the nodes in a network along with fixed time intervals in which the interactions take place. The number of clusters (K for nodes, D for time intervals) as well as the class memberships are finallyobtained through maximizing the complete-data integrated likelihood by means of a greedy search approach. After showing that the model works properly with simulated data, we focus on a real data set. We thus consider the three days ACM Hypertext conference held in Turin,June 29th - July 1st 2009. Proximity interactions between attendees during the first day are modelled and an interestingclustering of the daily hours is finally obtained, with times of social gathering (e.g. coffee breaks) recovered by the approach. Applications to large networks are limited due to the computational complexity of the greedy search which is dominated bythe number $K\_{max}$ and $D\_{max}$ of clusters used in the initialization. Therefore,advanced clustering tools are considered to reduce the number of clusters expected in the data, making the greedy search applicable to large networks.
△ Less
Submitted 8 September, 2015;
originally announced September 2015.
-
Goodness of fit of logistic models for random graphs
Authors:
Pierre Latouche,
Stéphane Robin,
Sarah Ouadah
Abstract:
Logistic regression is a natural and simple tool to understand how covariates contribute to explain the topology of a binary network. Once the model fitted, the practitioner is interested in the goodness-of-fit of the regression in order to check if the covariates are sufficient to explain the whole topology of the network and, if they are not, to analyze the residual structure. To address this pr…
▽ More
Logistic regression is a natural and simple tool to understand how covariates contribute to explain the topology of a binary network. Once the model fitted, the practitioner is interested in the goodness-of-fit of the regression in order to check if the covariates are sufficient to explain the whole topology of the network and, if they are not, to analyze the residual structure. To address this problem, we introduce a generic model that combines logistic regression with a network-oriented residual term. This residual term takes the form of the graphon function of a W-graph. Using a variational Bayes framework, we infer the residual graphon by averaging over a series of blockwise constant functions. This approach allows us to define a generic goodness-of-fit criterion, which corresponds to the posterior probability for the residual graphon to be constant. Experiments on toy data are carried out to assess the accuracy of the procedure. Several networks from social sciences and ecology are studied to illustrate the proposed methodology.
△ Less
Submitted 6 January, 2017; v1 submitted 2 August, 2015;
originally announced August 2015.
-
Degree-based goodness-of-fit tests for heterogeneous random graph models : independent and exchangeable cases
Authors:
Sarah Ouadah,
Stéphane Robin,
Pierre Latouche
Abstract:
The degrees are a classical and relevant way to study the topology of a network. They can be used to assess the goodness-of-fit for a given random graph model. In this paper we introduce goodness-of-fit tests for two classes of models. First, we consider the case of independent graph models such as the heterogeneous Erdös-Rényi model in which the edges have different connection probabilities. Seco…
▽ More
The degrees are a classical and relevant way to study the topology of a network. They can be used to assess the goodness-of-fit for a given random graph model. In this paper we introduce goodness-of-fit tests for two classes of models. First, we consider the case of independent graph models such as the heterogeneous Erdös-Rényi model in which the edges have different connection probabilities. Second, we consider a generic model for exchangeable random graphs called the W-graph. The stochastic block model and the expected degree distribution model fall within this framework. We prove the asymptotic normality of the degree mean square under these independent and exchangeable models and derive formal tests. We study the power of the proposed tests and we prove the asymptotic normality under specific sparsity regimes. The tests are illustrated on real networks from social sciences and ecology, and their performances are assessed via a simulation study.
△ Less
Submitted 29 July, 2019; v1 submitted 29 July, 2015;
originally announced July 2015.
-
Graphs in machine learning: an introduction
Authors:
Pierre Latouche,
Fabrice Rossi
Abstract:
Graphs are commonly used to characterise interactions between objects of interest. Because they are based on a straightforward formalism, they are used in many scientific fields from computer science to historical sciences. In this paper, we give an introduction to some methods relying on graphs for learning. This includes both unsupervised and supervised methods. Unsupervised learning algorithms…
▽ More
Graphs are commonly used to characterise interactions between objects of interest. Because they are based on a straightforward formalism, they are used in many scientific fields from computer science to historical sciences. In this paper, we give an introduction to some methods relying on graphs for learning. This includes both unsupervised and supervised methods. Unsupervised learning algorithms usually aim at visualising graphs in latent spaces and/or clustering the nodes. Both focus on extracting knowledge from graph topologies. While most existing techniques are only applicable to static graphs, where edges do not evolve through time, recent developments have shown that they could be extended to deal with evolving networks. In a supervised context, one generally aims at inferring labels or numerical values attached to nodes using both the graph and, when they are available, node characteristics. Balancing the two sources of information can be challenging, especially as they can disagree locally or globally. In both contexts, supervised and un-supervised, data can be relational (augmented with one or several global graphs) as described above, or graph valued. In this latter case, each object of interest is given as a full graph (possibly completed by other characteristics). In this context, natural tasks include graph clustering (as in producing clusters of graphs rather than clusters of nodes in a single graph), graph classification, etc. 1 Real networks One of the first practical studies on graphs can be dated back to the original work of Moreno [51] in the 30s. Since then, there has been a growing interest in graph analysis associated with strong developments in the modelling and the processing of these data. Graphs are now used in many scientific fields. In Biology [54, 2, 7], for instance, metabolic networks can describe pathways of biochemical reactions [41], while in social sciences networks are used to represent relation ties between actors [66, 56, 36, 34]. Other examples include powergrids [71] and the web [75]. Recently, networks have also been considered in other areas such as geography [22] and history [59, 39]. In machine learning, networks are seen as powerful tools to model problems in order to extract information from data and for prediction purposes. This is the object of this paper. For more complete surveys, we refer to [28, 62, 49, 45]. In this section, we introduce notations and highlight properties shared by most real networks. In Section 2, we then consider methods aiming at extracting information from a unique network. We will particularly focus on clustering methods where the goal is to find clusters of vertices. Finally, in Section 3, techniques that take a series of networks into account, where each network is
△ Less
Submitted 23 June, 2015;
originally announced June 2015.
-
Exact ICL maximization in a non-stationary time extension of the latent block model for dynamic networks
Authors:
Marco Corneli,
Pierre Latouche,
Fabrice Rossi
Abstract:
The latent block model (LBM) is a flexible probabilistic tool to describe interactions between node sets in bipartite networks, but it does not account for interactions of time varying intensity between nodes in unknown classes. In this paper we propose a non stationary temporal extension of the LBM that clusters simultaneously the two node sets of a bipartite network and constructs classes of tim…
▽ More
The latent block model (LBM) is a flexible probabilistic tool to describe interactions between node sets in bipartite networks, but it does not account for interactions of time varying intensity between nodes in unknown classes. In this paper we propose a non stationary temporal extension of the LBM that clusters simultaneously the two node sets of a bipartite network and constructs classes of time intervals on which interactions are stationary. The number of clusters as well as the membership to classes are obtained by maximizing the exact complete-data integrated likelihood relying on a greedy search approach. Experiments on simulated and real data are carried out in order to assess the proposed methodology.
△ Less
Submitted 12 June, 2015;
originally announced June 2015.
-
Model Selection in Overlap** Stochastic Block Models
Authors:
P. Latouche,
E. Birmelé,
C. Ambroise
Abstract:
Networks are a commonly used mathematical model to describe the rich set of interactions between objects of interest. Many clustering methods have been developed in order to partition such structures, among which several rely on underlying probabilistic models, typically mixture models. The relevant hidden structure may however show overlap** groups in several applications. The Overlap** Stoch…
▽ More
Networks are a commonly used mathematical model to describe the rich set of interactions between objects of interest. Many clustering methods have been developed in order to partition such structures, among which several rely on underlying probabilistic models, typically mixture models. The relevant hidden structure may however show overlap** groups in several applications. The Overlap** Stochastic Block Model (2011) has been developed to take this phenomenon into account. Nevertheless, the problem of the choice of the number of classes in the inference step is still open. To tackle this issue, we consider the proposed model in a Bayesian framework and develop a new criterion based on a non asymptotic approximation of the marginal log-likelihood. We describe how the criterion can be computed through a variational Bayes EM algorithm, and demonstrate its efficiency by running it on both simulated and real data.
△ Less
Submitted 12 May, 2014;
originally announced May 2014.
-
Inferring structure in bipartite networks using the latent block model and exact ICL
Authors:
Jason Wyse,
Nial Friel,
Pierre Latouche
Abstract:
We consider the task of simultaneous clustering of the two node sets involved in a bipartite network. The approach we adopt is based on use of the exact integrated complete likelihood for the latent block model. Using this allows one to infer the number of clusters as well as cluster memberships using a greedy search. This gives a model-based clustering of the node sets. Experiments on simulated b…
▽ More
We consider the task of simultaneous clustering of the two node sets involved in a bipartite network. The approach we adopt is based on use of the exact integrated complete likelihood for the latent block model. Using this allows one to infer the number of clusters as well as cluster memberships using a greedy search. This gives a model-based clustering of the node sets. Experiments on simulated bipartite network data show that the greedy search approach is vastly more scalable than competing Markov chain Monte Carlo based methods. Application to a number of real observed bipartite networks demonstrate the algorithms discussed.
△ Less
Submitted 16 May, 2015; v1 submitted 10 April, 2014;
originally announced April 2014.
-
Variational Bayes model averaging for graphon functions and motif frequencies inference in W-graph models
Authors:
P. Latouche,
S Robin
Abstract:
W-graph refers to a general class of random graph models that can be seen as a random graph limit. It is characterized by both its graphon function and its motif frequencies. In this paper, relying on an existing variational Bayes algorithm for the stochastic block models along with the corresponding weights for model averaging, we derive an estimate of the graphon function as an average of stocha…
▽ More
W-graph refers to a general class of random graph models that can be seen as a random graph limit. It is characterized by both its graphon function and its motif frequencies. In this paper, relying on an existing variational Bayes algorithm for the stochastic block models along with the corresponding weights for model averaging, we derive an estimate of the graphon function as an average of stochastic block models with increasing number of blocks. In the same framework, we derive the variational posterior frequency of any motif. A simulation study and an illustration on a social network complete our work.
△ Less
Submitted 5 November, 2015; v1 submitted 23 October, 2013;
originally announced October 2013.
-
Activity date estimation in timestamped interaction networks
Authors:
Fabrice Rossi,
Pierre Latouche
Abstract:
We propose in this paper a new generative model for graphs that uses a latent space approach to explain timestamped interactions. The model is designed to provide global estimates of activity dates in historical networks where only the interaction dates between agents are known with reasonable precision. Experimental results show that the model provides better results than local averages in dense…
▽ More
We propose in this paper a new generative model for graphs that uses a latent space approach to explain timestamped interactions. The model is designed to provide global estimates of activity dates in historical networks where only the interaction dates between agents are known with reasonable precision. Experimental results show that the model provides better results than local averages in dense enough networks
△ Less
Submitted 18 October, 2013;
originally announced October 2013.
-
Model selection and clustering in stochastic block models with the exact integrated complete data likelihood
Authors:
E. Côme,
P. Latouche
Abstract:
The stochastic block model (SBM) is a mixture model used for the clustering of nodes in networks. It has now been employed for more than a decade to analyze very different types of networks in many scientific fields such as Biology and social sciences. Because of conditional dependency, there is no analytical expression for the posterior distribution over the latent variables, given the data and m…
▽ More
The stochastic block model (SBM) is a mixture model used for the clustering of nodes in networks. It has now been employed for more than a decade to analyze very different types of networks in many scientific fields such as Biology and social sciences. Because of conditional dependency, there is no analytical expression for the posterior distribution over the latent variables, given the data and model parameters. Therefore, approximation strategies, based on variational techniques or sampling, have been proposed for clustering. Moreover, two SBM model selection criteria exist for the estimation of the number K of clusters in networks but, again, both of them rely on some approximations. In this paper, we show how an analytical expression can be derived for the integrated complete data log likelihood. We then propose an inference algorithm to maximize this exact quantity. This strategy enables the clustering of nodes as well as the estimation of the number clusters to be performed at the same time and no model selection criterion has to be computed for various values of K. The algorithm we propose has a better computational cost than existing inference techniques for SBM and can be employed to analyze large networks with ten thousand nodes. Using toy and true data sets, we compare our work with other approaches.
△ Less
Submitted 9 May, 2014; v1 submitted 12 March, 2013;
originally announced March 2013.
-
The random subgraph model for the analysis of an ecclesiastical network in Merovingian Gaul
Authors:
Yacine Jernite,
Pierre Latouche,
Charles Bouveyron,
Patrick Rivera,
Laurent Jegou,
Stéphane Lamassé
Abstract:
In the last two decades many random graph models have been proposed to extract knowledge from networks. Most of them look for communities or, more generally, clusters of vertices with homogeneous connection profiles. While the first models focused on networks with binary edges only, extensions now allow to deal with valued networks. Recently, new models were also introduced in order to characteriz…
▽ More
In the last two decades many random graph models have been proposed to extract knowledge from networks. Most of them look for communities or, more generally, clusters of vertices with homogeneous connection profiles. While the first models focused on networks with binary edges only, extensions now allow to deal with valued networks. Recently, new models were also introduced in order to characterize connection patterns in networks through mixed memberships. This work was motivated by the need of analyzing a historical network where a partition of the vertices is given and where edges are typed. A known partition is seen as a decomposition of a network into subgraphs that we propose to model using a stochastic model with unknown latent clusters. Each subgraph has its own mixing vector and sees its vertices associated to the clusters. The vertices then connect with a probability depending on the subgraphs only, while the types of edges are assumed to be sampled from the latent clusters. A variational Bayes expectation-maximization algorithm is proposed for inference as well as a model selection criterion for the estimation of the cluster number. Experiments are carried out on simulated data to assess the approach. The proposed methodology is then applied to an ecclesiastical network in Merovingian Gaul. An R code, called Rambo, implementing the inference algorithm is available from the authors upon request.
△ Less
Submitted 5 May, 2014; v1 submitted 21 December, 2012;
originally announced December 2012.
-
Variational Bayesian Inference and Complexity Control for Stochastic Block Models
Authors:
Pierre Latouche,
Etienne Birmele,
Christophe Ambroise
Abstract:
It is now widely accepted that knowledge can be acquired from networks by clustering their vertices according to connection profiles. Many methods have been proposed and in this paper we concentrate on the Stochastic Block Model (SBM). The clustering of vertices and the estimation of SBM model parameters have been subject to previous work and numerous inference strategies such as variational Expec…
▽ More
It is now widely accepted that knowledge can be acquired from networks by clustering their vertices according to connection profiles. Many methods have been proposed and in this paper we concentrate on the Stochastic Block Model (SBM). The clustering of vertices and the estimation of SBM model parameters have been subject to previous work and numerous inference strategies such as variational Expectation Maximization (EM) and classification EM have been proposed. However, SBM still suffers from a lack of criteria to estimate the number of components in the mixture. To our knowledge, only one model based criterion, ICL, has been derived for SBM in the literature. It relies on an asymptotic approximation of the Integrated Complete-data Likelihood and recent studies have shown that it tends to be too conservative in the case of small networks. To tackle this issue, we propose a new criterion that we call ILvb, based on a non asymptotic approximation of the marginal likelihood. We describe how the criterion can be computed through a variational Bayes EM algorithm.
△ Less
Submitted 24 July, 2010; v1 submitted 15 December, 2009;
originally announced December 2009.
-
Overlap** stochastic block models with application to the French political blogosphere
Authors:
Pierre Latouche,
Etienne Birmelé,
Christophe Ambroise
Abstract:
Complex systems in nature and in society are often represented as networks, describing the rich set of interactions between objects of interest. Many deterministic and probabilistic clustering methods have been developed to analyze such structures. Given a network, almost all of them partition the vertices into disjoint clusters, according to their connection profile. However, recent studies have…
▽ More
Complex systems in nature and in society are often represented as networks, describing the rich set of interactions between objects of interest. Many deterministic and probabilistic clustering methods have been developed to analyze such structures. Given a network, almost all of them partition the vertices into disjoint clusters, according to their connection profile. However, recent studies have shown that these techniques were too restrictive and that most of the existing networks contained overlap** clusters. To tackle this issue, we present in this paper the Overlap** Stochastic Block Model. Our approach allows the vertices to belong to multiple clusters, and, to some extent, generalizes the well-known Stochastic Block Model [Nowicki and Snijders (2001)]. We show that the model is generically identifiable within classes of equivalence and we propose an approximate inference procedure, based on global and local variational techniques. Using toy data sets as well as the French Political Blogosphere network and the transcriptional network of Saccharomyces cerevisiae, we compare our work with other approaches.
△ Less
Submitted 15 April, 2011; v1 submitted 12 October, 2009;
originally announced October 2009.