-
Graph-based Topic Extraction from Vector Embeddings of Text Documents: Application to a Corpus of News Articles
Authors:
M. Tarik Altuncu,
Sophia N. Yaliraki,
Mauricio Barahona
Abstract:
Production of news content is growing at an astonishing rate. To help manage and monitor the sheer amount of text, there is an increasing need to develop efficient methods that can provide insights into emerging content areas, and stratify unstructured corpora of text into `topics' that stem intrinsically from content similarity. Here we present an unsupervised framework that brings together power…
▽ More
Production of news content is growing at an astonishing rate. To help manage and monitor the sheer amount of text, there is an increasing need to develop efficient methods that can provide insights into emerging content areas, and stratify unstructured corpora of text into `topics' that stem intrinsically from content similarity. Here we present an unsupervised framework that brings together powerful vector embeddings from natural language processing with tools from multiscale graph partitioning that can reveal natural partitions at different resolutions without making a priori assumptions about the number of clusters in the corpus. We show the advantages of graph-based clustering through end-to-end comparisons with other popular clustering and topic modelling methods, and also evaluate different text vector embeddings, from classic Bag-of-Words to Doc2Vec to the recent transformers based model Bert. This comparative work is showcased through an analysis of a corpus of US news coverage during the presidential election year of 2016.
△ Less
Submitted 28 October, 2020;
originally announced October 2020.
-
Data-driven modelling and characterisation of task completion sequences in online courses
Authors:
Robert L. Peach,
Sam F. Greenbury,
Iain G. Johnston,
Sophia N. Yaliraki,
David Lefevre,
Mauricio Barahona
Abstract:
The intrinsic temporality of learning demands the adoption of methodologies capable of exploiting time-series information. In this study we leverage the sequence data framework and show how data-driven analysis of temporal sequences of task completion in online courses can be used to characterise personal and group learners' behaviors, and to identify critical tasks and course sessions in a given…
▽ More
The intrinsic temporality of learning demands the adoption of methodologies capable of exploiting time-series information. In this study we leverage the sequence data framework and show how data-driven analysis of temporal sequences of task completion in online courses can be used to characterise personal and group learners' behaviors, and to identify critical tasks and course sessions in a given course design. We also introduce a recently developed probabilistic Bayesian model to learn sequence trajectories of students and predict student performance. The application of our data-driven sequence-based analyses to data from learners undertaking an on-line Business Management course reveals distinct behaviors within the cohort of learners, identifying learners or groups of learners that deviate from the nominal order expected in the course. Using course grades a posteriori, we explore differences in behavior between high and low performing learners. We find that high performing learners follow the progression between weekly sessions more regularly than low performing learners, yet within each weekly session high performing learners are less tied to the nominal task order. We then model the sequences of high and low performance students using the probablistic Bayesian model and show that we can learn engagement behaviors associated with performance. We also show that the data sequence framework can be used for task centric analysis; we identify critical junctures and differences among types of tasks within the course design. We find that non-rote learning tasks, such as interactive tasks or discussion posts, are correlated with higher performance. We discuss the application of such analytical techniques as an aid to course design, intervention, and student supervision.
△ Less
Submitted 14 July, 2020;
originally announced July 2020.
-
Severability of mesoscale components and local time scales in dynamical networks
Authors:
Yun William Yu,
Jean-Charles Delvenne,
Sophia N. Yaliraki,
Mauricio Barahona
Abstract:
A major goal of dynamical systems theory is the search for simplified descriptions of the dynamics of a large number of interacting states. For overwhelmingly complex dynamical systems, the derivation of a reduced description on the entire dynamics at once is computationally infeasible. Other complex systems are so expansive that despite the continual onslaught of new data only partial information…
▽ More
A major goal of dynamical systems theory is the search for simplified descriptions of the dynamics of a large number of interacting states. For overwhelmingly complex dynamical systems, the derivation of a reduced description on the entire dynamics at once is computationally infeasible. Other complex systems are so expansive that despite the continual onslaught of new data only partial information is available. To address this challenge, we define and optimise for a local quality function severability for measuring the dynamical coherency of a set of states over time. The theoretical underpinnings of severability lie in our local adaptation of the Simon-Ando-Fisher time-scale separation theorem, which formalises the intuition of local wells in the Markov landscape of a dynamical process, or the separation between a microscopic and a macroscopic dynamics. Finally, we demonstrate the practical relevance of severability by applying it to examples drawn from power networks, image segmentation, social networks, metabolic networks, and word association.
△ Less
Submitted 4 June, 2020;
originally announced June 2020.
-
Extracting information from free text through unsupervised graph-based clustering: an application to patient incident records
Authors:
M. Tarik Altuncu,
Eloise Sorin,
Joshua D. Symons,
Erik Mayer,
Sophia N. Yaliraki,
Francesca Toni,
Mauricio Barahona
Abstract:
The large volume of text in electronic healthcare records often remains underused due to a lack of methodologies to extract interpretable content. Here we present an unsupervised framework for the analysis of free text that combines text-embedding with paragraph vectors and graph-theoretical multiscale community detection. We analyse text from a corpus of patient incident reports from the National…
▽ More
The large volume of text in electronic healthcare records often remains underused due to a lack of methodologies to extract interpretable content. Here we present an unsupervised framework for the analysis of free text that combines text-embedding with paragraph vectors and graph-theoretical multiscale community detection. We analyse text from a corpus of patient incident reports from the National Health Service in England to find content-based clusters of reports in an unsupervised manner and at different levels of resolution. Our unsupervised method extracts groups with high intrinsic textual consistency and compares well against categories hand-coded by healthcare personnel. We also show how to use our content-driven clusters to improve the supervised prediction of the degree of harm of the incident based on the text of the report. Finally, we discuss future directions to monitor reports over time, and to detect emerging trends outside pre-existing categories.
△ Less
Submitted 31 August, 2019;
originally announced September 2019.
-
Data-driven unsupervised clustering of online learner behaviour
Authors:
Robert L. Peach,
Sophia N. Yaliraki,
David Lefevre,
Mauricio Barahona
Abstract:
The widespread adoption of online courses opens opportunities for the analysis of learner behaviour and for the optimisation of web-based material adapted to observed usage. Here we introduce a mathematical framework for the analysis of time series collected from online engagement of learners, which allows the identification of clusters of learners with similar online behaviour directly from the d…
▽ More
The widespread adoption of online courses opens opportunities for the analysis of learner behaviour and for the optimisation of web-based material adapted to observed usage. Here we introduce a mathematical framework for the analysis of time series collected from online engagement of learners, which allows the identification of clusters of learners with similar online behaviour directly from the data, i.e., the groups of learners are not pre-determined subjectively but emerge algorithmically from the analysis and the data.The method uses a dynamic time war** kernel to create a pairwise similarity between time series of learner actions, and combines it with an unsupervised multiscale graph clustering algorithm to cluster groups of learners with similar patterns of behaviour. We showcase our approach on online engagement data of adult learners taking six web-based courses as part of a post-graduate degree at Imperial Business School. Our analysis identifies clusters of learners with statistically distinct patterns of engagement, ranging from distributed to massed learning, with different levels of adherence to pre-planned course structure and/or task completion, and also revealing outlier learners with highly sporadic behaviour. A posteriori comparison with performance showed that, although the majority of low-performing learners are part of in the massed learning cluster, the high performing learners are distributed across clusters with different traits of online engagement. We also show that our methodology is able to identify low performing learners more accurately than common classification methods based on raw statistics extracted from the data.
△ Less
Submitted 16 July, 2019; v1 submitted 11 February, 2019;
originally announced February 2019.
-
From Free Text to Clusters of Content in Health Records: An Unsupervised Graph Partitioning Approach
Authors:
M. Tarik Altuncu,
Erik Mayer,
Sophia N. Yaliraki,
Mauricio Barahona
Abstract:
Electronic Healthcare records contain large volumes of unstructured data in different forms. Free text constitutes a large portion of such data, yet this source of richly detailed information often remains under-used in practice because of a lack of suitable methodologies to extract interpretable content in a timely manner. Here we apply network-theoretical tools to the analysis of free text in Ho…
▽ More
Electronic Healthcare records contain large volumes of unstructured data in different forms. Free text constitutes a large portion of such data, yet this source of richly detailed information often remains under-used in practice because of a lack of suitable methodologies to extract interpretable content in a timely manner. Here we apply network-theoretical tools to the analysis of free text in Hospital Patient Incident reports in the English National Health Service, to find clusters of reports in an unsupervised manner and at different levels of resolution based directly on the free text descriptions contained within them. To do so, we combine recently developed deep neural network text-embedding methodologies based on paragraph vectors with multi-scale Markov Stability community detection applied to a similarity graph of documents obtained from sparsified text vector similarities. We showcase the approach with the analysis of incident reports submitted in Imperial College Healthcare NHS Trust, London. The multiscale community structure reveals levels of meaning with different resolution in the topics of the dataset, as shown by relevant descriptive terms extracted from the groups of records, as well as by comparing a posteriori against hand-coded categories assigned by healthcare personnel. Our content communities exhibit good correspondence with well-defined hand-coded categories, yet our results also provide further medical detail in certain areas as well as revealing complementary descriptors of incidents beyond the external classification. We also discuss how the method can be used to monitor reports over time and across different healthcare providers, and to detect emerging trends that fall outside of pre-existing categories.
△ Less
Submitted 14 November, 2018;
originally announced November 2018.
-
Content-driven, unsupervised clustering of news articles through multiscale graph partitioning
Authors:
M. Tarik Altuncu,
Sophia N. Yaliraki,
Mauricio Barahona
Abstract:
The explosion in the amount of news and journalistic content being generated across the globe, coupled with extended and instantaneous access to information through online media, makes it difficult and time-consuming to monitor news developments and opinion formation in real time. There is an increasing need for tools that can pre-process, analyse and classify raw text to extract interpretable con…
▽ More
The explosion in the amount of news and journalistic content being generated across the globe, coupled with extended and instantaneous access to information through online media, makes it difficult and time-consuming to monitor news developments and opinion formation in real time. There is an increasing need for tools that can pre-process, analyse and classify raw text to extract interpretable content; specifically, identifying topics and content-driven grou**s of articles. We present here such a methodology that brings together powerful vector embeddings from Natural Language Processing with tools from Graph Theory that exploit diffusive dynamics on graphs to reveal natural partitions across scales. Our framework uses a recent deep neural network text analysis methodology (Doc2vec) to represent text in vector form and then applies a multi-scale community detection method (Markov Stability) to partition a similarity graph of document vectors. The method allows us to obtain clusters of documents with similar content, at different levels of resolution, in an unsupervised manner. We showcase our approach with the analysis of a corpus of 9,000 news articles published by Vox Media over one year. Our results show consistent grou**s of documents according to content without a priori assumptions about the number or type of clusters to be found. The multilevel clustering reveals a quasi-hierarchy of topics and subtopics with increased intelligibility and improved topic coherence as compared to external taxonomy services and standard topic detection methods.
△ Less
Submitted 3 August, 2018;
originally announced August 2018.
-
From Text to Topics in Healthcare Records: An Unsupervised Graph Partitioning Methodology
Authors:
M. Tarik Altuncu,
Erik Mayer,
Sophia N. Yaliraki,
Mauricio Barahona
Abstract:
Electronic Healthcare Records contain large volumes of unstructured data, including extensive free text. Yet this source of detailed information often remains under-used because of a lack of methodologies to extract interpretable content in a timely manner. Here we apply network-theoretical tools to analyse free text in Hospital Patient Incident reports from the National Health Service, to find cl…
▽ More
Electronic Healthcare Records contain large volumes of unstructured data, including extensive free text. Yet this source of detailed information often remains under-used because of a lack of methodologies to extract interpretable content in a timely manner. Here we apply network-theoretical tools to analyse free text in Hospital Patient Incident reports from the National Health Service, to find clusters of documents with similar content in an unsupervised manner at different levels of resolution. We combine deep neural network paragraph vector text-embedding with multiscale Markov Stability community detection applied to a sparsified similarity graph of document vectors, and showcase the approach on incident reports from Imperial College Healthcare NHS Trust, London. The multiscale community structure reveals different levels of meaning in the topics of the dataset, as shown by descriptive terms extracted from the clusters of records. We also compare a posteriori against hand-coded categories assigned by healthcare personnel, and show that our approach outperforms LDA-based models. Our content clusters exhibit good correspondence with two levels of hand-coded categories, yet they also provide further medical detail in certain areas and reveal complementary descriptors of incidents beyond the external classification taxonomy.
△ Less
Submitted 6 July, 2018;
originally announced July 2018.
-
Community detection and role identification in directed networks: understanding the Twitter network of the care.data debate
Authors:
B. Amor,
S. Vuik,
R. Callahan,
A. Darzi,
S. N. Yaliraki,
M. Barahona
Abstract:
With the rise of social media as an important channel for the debate and discussion of public affairs, online social networks such as Twitter have become important platforms for public information and engagement by policy makers. To communicate effectively through Twitter, policy makers need to understand how influence and interest propagate within its network of users. In this chapter we use grap…
▽ More
With the rise of social media as an important channel for the debate and discussion of public affairs, online social networks such as Twitter have become important platforms for public information and engagement by policy makers. To communicate effectively through Twitter, policy makers need to understand how influence and interest propagate within its network of users. In this chapter we use graph-theoretic methods to analyse the Twitter debate surrounding NHS England's controversial care.data scheme. Directionality is a crucial feature of the Twitter social graph - information flows from the followed to the followers - but is often ignored in social network analyses; our methods are based on the behaviour of dynamic processes on the network and can be applied naturally to directed networks. We uncover robust communities of users and show that these communities reflect how information flows through the Twitter network. We are also able to classify users by their differing roles in directing the flow of information through the network. Our methods and results will be useful to policy makers who would like to use Twitter effectively as a communication medium.
△ Less
Submitted 13 August, 2015;
originally announced August 2015.
-
Great cities look small
Authors:
Aaron Sim,
Sophia N Yaliraki,
Mauricio Barahona,
Michael P H Stumpf
Abstract:
Great cities connect people; failed cities isolate people. Despite the fundamental importance of physical, face-to-face social-ties in the functioning of cities, these connectivity networks are not explicitly observed in their entirety. Attempts at estimating them often rely on unrealistic over-simplifications such as the assumption of spatial homogeneity. Here we propose a mathematical model of h…
▽ More
Great cities connect people; failed cities isolate people. Despite the fundamental importance of physical, face-to-face social-ties in the functioning of cities, these connectivity networks are not explicitly observed in their entirety. Attempts at estimating them often rely on unrealistic over-simplifications such as the assumption of spatial homogeneity. Here we propose a mathematical model of human interactions in terms of a local strategy of maximising the number of beneficial connections attainable under the constraint of limited individual travelling-time budgets. By incorporating census and openly-available online multi-modal transport data, we are able to characterise the connectivity of geometrically and topologically complex cities. Beyond providing a candidate measure of greatness, this model allows one to quantify and assess the impact of transport developments, population growth, and other infrastructure and demographic changes on a city. Supported by validations of GDP and HIV infection rates across United States metropolitan areas, we illustrate the effect of changes in local and city-wide connectivities by considering the economic impact of two contemporary inter- and intra-city transport developments in the United Kingdom: High Speed Rail 2 and London Crossrail. This derivation of the model suggests that the scaling of different urban indicators with population size has an explicitly mechanistic origin.
△ Less
Submitted 20 July, 2015;
originally announced July 2015.
-
Interest communities and flow roles in directed networks: the Twitter network of the UK riots
Authors:
Mariano Beguerisse-Díaz,
Guillermo Garduño-Hernández,
Borislav Vangelov,
Sophia N. Yaliraki,
Mauricio Barahona
Abstract:
Directionality is a crucial ingredient in many complex networks in which information, energy or influence are transmitted. In such directed networks, analysing flows (and not only the strength of connections) is crucial to reveal important features of the network that might go undetected if the orientation of connections is ignored. We showcase here a flow-based approach for community detection in…
▽ More
Directionality is a crucial ingredient in many complex networks in which information, energy or influence are transmitted. In such directed networks, analysing flows (and not only the strength of connections) is crucial to reveal important features of the network that might go undetected if the orientation of connections is ignored. We showcase here a flow-based approach for community detection in networks through the study of the network of the most influential Twitter users during the 2011 riots in England. Firstly, we use directed Markov Stability to extract descriptions of the network at different levels of coarseness in terms of interest communities, i.e., groups of nodes within which flows of information are contained and reinforced. Such interest communities reveal user grou**s according to location, profession, employer, and topic. The study of flows also allows us to generate an interest distance, which affords a personalised view of the attention in the network as viewed from the vantage point of any given user. Secondly, we analyse the profiles of incoming and outgoing long-range flows with a combined approach of role-based similarity and the novel relaxed minimum spanning tree algorithm to reveal that the users in the network can be classified into five roles. These flow roles go beyond the standard leader/follower dichotomy and differ from classifications based on regular/structural equivalence. We then show that the interest communities fall into distinct informational organigrams characterised by a different mix of user roles reflecting the quality of dialogue within them. Our generic framework can be used to provide insight into how flows are generated, distributed, preserved and consumed in directed networks.
△ Less
Submitted 8 October, 2014; v1 submitted 26 November, 2013;
originally announced November 2013.
-
The stability of a graph partition: A dynamics-based framework for community detection
Authors:
Jean-Charles Delvenne,
Michael T. Schaub,
Sophia N. Yaliraki,
Mauricio Barahona
Abstract:
Recent years have seen a surge of interest in the analysis of complex networks, facilitated by the availability of relational data and the increasingly powerful computational resources that can be employed for their analysis. Naturally, the study of real-world systems leads to highly complex networks and a current challenge is to extract intelligible, simplified descriptions from the network in te…
▽ More
Recent years have seen a surge of interest in the analysis of complex networks, facilitated by the availability of relational data and the increasingly powerful computational resources that can be employed for their analysis. Naturally, the study of real-world systems leads to highly complex networks and a current challenge is to extract intelligible, simplified descriptions from the network in terms of relevant subgraphs, which can provide insight into the structure and function of the overall system.
Sparked by seminal work by Newman and Girvan, an interesting line of research has been devoted to investigating modular community structure in networks, revitalising the classic problem of graph partitioning.
However, modular or community structure in networks has notoriously evaded rigorous definition. The most accepted notion of community is perhaps that of a group of elements which exhibit a stronger level of interaction within themselves than with the elements outside the community. This concept has resulted in a plethora of computational methods and heuristics for community detection. Nevertheless a firm theoretical understanding of most of these methods, in terms of how they operate and what they are supposed to detect, is still lacking to date.
Here, we will develop a dynamical perspective towards community detection enabling us to define a measure named the stability of a graph partition. It will be shown that a number of previously ad-hoc defined heuristics for community detection can be seen as particular cases of our method providing us with a dynamic reinterpretation of those measures. Our dynamics-based approach thus serves as a unifying framework to gain a deeper understanding of different aspects and problems associated with community detection and allows us to propose new dynamically-inspired criteria for community structure.
△ Less
Submitted 7 August, 2013;
originally announced August 2013.
-
Structure of complex networks: Quantifying edge-to-edge relations by failure-induced flow redistribution
Authors:
Michael T. Schaub,
Jörg Lehmann,
Sophia N. Yaliraki,
Mauricio Barahona
Abstract:
The analysis of complex networks has so far revolved mainly around the role of nodes and communities of nodes. However, the dynamics of interconnected systems is commonly focalised on edge processes, and a dual edge-centric perspective can often prove more natural. Here we present graph-theoretical measures to quantify edge-to-edge relations inspired by the notion of flow redistribution induced by…
▽ More
The analysis of complex networks has so far revolved mainly around the role of nodes and communities of nodes. However, the dynamics of interconnected systems is commonly focalised on edge processes, and a dual edge-centric perspective can often prove more natural. Here we present graph-theoretical measures to quantify edge-to-edge relations inspired by the notion of flow redistribution induced by edge failures. Our measures, which are related to the pseudo-inverse of the Laplacian of the network, are global and reveal the dynamical interplay between the edges of a network, including potentially non-local interactions. Our framework also allows us to define the embeddedness of an edge, a measure of how strongly an edge features in the weighted cuts of the network. We showcase the general applicability of our edge-centric framework through analyses of the Iberian Power grid, traffic flow in road networks, and the C. elegans neuronal network.
△ Less
Submitted 7 April, 2014; v1 submitted 25 March, 2013;
originally announced March 2013.
-
Markov dynamics as a zooming lens for multiscale community detection: non clique-like communities and the field-of-view limit
Authors:
Michael T. Schaub,
Jean-Charles Delvenne,
Sophia N. Yaliraki,
Mauricio Barahona
Abstract:
In recent years, there has been a surge of interest in community detection algorithms for complex networks. A variety of computational heuristics, some with a long history, have been proposed for the identification of communities or, alternatively, of good graph partitions. In most cases, the algorithms maximize a particular objective function, thereby finding the `right' split into communities. A…
▽ More
In recent years, there has been a surge of interest in community detection algorithms for complex networks. A variety of computational heuristics, some with a long history, have been proposed for the identification of communities or, alternatively, of good graph partitions. In most cases, the algorithms maximize a particular objective function, thereby finding the `right' split into communities. Although a thorough comparison of algorithms is still lacking, there has been an effort to design benchmarks, i.e., random graph models with known community structure against which algorithms can be evaluated. However, popular community detection methods and benchmarks normally assume an implicit notion of community based on clique-like subgraphs, a form of community structure that is not always characteristic of real networks. Specifically, networks that emerge from geometric constraints can have natural non clique-like substructures with large effective diameters, which can be interpreted as long-range communities. In this work, we show that long-range communities escape detection by popular methods, which are blinded by a restricted `field-of-view' limit, an intrinsic upper scale on the communities they can detect. The field-of-view limit means that long-range communities tend to be overpartitioned. We show how by adopting a dynamical perspective towards community detection (Delvenne et al. (2010) PNAS:107: 12755-12760; Lambiotte et al. (2008) arXiv:0812.1770), in which the evolution of a Markov process on the graph is used as a zooming lens over the structure of the network at all scales, one can detect both clique- or non clique-like communities without imposing an upper scale to the detection. Consequently, the performance of algorithms on inherently low-diameter, clique-like benchmarks may not always be indicative of equally good results in real networks with local, sparser connectivity.
△ Less
Submitted 17 January, 2012; v1 submitted 26 September, 2011;
originally announced September 2011.
-
Stability of graph communities across time scales
Authors:
J. -C. Delvenne,
S. N. Yaliraki,
M. Barahona
Abstract:
The complexity of biological, social and engineering networks makes it desirable to find natural partitions into communities that can act as simplified descriptions and provide insight into the structure and function of the overall system. Although community detection methods abound, there is a lack of consensus on how to quantify and rank the quality of partitions. We show here that the quality…
▽ More
The complexity of biological, social and engineering networks makes it desirable to find natural partitions into communities that can act as simplified descriptions and provide insight into the structure and function of the overall system. Although community detection methods abound, there is a lack of consensus on how to quantify and rank the quality of partitions. We show here that the quality of a partition can be measured in terms of its stability, defined in terms of the clustered autocovariance of a Markov process taking place on the graph. Because the stability has an intrinsic dependence on time scales of the graph, it allows us to compare and rank partitions at each time and also to establish the time spans over which partitions are optimal. Hence the Markov time acts effectively as an intrinsic resolution parameter that establishes a hierarchy of increasingly coarser clusterings. Within our framework we can then provide a unifying view of several standard partitioning measures: modularity and normalized cut size can be interpreted as one-step time measures, whereas Fiedler's spectral clustering emerges at long times. We apply our method to characterize the relevance and persistence of partitions over time for constructive and real networks, including hierarchical graphs and social networks. We also obtain reduced descriptions for atomic level protein structures over different time scales.
△ Less
Submitted 11 March, 2009; v1 submitted 9 December, 2008;
originally announced December 2008.