-
Comparing the hierarchy of keywords in on-line news portals
Authors:
Gergely Tibély,
David Sousa-Rodrigues,
Péter Pollner,
Gergely Palla
Abstract:
The tagging of on-line content with informative keywords is a widespread phenomenon from scientific article repositories through blogs to on-line news portals. In most cases, the tags on a given item are free words chosen by the authors independently. Therefore, the relations among keywords in a collection of news items are unknown. However, in most cases the topics and concepts described by these keywords form a latent hierarchy, with the more general topics and categories at the top, and more specialised ones at the bottom. Here we apply a recent, cooccurrence-based tag hierarchy extraction method to sets of keywords obtained from four different on-line news portals. The resulting hierarchies show substantial differences not just in the topics rendered as important (being at the top of the hierarchy) or of less interest (categorised low in the hierarchy), but also in the underlying network structure. This reveals discrepancies between the plausible keyword association frameworks in the studied news portals.
Submitted 20 June, 2016;
originally announced June 2016.
-
Partial order similarity based on mutual information
Authors:
Gergely Tibély,
Péter Pollner,
Gergely Palla
Abstract:
Comparing the ranking of candidates by different voters is an important topic in social and information science with a high relevance from the point of view of practical applications. In general, ties and pairs of incomparable candidates may occur; thus, the alternative rankings are described by partial orders. Various distance measures between partial orders have already been introduced, where zero distance corresponds to a perfect match between a pair of partial orders, and larger values signal greater differences. Here we take a different approach and propose a similarity measure based on adjusted mutual information. In general, a similarity value of unity corresponds to exactly matching partial orders, while a low similarity is associated with a pair of independent partial orders. The time complexity of the computation of this similarity measure is $\mathcal{O}(\left|{\mathcal C}\right|^3)$ in the worst case, and $\mathcal{O}(\left|{\mathcal C}\right|^2\ln \left|{\mathcal C}\right|)$ in the typical case of partial orders corresponding to trees with constant branching number, where $\left|{\mathcal C}\right|$ denotes the number of candidates. An interesting feature of our approach is that the similarity measure is sensitive to the position of the disagreements in the ranking: differences among the highly ranked candidates induce a larger drop in similarity than disagreements among the bottom candidates.
Submitted 22 January, 2016;
originally announced January 2016.
-
Comparing the hierarchy of author given tags and repository given tags in a large document archive
Authors:
Gergely Tibély,
Péter Pollner,
Gergely Palla
Abstract:
Folksonomies - large databases arising from collaborative tagging of items by independent users - are becoming an increasingly important way of categorizing information. In these systems users can tag items with free words, resulting in a tripartite item-tag-user network. Although there are no prescribed relations between tags, the way users think about the different categories presumably has some built-in hierarchy, in which more special concepts are descendants of some more general categories. Several applications would benefit from the knowledge of this hierarchy. Here we apply a recent method to check the differences and similarities of hierarchies resulting from tags given by independent individuals and from tags given by a centrally managed repository system. The results from our method showed substantial differences between the lower parts of the hierarchies and, in contrast, a relatively high similarity at the top of the hierarchies.
Submitted 30 June, 2015;
originally announced July 2015.
-
Hierarchical networks of scientific journals
Authors:
Gergely Palla,
Gergely Tibély,
Enys Mones,
Péter Pollner,
Tamás Vicsek
Abstract:
Scientific journals are the repositories of the gradually accumulating knowledge of mankind about the world surrounding us. Just as our knowledge is organised into classes ranging from major disciplines, subjects and fields to increasingly specific topics, journals can also be categorised into groups using various metrics. In addition to the set of topics characteristic for a journal, they can also be ranked regarding their relevance from the point of overall influence. One widespread measure is impact factor, but in the present paper we intend to reconstruct a much more detailed description by studying the hierarchical relations between the journals based on citation data. We use a measure related to the notion of m-reaching centrality and find a network which shows the level of influence of a journal from the point of the direction and efficiency with which information spreads through the network. We can also obtain an alternative network using a suitably modified nested hierarchy extraction method applied to the same data. The results are weakly methodology-dependent and reveal non-trivial relations among journals. The two alternative hierarchies show large similarity with some striking differences, providing together a complex picture of the intricate relations between scientific journals.
Submitted 12 August, 2015; v1 submitted 18 June, 2015;
originally announced June 2015.
-
Extracting tag hierarchies
Authors:
Gergely Tibély,
Péter Pollner,
Tamás Vicsek,
Gergely Palla
Abstract:
Tagging items with descriptive annotations or keywords is a very natural way to compress and highlight information about the properties of the given entity. Over the years several methods have been proposed for extracting a hierarchy between the tags for systems with a "flat", egalitarian organization of the tags, which is very common when the tags correspond to free words given by numerous independent people. Here we present a complete framework for automated tag hierarchy extraction based on tag occurrence statistics. Along with proposing new algorithms, we are also introducing different quality measures enabling the detailed comparison of competing approaches from different aspects. Furthermore, we set up a synthetic, computer generated benchmark providing a versatile tool for testing, with a couple of tunable parameters capable of generating a wide range of test beds. Beside the computer generated input we also use real data in our studies, including a biological example with a pre-defined hierarchy between the tags. The encouraging similarity between the pre-defined and reconstructed hierarchy, as well as the seemingly meaningful hierarchies obtained for other real systems indicate that tag hierarchy extraction is a very promising direction for further research with a great potential for practical applications.
Submitted 22 January, 2014;
originally announced January 2014.
-
Ontologies and tag-statistics
Authors:
Gergely Tibely,
Peter Pollner,
Tamas Vicsek,
Gergely Palla
Abstract:
Due to the increasing popularity of collaborative tagging systems, the research on tagged networks, hypergraphs, ontologies, folksonomies and other related concepts is becoming an important interdisciplinary topic with great actuality and relevance for practical applications. In most collaborative tagging systems the tagging by the users is completely "flat", while in some cases they are allowed to define a shallow hierarchy for their own tags. However, usually no overall hierarchical organisation of the tags is given, and one of the interesting challenges of this area is to provide an algorithm generating the ontology of the tags from the available data. In contrast, there are also other types of tagged networks available for research, where the tags are already organised into a directed acyclic graph (DAG), encapsulating the "is a sub-category of" type of hierarchy between each other. In this paper we study how this DAG affects the statistical distribution of tags on the nodes marked by the tags in various real networks. We analyse the relation between the tag-frequency and the position of the tag in the DAG in two large sub-networks of the English Wikipedia and a protein-protein interaction network. We also study the tag co-occurrence statistics by introducing a 2d tag-distance distribution preserving both the difference in the levels and the absolute distance in the DAG for the co-occurring pairs of tags. Our most interesting finding is that the local relevance of tags in the DAG (i.e., their rank or significance as characterised by, e.g., the length of the branches starting from them) is much more important than their global distance from the root. Furthermore, we also introduce a simple tagging model based on random walks on the DAG, capable of reproducing the main statistical features of tag co-occurrence.
Submitted 5 January, 2012;
originally announced January 2012.
-
Criterions for locally dense subgraphs
Authors:
Gergely Tibély
Abstract:
Community detection is one of the most investigated problems in the field of complex networks. Although several methods were proposed, there is still no precise definition of communities. As a step towards a definition, I highlight two necessary properties of communities, separation and internal cohesion, the latter being a new concept. I propose a local method of community detection based on two-dimensional local optimization, which I tested on common benchmarks and on the word association database.
Submitted 25 August, 2011; v1 submitted 17 March, 2011;
originally announced March 2011.
-
Communities and beyond: mesoscopic analysis of a large social network with complementary methods
Authors:
Gergely Tibely,
Lauri Kovanen,
Marton Karsai,
Kimmo Kaski,
Janos Kertesz,
Jari Saramaki
Abstract:
Community detection methods have so far been tested mostly on small empirical networks and on synthetic benchmarks. Much less is known about their performance on large real-world networks, which nonetheless are a significant target for application. We analyze the performance of three state-of-the-art community detection methods by using them to identify communities in a large social network constructed from mobile phone call records. We find that all methods detect communities that are meaningful in some respects but fall short in others, and that there often is a hierarchical relationship between communities detected by different methods. Our results suggest that community detection methods could be useful in studying the general mesoscale structure of networks, as opposed to only trying to identify dense structures.
Submitted 8 June, 2011; v1 submitted 2 June, 2010;
originally announced June 2010.
-
Note on the equivalence of the label propagation method of community detection and a Potts model approach
Authors:
Gergely Tibely,
Janos Kertesz
Abstract:
We show that the recently introduced label propagation method for detecting communities in complex networks is equivalent to finding the local minima of a simple Potts model. When applied to empirical data, the number of such local minima was found to be very high, much larger than the number of nodes in the graph. The aggregation method for combining information from several local minima shows a tendency to fragment the communities into very small pieces.
Submitted 31 March, 2008; v1 submitted 19 March, 2008;
originally announced March 2008.
-
Spectral methods and cluster structure in correlation-based networks
Authors:
Tapio Heimo,
Gergely Tibely,
Jari Saramaki,
Kimmo Kaski,
Janos Kertesz
Abstract:
We investigate how in complex systems the eigenpairs of the matrices derived from the correlations of multichannel observations reflect the cluster structure of the underlying networks. For this we use daily return data from the NYSE and focus specifically on the spectral properties of weight W_{ij} = |C|_{ij} - δ_{ij} and diffusion matrices D_{ij} = W_{ij}/s_j - δ_{ij}, where C_{ij} is the correlation matrix and s_i = \sum_j W_{ij} the strength of node i. The eigenvalues (and corresponding eigenvectors) of the weight matrix are ranked in descending order. In accord with the earlier observations the first eigenvector stands for a measure of the market correlations. Its components are to first approximation equal to the strengths of the nodes and there is a second order, roughly linear, correction. The high ranking eigenvectors, excluding the highest ranking one, are usually assigned to market sectors and industrial branches. Our study shows that both for weight and diffusion matrices the eigenpair analysis is not capable of easily deducing the cluster structure of the network without a priori knowledge. In addition we have studied the clustering of stocks using the asset graph approach with and without spectrum based noise filtering. It turns out that asset graphs are quite insensitive to noise and there is no sharp percolation transition as a function of the ratio of bonds included, thus no natural threshold value for that ratio seems to exist. We suggest that these observations can be of use for other correlation based networks as well.
Submitted 14 August, 2007;
originally announced August 2007.
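As an illustration of the weight and diffusion matrices defined in the abstract above, the following is a minimal sketch rather than the authors' code. It assumes that |C|_{ij} means the element-wise absolute value of the correlation matrix and that the diagonal of C is 1; the function name `weight_and_diffusion` and the toy matrix are hypothetical.

```python
def weight_and_diffusion(corr):
    """Return (W, s, D) with W_ij = |C_ij| - delta_ij and D_ij = W_ij / s_j - delta_ij.

    Assumption: |C|_ij is read as the element-wise absolute value of C.
    """
    n = len(corr)
    # Weight matrix: absolute correlations with the unit diagonal subtracted.
    weight = [[abs(corr[i][j]) - (1.0 if i == j else 0.0) for j in range(n)]
              for i in range(n)]
    # Node strength s_i = sum_j W_ij.
    strength = [sum(row) for row in weight]
    # Diffusion matrix: each column j of W is normalised by s_j, then the
    # identity is subtracted.
    diffusion = [[weight[i][j] / strength[j] - (1.0 if i == j else 0.0)
                  for j in range(n)]
                 for i in range(n)]
    return weight, strength, diffusion


# Hypothetical 3x3 correlation matrix, for illustration only.
corr = [[1.0, 0.5, -0.25],
        [0.5, 1.0, 0.25],
        [-0.25, 0.25, 1.0]]
weight, strength, diffusion = weight_and_diffusion(corr)
```

Because W is symmetric here, every column of D sums to zero, which is a quick sanity check on the construction.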
-
Spectrum, Intensity and Coherence in Weighted Networks of a Financial Market
Authors:
G. Tibely,
J. -P. Onnela,
J. Saramaki,
K. Kaski,
J. Kertesz
Abstract:
We construct a correlation matrix based financial network for a set of New York Stock Exchange (NYSE) traded stocks with stocks corresponding to nodes and the links between them added one after the other, according to the strength of the correlation between the nodes. The eigenvalue spectrum of the correlation matrix reflects the structure of the market, which also shows in the cluster structure of the emergent network. The stronger and more compact a cluster is, the earlier the eigenvalue representing the corresponding business sector occurs in the spectrum. On the other hand, if groups of stocks belonging to a given business sector are considered as a fully connected subgraph of the final network, their intensity and coherence can be monitored as a function of time. This approach indicates to what extent the business sector classifications are visible in market prices, which in turn enables us to gauge the extent of group-behaviour exhibited by stocks belonging to a given business sector.
Submitted 23 March, 2006;
originally announced March 2006.
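The network construction described in the abstract above, where links are added one after the other according to correlation strength, can be sketched roughly as follows. This is an illustrative sketch, not the authors' implementation: the function name `asset_graph_growth`, the toy correlation matrix, and the choice to track connected components with a union-find structure are all assumptions made for the example.

```python
def asset_graph_growth(corr):
    """Add links in order of decreasing correlation and record the growth.

    Returns a list of (i, j, correlation, number_of_connected_components)
    tuples, one per added link.
    """
    n = len(corr)
    # All unordered stock pairs, strongest correlation first.
    pairs = sorted(((corr[i][j], i, j)
                    for i in range(n) for j in range(i + 1, n)),
                   reverse=True)
    parent = list(range(n))  # union-find forest over the nodes

    def find(x):
        # Follow parent pointers to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    components = n
    history = []
    for c, i, j in pairs:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            # The new link merges two previously separate components.
            parent[root_i] = root_j
            components -= 1
        history.append((i, j, c, components))
    return history


# Hypothetical correlations for four stocks forming two tight pairs.
corr = [[1.0, 0.9, 0.2, 0.1],
        [0.9, 1.0, 0.3, 0.15],
        [0.2, 0.3, 1.0, 0.8],
        [0.1, 0.15, 0.8, 1.0]]
history = asset_graph_growth(corr)
```

In this toy input the two strongly correlated pairs are linked first, and the components merge into one only when a weaker cross-pair link is added.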
-
The effect of disorder on the hierarchical modularity in complex systems
Authors:
David Nagy,
Gergely Tibely,
Janos Kertesz
Abstract:
We consider a system hierarchically modular, if besides its hierarchical structure it shows a sequence of scale separations from the point of view of some functionality or property. Starting from regular, deterministic objects like the Vicsek snowflake or the deterministic scale free network by Ravasz et al. we first characterize the hierarchical modularity by the periodicity of some properties on a logarithmic scale indicating separation of scales. Then we introduce randomness by keeping the scale freeness and other important characteristics of the objects and monitor the changes in the modularity. In the presented examples a sufficient amount of randomness destroys hierarchical modularity. Our findings suggest that the experimentally observed hierarchical modularity in systems with algebraically decaying clustering coefficients indicates a limited level of randomness.
Submitted 27 September, 2005; v1 submitted 16 June, 2005;
originally announced June 2005.