Summarizing Labeled Multi-Graphs

Authors: Dimitris Berberidis, Pierre J. Liang, Leman Akoglu

Abstract: Real-world graphs can be difficult to interpret and visualize beyond a certain size. To address this issue, graph summarization aims to simplify and shrink a graph, while maintaining its high-level structure and characteristics. Most summarization methods are designed for homogeneous, undirected, simple graphs; however, many real-world graphs are ornate; with characteristics including node labels, directed edges, edge multiplicities, and self-loops. In this paper we propose LM-Gsum, a versatile yet rigorous graph summarization model that (to the best of our knowledge, for the first time) can handle graphs with all the aforementioned characteristics (and any combination thereof). Moreover, our proposed model captures basic sub-structures that are prevalent in real-world graphs, such as cliques, stars, etc. LM-Gsum compactly quantifies the information content of a complex graph using a novel encoding scheme, where it seeks to minimize the total number of bits required to encode (i) the summary graph, as well as (ii) the corrections required for reconstructing the input graph losslessly. To accelerate the summary construction, it creates super-nodes efficiently by merging nodes in groups. Experiments demonstrate that LM-Gsum facilitates the visualization of real-world complex graphs, revealing interpretable structures and high-level relationships. Furthermore, LM-Gsum achieves better trade-off between compression rate and running time, relative to existing methods (only) on comparable settings.

Submitted 15 June, 2022; originally announced June 2022.

Comments: 17 pages, 8 figures, 4 tables

arXiv:1910.09589 [pdf, other]

GraphSAC: Detecting anomalies in large-scale graphs

Authors: Vassilis N. Ioannidis, Dimitris Berberidis, Georgios B. Giannakis

Abstract: A graph-based sampling and consensus (GraphSAC) approach is introduced to effectively detect anomalous nodes in large-scale graphs. Existing approaches rely on connectivity and attributes of all nodes to assign an anomaly score per node. However, nodal attributes and network links might be compromised by adversaries, rendering these holistic approaches vulnerable. Alleviating this limitation, GraphSAC randomly draws subsets of nodes, and relies on graph-aware criteria to judiciously filter out sets contaminated by anomalous nodes, before employing a semi-supervised learning (SSL) module to estimate nominal label distributions per node. These learned nominal distributions are minimally affected by the anomalous nodes, and hence can be directly adopted for anomaly detection. Rigorous analysis provides performance guarantees for GraphSAC, by bounding the required number of draws. The per-draw complexity grows linearly with the number of edges, which implies efficient SSL, while draws can be run in parallel, thereby ensuring scalability to large graphs. GraphSAC is tested under different anomaly generation models based on random walks, clustered anomalies, as well as contemporary adversarial attacks for graph data. Experiments with real-world graphs showcase the advantage of GraphSAC relative to state-of-the-art alternatives.

Submitted 21 October, 2019; originally announced October 2019.

arXiv:1811.10797 [pdf, other]

Node Embedding with Adaptive Similarities for Scalable Learning over Graphs

Authors: Dimitris Berberidis, Georgios B. Giannakis

Abstract: Node embedding is the task of extracting informative and descriptive features over the nodes of a graph. The importance of node embeddings for graph analytics, as well as learning tasks such as node classification, link prediction and community detection, has led to increased interest on the problem leading to a number of recent advances. Much like PCA in the feature domain, node embedding is an inherently unsupervised task; in lack of metadata used for validation, practical methods may require standardization and limiting the use of tunable hyperparameters. Finally, node embedding methods are faced with maintaining scalability in the face of large-scale real-world graphs of ever-increasing sizes. In the present work, we propose an adaptive node embedding framework that adjusts the embedding process to a given underlying graph, in a fully unsupervised manner. To achieve this, we adopt the notion of a tunable node similarity matrix that assigns weights on paths of different length. The design of the multilength similarities ensures that the resulting embeddings also inherit interpretable spectral properties. The proposed model is carefully studied, interpreted, and numerically evaluated using stochastic block models. Moreover, an algorithmic scheme is proposed for training the model parameters efficiently and in an unsupervised manner. We perform extensive node classification, link prediction, and clustering experiments on many real world graphs from various domains, and compare with state-of-the-art scalable and unsupervised node embedding alternatives. The proposed method enjoys superior performance in many cases, while also yielding interpretable information on the underlying structure of the graph.

Submitted 13 June, 2019; v1 submitted 26 November, 2018; originally announced November 2018.

arXiv:1804.02081 [pdf, other]

doi 10.1109/TSP.2018.2889984

Adaptive Diffusions for Scalable Learning over Graphs

Authors: Dimitris Berberidis, Athanasios N. Nikolakopoulos, Georgios B. Giannakis

Abstract: Diffusion-based classifiers such as those relying on the Personalized PageRank and the Heat kernel, enjoy remarkable classification accuracy at modest computational requirements. Their performance however is affected by the extent to which the chosen diffusion captures a typically unknown label propagation mechanism, that can be specific to the underlying graph, and potentially different for each class. The present work introduces a disciplined, data-efficient approach to learning class-specific diffusion functions adapted to the underlying network topology. The novel learning approach leverages the notion of "landing probabilities" of class-specific random walks, which can be computed efficiently, thereby ensuring scalability to large graphs. This is supported by rigorous analysis of the properties of the model as well as the proposed algorithms. Furthermore, a robust version of the classifier facilitates learning even in noisy environments. Classification tests on real networks demonstrate that adapting the diffusion function to the given graph and observed labels, significantly improves the performance over fixed diffusions; reaching -- and many times surpassing -- the classification accuracy of computationally heavier state-of-the-art competing methods, that rely on node embeddings and deep neural networks.

Submitted 8 January, 2019; v1 submitted 5 April, 2018; originally announced April 2018.

arXiv:1612.08263 [pdf, ps, other]

Decentralized RLS with Data-Adaptive Censoring for Regressions over Large-Scale Networks

Authors: Zifeng Wang, Zheng Yu, Qing Ling, Dimitris Berberidis, Georgios B. Giannakis

Abstract: The deluge of networked data motivates the development of algorithms for computation- and communication-efficient information processing. In this context, three data-adaptive censoring strategies are introduced to considerably reduce the computation and communication overhead of decentralized recursive least-squares (D-RLS) solvers. The first relies on alternating minimization and the stochastic Newton iteration to minimize a network-wide cost, which discards observations with small innovations. In the resultant algorithm, each node performs local data-adaptive censoring to reduce computations, while exchanging its local estimate with neighbors so as to consent on a network-wide solution. The communication cost is further reduced by the second strategy, which prevents a node from transmitting its local estimate to neighbors when the innovation it induces to incoming data is minimal. In the third strategy, not only transmitting, but also receiving estimates from neighbors is prohibited when data-adaptive censoring is in effect. For all strategies, a simple criterion is provided for selecting the threshold of innovation to reach a prescribed average data reduction. The novel censoring-based (C)D-RLS algorithms are proved convergent to the optimal argument in the mean-square deviation sense. Numerical experiments validate the effectiveness of the proposed algorithms in reducing computation and communication overhead.

Submitted 12 January, 2018; v1 submitted 25 December, 2016; originally announced December 2016.

Comments: Part of this paper has appeared at the 42nd Intl. Conf. on Acoustics, Speech, and Signal Processing, New Orleans, USA, March 5-9, 2017

arXiv:1606.08136 [pdf, other]

doi 10.1109/TSP.2017.2691662

Data Sketching for Large-Scale Kalman Filtering

Authors: Dimitris Berberidis, Georgios B. Giannakis

Abstract: In an age of exponentially increasing data generation, performing inference tasks by utilizing the available information in its entirety is not always an affordable option. The present paper puts forth approaches to render tracking of large-scale dynamic processes via a Kalman filter affordable, by processing a reduced number of data. Three distinct methods are introduced for reducing the number of data involved in the correction step of the filter. Towards this goal, the first two methods employ random projections and innovation-based censoring to effect dimensionality reduction and measurement selection respectively. The third method achieves reduced complexity by leveraging sequential processing of observations and selecting a few informative updates based on an information-theoretic metric. Simulations on synthetic data, compare the proposed methods with competing alternatives, and corroborate their efficacy in terms of estimation accuracy over complexity reduction. Finally, monitoring large networks is considered as an application domain, with the proposed methods tested on Kronecker graphs to evaluate their efficiency in tracking traffic matrices and time-varying link costs.

Submitted 6 January, 2017; v1 submitted 27 June, 2016; originally announced June 2016.

arXiv:1601.07947 [pdf, ps, other]

Large-scale Kernel-based Feature Extraction via Budgeted Nonlinear Subspace Tracking

Authors: Fatemeh Sheikholeslami, Dimitris Berberidis, Georgios B. Giannakis

Abstract: Kernel-based methods enjoy powerful generalization capabilities in handling a variety of learning tasks. When such methods are provided with sufficient training data, broadly-applicable classes of nonlinear functions can be approximated with desired accuracy. Nevertheless, inherent to the nonparametric nature of kernel-based estimators are computational and memory requirements that become prohibitive with large-scale datasets. In response to this formidable challenge, the present work puts forward a low-rank, kernel-based, feature extraction approach that is particularly tailored for online operation, where data streams need not be stored in memory. A novel generative model is introduced to approximate high-dimensional (possibly infinite) features via a low-rank nonlinear subspace, the learning of which leads to a direct kernel function approximation. Offline and online solvers are developed for the subspace learning task, along with affordable versions, in which the number of stored data vectors is confined to a predefined budget. Analytical results provide performance bounds on how well the kernel matrix as well as kernel-based classification and regression tasks can be approximated by leveraging budgeted online subspace learning and feature extraction schemes. Tests on synthetic and real datasets demonstrate and benchmark the efficiency of the proposed method when linear classification and regression is applied to the extracted features.

Submitted 26 December, 2017; v1 submitted 28 January, 2016; originally announced January 2016.

Showing 1–7 of 7 results for author: Berberidis, D