-
A comparative study of similarity-based and GNN-based link prediction approaches
Authors:
Md Kamrul Islam,
Sabeur Aridhi,
Malika Smail-Tabbone
Abstract:
The task of inferring the missing links in a graph based on its current structure is referred to as link prediction. Link prediction methods that are based on pairwise node similarity are well-established approaches in the literature. They show good prediction performance in many real-world graphs, though they are heuristics and lack universal applicability. On the other hand, the success of neural networks for classification tasks in various domains has led researchers to study them on graphs. A neural network that can operate directly on a graph is termed a graph neural network (GNN). A GNN is able to learn hidden features from graphs, which can be used for the link prediction task. Link prediction approaches based on GNNs have gained much attention from researchers due to their convincingly high performance in many real-world graphs. This appraisal paper studies some similarity-based and GNN-based link prediction approaches in the domain of homogeneous graphs that consist of a single type of (attributed) nodes and a single type of pairwise links. We evaluate the studied approaches against several benchmark graphs with different properties from various domains.
Submitted 20 August, 2020;
originally announced August 2020.
-
Neighborhood-Based Label Propagation in Large Protein Graphs
Authors:
Sabeur Aridhi,
Seyed Ziaeddin Alborzi,
Malika Smaïl-Tabbone,
Marie-Dominique Devignes,
David Ritchie
Abstract:
Understanding protein function is one of the keys to understanding life at the molecular level. It is also important in several scenarios including human disease and drug discovery. In this age of rapid and affordable biological sequencing, the number of sequences accumulating in databases is rising at an increasing rate. This presents many challenges for biologists and computer scientists alike. In order to make sense of this huge quantity of data, these sequences should be annotated with functional properties. UniProtKB consists of two components: i) the UniProtKB/Swiss-Prot database containing protein sequences with reliable information manually reviewed by expert bio-curators and ii) the UniProtKB/TrEMBL database that is used for storing and processing the unknown sequences. Hence, for all proteins we have the sequence available along with some additional information such as the taxon and some structural domains. Pairwise similarity can be defined and computed on proteins based on such attributes. Other important attributes, such as function and cellular localization, are present for proteins in Swiss-Prot but often missing for proteins in TrEMBL. The enormous number of protein sequences now in TrEMBL calls for rapid procedures to annotate them automatically. In this work, we present DistNBLP, a novel Distributed Neighborhood-Based Label Propagation approach for large-scale annotation of proteins. To do this, the functional annotations of reviewed proteins are used to predict those of non-reviewed proteins using label propagation on a graph representation of the protein database. DistNBLP is built on top of the "akka" toolkit for building resilient distributed message-driven applications.
Submitted 9 August, 2017;
originally announced August 2017.
-
BLADYG: A Graph Processing Framework for Large Dynamic Graphs
Authors:
Sabeur Aridhi,
Alberto Montresor,
Yannis Velegrakis
Abstract:
Recently, distributed processing of large dynamic graphs has become very popular, especially in certain domains such as social network analysis, Web graph analysis and spatial network analysis. In this context, many distributed/parallel graph processing systems have been proposed, such as Pregel, GraphLab, and Trinity. These systems can be divided into two categories: (1) vertex-centric and (2) block-centric approaches. In vertex-centric approaches, each vertex corresponds to a process, and messages are exchanged among vertices. In block-centric approaches, the unit of computation is a block, a connected subgraph of the graph, and message exchanges occur among blocks. In this paper, we consider the issues of scale and dynamism in the case of block-centric approaches. We present BLADYG, a block-centric framework that addresses the issue of dynamism in large-scale graphs. We present an implementation of BLADYG on top of the akka framework. We experimentally evaluate the performance of the proposed framework.
Submitted 2 January, 2017;
originally announced January 2017.
-
Scalable Semi-Supervised Learning over Networks using Nonsmooth Convex Optimization
Authors:
Alexander Jung,
Alfred O. Hero III,
Alexandru Mara,
Sabeur Aridhi
Abstract:
We propose a scalable method for semi-supervised (transductive) learning from massive network-structured datasets. Our approach to semi-supervised learning is based on representing the underlying hypothesis as a graph signal with small total variation. Requiring a small total variation of the graph signal representing the underlying hypothesis corresponds to the central smoothness assumption that forms the basis for semi-supervised learning, i.e., input points forming clusters have similar output values or labels. We formulate the learning problem as a nonsmooth convex optimization problem which we solve by appealing to Nesterov's optimal first-order method for nonsmooth optimization. We also provide a message passing formulation of the learning method which allows for a highly scalable implementation in big data frameworks.
Submitted 2 November, 2016;
originally announced November 2016.
-
An Experimental Survey on Big Data Frameworks
Authors:
Wissem Inoubli,
Sabeur Aridhi,
Haithem Mezni,
Mondher Maddouri,
Engelbert Mephu Nguifo
Abstract:
Recently, increasingly large amounts of data are generated from a variety of sources. Existing data processing technologies are not suitable to cope with the huge amounts of generated data. Yet, many research works focus on Big Data, a buzzword referring to the processing of massive volumes of (unstructured) data. Recently proposed frameworks for Big Data applications help to store, analyze and process the data. In this paper, we discuss the challenges of Big Data and we survey existing Big Data frameworks. We also present an experimental evaluation and a comparative study of the most popular Big Data frameworks. This survey is concluded with a presentation of best practices related to the use of the studied frameworks in several application domains such as machine learning, graph processing and real-world applications.
Submitted 6 June, 2018; v1 submitted 31 October, 2016;
originally announced October 2016.
-
Big Graph Mining: Frameworks and Techniques
Authors:
Sabeur Aridhi,
Engelbert Mephu Nguifo
Abstract:
Big graph mining is an important research area that has attracted considerable attention. It makes it possible to process, analyze, and extract meaningful information from large amounts of graph data. Big graph mining has been highly motivated not only by the tremendously increasing size of graphs but also by its huge number of applications. Such applications include bioinformatics, chemoinformatics and social networks. One of the most challenging tasks in big graph mining is pattern mining in big graphs. This task consists in using data mining algorithms to discover interesting, unexpected and useful patterns in large amounts of graph data. It also aims to provide a deeper understanding of graph data. In this context, several graph processing frameworks and scalable data mining/pattern mining techniques have been proposed to deal with very big graphs. This paper gives an overview of existing data mining and graph processing frameworks that deal with very big graphs. Then it presents a survey of current research in the field of data mining/pattern mining in big graphs and discusses the main research issues related to this field. It also gives a categorization of distributed data mining and machine learning techniques, graph processing frameworks and large-scale pattern mining approaches.
Submitted 9 February, 2016;
originally announced February 2016.
-
Multiple instance learning for sequence data with across bag dependencies
Authors:
Manel Zoghlami,
Sabeur Aridhi,
Mondher Maddouri,
Engelbert Mephu Nguifo
Abstract:
In the Multiple Instance Learning (MIL) problem for sequence data, the instances inside the bags are sequences. In some real-world applications such as bioinformatics, comparing a random pair of sequences makes no sense. In fact, each instance may have structural and/or functional relations with instances of other bags. Thus, the classification task should take into account this across-bag relation. In this work, we present two novel MIL approaches for sequence data classification named ABClass and ABSim. ABClass extracts motifs from related instances and uses them to encode sequences. A discriminative classifier is then applied to compute a partial classification result for each set of related sequences. ABSim uses a similarity measure to discriminate the related instances and to compute a scores matrix. For both approaches, an aggregation method is applied in order to generate the final classification result. We applied both approaches to solve the problem of bacterial Ionizing Radiation Resistance prediction. The experimental results of the presented approaches are satisfactory.
Submitted 11 June, 2020; v1 submitted 30 January, 2016;
originally announced February 2016.
-
Towards a constructive multilayer perceptron for regression task using non-parametric clustering. A case study of Photo-Z redshift reconstruction
Authors:
Cyrine Arouri,
Engelbert Mephu Nguifo,
Sabeur Aridhi,
Cécile Roucelle,
Gaelle Bonnet-Loosli,
Norbert Tsopzé
Abstract:
The choice of architecture for an artificial neural network (ANN) is still a challenging task that users face every time. It greatly affects the accuracy of the built network. In fact, there is no optimal method that is applicable to various implementations at the same time. In this paper we propose a method to construct an ANN based on clustering that resolves the problems of random and ad hoc approaches for multilayer ANN architecture. Our method can be applied to regression problems. Experimental results obtained with different datasets reveal the efficiency of our method.
Submitted 17 December, 2014;
originally announced December 2014.
-
A large-scale and fault-tolerant approach of subgraph mining using density-based partitioning
Authors:
Sabeur Aridhi,
Laurent d'Orazio,
Mondher Maddouri,
Engelbert Mephu Nguifo
Abstract:
Recently, graph mining approaches have become very popular, especially in domains such as bioinformatics, chemoinformatics and social networks. In this scope, one of the most challenging tasks is frequent subgraph discovery. This task has been motivated by the tremendously increasing size of existing graph databases. As a result, designing efficient and scalable approaches for frequent subgraph discovery in large clusters has become an important problem. However, failures are the norm rather than the exception in large clusters. In this context, the MapReduce framework was designed so that node failures are automatically handled by the framework. In this paper, we propose a large-scale and fault-tolerant approach to subgraph mining by means of a density-based partitioning technique, using MapReduce. Our partitioning aims to balance the computation load across a collection of machines. We experimentally show that our approach significantly decreases the execution time and scales the subgraph discovery process to large graph databases.
Submitted 4 December, 2012; v1 submitted 30 November, 2012;
originally announced December 2012.
-
Feature extraction in protein sequences classification : a new stability measure
Authors:
Rabie Saidi,
Sabeur Aridhi,
Mondher Maddouri,
Engelbert Mephu Nguifo
Abstract:
Feature extraction is an unavoidable task, especially in the critical step of preprocessing biological sequences. This step consists for example in transforming the biological sequences into vectors of motifs where each motif is a subsequence that can be seen as a property (or attribute) characterizing the sequence. Hence, we obtain an object-property table where objects are sequences and properties are motifs extracted from sequences. This output can be used to apply standard machine learning tools to perform data mining tasks such as classification. Several previous works have described feature extraction methods for bio-sequence classification, but none of them discussed the robustness of these methods when perturbing the input data. In this work, we introduce the notion of stability of the generated motifs in order to study the robustness of motif extraction methods. We express this robustness in terms of the ability of the method to reveal any change occurring in the input data and also its ability to target the interesting motifs. We use these criteria to evaluate and experimentally compare four existing extraction methods for biological sequences.
Submitted 5 December, 2012; v1 submitted 21 June, 2012;
originally announced June 2012.
-
Optimization of automatically generated multi-core code for the LTE RACH-PD algorithm
Authors:
Maxime Pelcat,
Slaheddine Aridhi,
Jean François Nezan
Abstract:
Embedded real-time applications in communication systems require high processing power. Manual scheduling developed for single-processor applications is not suited to multi-core architectures. The Algorithm Architecture Matching (AAM) methodology optimizes static application implementation on multi-core architectures. The Random Access Channel Preamble Detection (RACH-PD) is an algorithm for non-synchronized access of Long Term Evolution (LTE) wireless networks. LTE aims to improve the spectral efficiency of the next generation cellular system. This paper describes a complete methodology for implementing the RACH-PD. AAM prototyping is applied to the RACH-PD which is modelled as a Synchronous DataFlow graph (SDF). An efficient implementation of the algorithm onto a multi-core DSP, the TI C6487, is then explained. Benchmarks for the solution are given.
Submitted 4 November, 2008;
originally announced November 2008.