doi 10.1016/j.future.2018.04.032

An Experimental Survey on Big Data Frameworks

Authors: Wissem Inoubli, Sabeur Aridhi, Haithem Mezni, Mondher Maddouri, Engelbert Mephu Nguifo

Abstract: Recently, increasingly large amounts of data are generated from a variety of sources. Existing data processing technologies are not suitable to cope with the huge amounts of generated data. Yet, many research works focus on Big Data, a buzzword referring to the processing of massive volumes of (unstructured) data. Recently proposed frameworks for Big Data applications help to store, analyze and process the data. In this paper, we discuss the challenges of Big Data and we survey existing Big Data frameworks. We also present an experimental evaluation and a comparative study of the most popular Big Data frameworks. This survey is concluded with a presentation of best practices related to the use of the studied frameworks in several application domains such as machine learning, graph processing and real-world applications.

Submitted 6 June, 2018; v1 submitted 31 October, 2016; originally announced October 2016.

arXiv:1602.00163 [pdf]

doi 10.1007/s13042-019-01021-5

Multiple instance learning for sequence data with across bag dependencies

Authors: Manel Zoghlami, Sabeur Aridhi, Mondher Maddouri, Engelbert Mephu Nguifo

Abstract: In Multiple Instance Learning (MIL) problem for sequence data, the instances inside the bags are sequences. In some real world applications such as bioinformatics, comparing a random couple of sequences makes no sense. In fact, each instance may have structural and/or functional relations with instances of other bags. Thus, the classification task should take into account this across bag relation. In this work, we present two novel MIL approaches for sequence data classification named ABClass and ABSim. ABClass extracts motifs from related instances and use them to encode sequences. A discriminative classifier is then applied to compute a partial classification result for each set of related sequences. ABSim uses a similarity measure to discriminate the related instances and to compute a scores matrix. For both approaches, an aggregation method is applied in order to generate the final classification result. We applied both approaches to solve the problem of bacterial Ionizing Radiation Resistance prediction. The experimental results of the presented approaches are satisfactory.

Submitted 11 June, 2020; v1 submitted 30 January, 2016; originally announced February 2016.

arXiv:1212.0017

doi 10.1016/j.is.2013.08.005

A large-scale and fault-tolerant approach of subgraph mining using density-based partitioning

Authors: Sabeur Aridhi, Laurent d'Orazio, Mondher Maddouri, Engelbert Mephu Nguifo

Abstract: Recently, graph mining approaches have become very popular, especially in domains such as bioinformatics, chemoinformatics and social networks. In this scope, one of the most challenging tasks is frequent subgraph discovery. This task has been motivated by the tremendously increasing size of existing graph databases. Since then, an important problem of designing efficient and scaling approaches for frequent subgraph discovery in large clusters, has taken place. However, failures are a norm rather than being an exception in large clusters. In this context, the MapReduce framework was designed so that node failures are automatically handled by the framework. In this paper, we propose a large-scale and fault-tolerant approach of subgraph mining by means of a density-based partitioning technique, using MapReduce. Our partitioning aims to balance computation load on a collection of machines. We experimentally show that our approach decreases significantly the execution time and scales the subgraph discovery process to large graph databases.

Submitted 4 December, 2012; v1 submitted 30 November, 2012; originally announced December 2012.

Comments: The paper is under reviewing and we want to cancel the submission. Thank you for your understanding

arXiv:1206.4822

doi 10.1145/2382936.2383060

Feature extraction in protein sequences classification : a new stability measure

Authors: Rabie Saidi, Sabeur Aridhi, Mondher Maddouri, Engelbert Mephu Nguifo

Abstract: Feature extraction is an unavoidable task, especially in the critical step of preprocessing biological sequences. This step consists for example in transforming the biological sequences into vectors of motifs where each motif is a subsequence that can be seen as a property (or attribute) characterizing the sequence. Hence, we obtain an object-property table where objects are sequences and properties are motifs extracted from sequences. This output can be used to apply standard machine learning tools to perform data mining tasks such as classification. Several previous works have described feature extraction methods for bio-sequence classification, but none of them discussed the robustness of these methods when perturbing the input data. In this work, we introduce the notion of stability of the generated motifs in order to study the robustness of motif extraction methods. We express this robustness in terms of the ability of the method to reveal any change occurring in the input data and also its ability to target the interesting motifs. We use these criteria to evaluate and experimentally compare four existing extraction methods for biological sequences.

Submitted 5 December, 2012; v1 submitted 21 June, 2012; originally announced June 2012.

Comments: The paper has been accepted by the ACM Conference on Bioinformatics, Computational Biology and Biomedicine (ACM BCB) 2012. We want to cancel the submission because of the double entries of the paper in DBLP. Thank you for your understanding

Showing 1–4 of 4 results for author: Maddouri, M