-
Knowledge Map: Toward a New Approach Supporting the Knowledge Management in Distributed Data Mining
Authors:
Nhien-An Le-Khac,
Lamine M. Aouad,
M-Tahar Kechadi
Abstract:
Distributed data mining (DDM) deals with the problem of finding patterns or models, called knowledge, in an environment with distributed data and computations. Today, a massive amounts of data which are often geographically distributed and owned by different organisation are being mined. As consequence, a large mount of knowledge are being produced. This causes problems of not only knowledge manag…
▽ More
Distributed data mining (DDM) deals with the problem of finding patterns or models, called knowledge, in an environment with distributed data and computations. Today, a massive amounts of data which are often geographically distributed and owned by different organisation are being mined. As consequence, a large mount of knowledge are being produced. This causes problems of not only knowledge management but also visualization in data mining. Besides, the main aim of DDM is to exploit fully the benefit of distributed data analysis while minimising the communication. Existing DDM techniques perform partial analysis of local data at individual sites and then generate a global model by aggregating these local results. These two steps are not independent since naive approaches to local analysis may produce an incorrect and ambiguous global data model. The integrating and cooperating of these two steps need an effective knowledge management, concretely an efficient map of knowledge in order to take the advantage of mined knowledge to guide mining the data. In this paper, we present "knowledge map", a representation of knowledge about mined knowledge. This new approach aims to manage efficiently mined knowledge in large scale distributed platform such as Grid. This knowledge map is used to facilitate not only the visualization, evaluation of mining results but also the coordinating of local mining process and existing knowledge to increase the accuracy of final model.
△ Less
Submitted 23 October, 2019;
originally announced October 2019.
-
Performance study of distributed Apriori-like frequent itemsets mining
Authors:
Lamine M. Aouad,
Nhien-An Le-Khac,
Tahar M. Kechadi
Abstract:
In this article, we focus on distributed Apriori-based frequent itemsets mining. We present a new distributed approach which takes into account inherent characteristics of this algorithm. We study the distribution aspect of this algorithm and give a comparison of the proposed approach with a classical Apriori-like distributed algorithm, using both analytical and experimental studies. We find that…
▽ More
In this article, we focus on distributed Apriori-based frequent itemsets mining. We present a new distributed approach which takes into account inherent characteristics of this algorithm. We study the distribution aspect of this algorithm and give a comparison of the proposed approach with a classical Apriori-like distributed algorithm, using both analytical and experimental studies. We find that under a wide range of conditions and datasets, the performance of a distributed Apriori-like algorithm is not related to global strategies of pruning since the performance of the local Apriori generation is usually characterized by relatively high success rates of candidate sets frequency at low levels which switch to very low rates at some stage, and often drops to zero. This means that the intermediate communication steps and remote support counts computation and collection in classical distributed schemes are computationally inefficient locally, and then constrains the global performance. Our performance evaluation is done on a large cluster of workstations using the Condor system and its workflow manager DAGMan. The results show that the presented approach greatly enhances the performance and achieves good scalability compared to a typical distributed Apriori founded algorithm.
△ Less
Submitted 21 February, 2019;
originally announced March 2019.
-
Variance-based Clustering Technique for Distributed Data Mining Applications
Authors:
Lamine M. Aouad,
Nhien-An Le-Khac,
Tahar Kechadi
Abstract:
Nowadays, huge amounts of data are naturally collected in distributed sites due to different facts and moving these data through the network for extracting useful knowledge is almost unfeasible for either technical reasons or policies. Furthermore, classical par- allel algorithms cannot be applied, specially in loosely coupled environments. This requires to develop scalable distributed algorithms…
▽ More
Nowadays, huge amounts of data are naturally collected in distributed sites due to different facts and moving these data through the network for extracting useful knowledge is almost unfeasible for either technical reasons or policies. Furthermore, classical par- allel algorithms cannot be applied, specially in loosely coupled environments. This requires to develop scalable distributed algorithms able to return the global knowledge by aggregating local results in an effective way. In this paper we propose a distributed algorithm based on independent local clustering processes and a global merging based on minimum variance increases and requires a limited communication overhead. We also introduce the notion of distributed sub-clusters perturbation to improve the global generated distribution. We show that this algorithm improves the quality of clustering compared to classical local centralized ones and is able to find real global data nature or distribution.
△ Less
Submitted 28 March, 2017;
originally announced March 2017.
-
Grid-based Approaches for Distributed Data Mining Applications
Authors:
Lamine M. Aouad,
Nhien-An Le-Khac,
Tahar Kechadi
Abstract:
The data mining field is an important source of large-scale applications and datasets which are getting more and more common. In this paper, we present grid-based approaches for two basic data mining applications, and a performance evaluation on an experimental grid environment that provides interesting monitoring capabilities and configuration tools. We propose a new distributed clustering approa…
▽ More
The data mining field is an important source of large-scale applications and datasets which are getting more and more common. In this paper, we present grid-based approaches for two basic data mining applications, and a performance evaluation on an experimental grid environment that provides interesting monitoring capabilities and configuration tools. We propose a new distributed clustering approach and a distributed frequent itemsets generation well-adapted for grid environments. Performance evaluation is done using the Condor system and its workflow manager DAGMan. We also compare this performance analysis to a simple analytical model to evaluate the overheads related to the workflow engine and the underlying grid system. This will specifically show that realistic performance expectations are currently difficult to achieve on the grid.
△ Less
Submitted 28 March, 2017;
originally announced March 2017.