Search | arXiv e-print repository

Benchmark Data Contamination of Large Language Models: A Survey

Authors: Cheng Xu, Shuhao Guan, Derek Greene, M-Tahar Kechadi

Abstract: The rapid development of Large Language Models (LLMs) like GPT-4, Claude-3, and Gemini has transformed the field of natural language processing. However, it has also resulted in a significant issue known as Benchmark Data Contamination (BDC). This occurs when language models inadvertently incorporate evaluation benchmark information from their training data, leading to inaccurate or unreliable performance during the evaluation phase of the process. This paper reviews the complex challenge of BDC in LLM evaluation and explores alternative assessment methods to mitigate the risks associated with traditional benchmarks. The paper also examines challenges and future directions in mitigating BDC risks, highlighting the complexity of the issue and the need for innovative solutions to ensure the reliability of LLM evaluation in real-world applications.

Submitted 6 June, 2024; originally announced June 2024.

Comments: 31 pages, 7 figures, 3 tables

arXiv:2403.17648 [pdf, ps, other]

Healthcare Data Governance, Privacy, and Security -- A Conceptual Framework

Authors: Amen Faridoon, M. Tahar Kechadi

Abstract: The abundance of data has transformed the world in every aspect. It has become the core element in decision making, problem solving, and innovation in almost all areas of life, including business, science, healthcare, education, and many others. Despite all these advances, privacy and security remain critical concerns of the healthcare industry. It is important to note that healthcare data can also be a liability if it is not managed correctly. This data mismanagement can have severe consequences for patients and healthcare organisations, including patient safety, legal liability, damage to reputation, financial loss, and operational inefficiency. Healthcare organisations must comply with a range of regulations to protect patient data. We perform a classification of data governance elements or components in a manner that thoroughly assesses the healthcare data chain from a privacy and security standpoint. After deeply analysing the existing literature, we propose a conceptual privacy and security driven healthcare data governance framework.

Submitted 26 March, 2024; originally announced March 2024.

arXiv:2307.06779 [pdf, other]

Data Behind the Walls An Advanced Architecture for Data Privacy Management

Authors: Amen Faridoon, M. Tahar Kechadi

Abstract: In today's highly connected society, we are constantly asked to provide personal information to retailers, voter surveys, medical professionals, and other data collection efforts. The collected data is stored in large data warehouses. Organisations and statistical agencies share and use this data to facilitate research in public health, economics, sociology, etc. However, this data contains sensitive information about individuals, which can result in identity theft, financial loss, stress and depression, embarrassment, abuse, etc. Therefore, one must ensure rigorous management of individuals' privacy. We propose an advanced data privacy management architecture composed of three layers: the data management layer, which consists of de-identification and anonymisation; the access management layer, which enforces data access based on the concepts of Role-Based Access Control and the Chinese Wall Security Policy; and the roles layer, which regulates different users. The proposed system architecture is validated on healthcare datasets.

Submitted 13 July, 2023; originally announced July 2023.

Comments: 7 pages

arXiv:2306.11946 [pdf, other]

doi 10.1109/CSCI58124.2022.00040

Winter Wheat Crop Yield Prediction on Multiple Heterogeneous Datasets using Machine Learning

Authors: Yogesh Bansal, Dr. David Lillis, Prof. Mohand Tahar Kechadi

Abstract: Winter wheat is one of the most important crops in the United Kingdom, and crop yield prediction is essential for the nation's food security. Several studies have employed machine learning (ML) techniques to predict crop yield on a county or farm-based level. The main objective of this study is to predict winter wheat crop yield using ML models on multiple heterogeneous datasets, i.e., soil and weather on a zone-based level. Experimental results demonstrated their impact when used alone and in combination. In addition, we employ numerous ML algorithms to emphasize the significance of data quality in any machine-learning strategy.

Submitted 20 June, 2023; originally announced June 2023.

Journal ref: International Conference on Computational Science and Computational Intelligence (CSCI 2022)

arXiv:2306.11942

A Deep Learning Model for Heterogeneous Dataset Analysis -- Application to Winter Wheat Crop Yield Prediction

Authors: Yogesh Bansal, David Lillis, Mohand Tahar Kechadi

Abstract: Western countries rely heavily on wheat, and yield prediction is crucial. Time-series deep learning models, such as Long Short Term Memory (LSTM), have already been explored and applied to yield prediction. Existing literature reported that they perform better than traditional Machine Learning (ML) models. However, the existing LSTM cannot handle heterogeneous datasets (a combination of data which varies and remains static with time). In this paper, we propose an efficient deep learning model that can deal with heterogeneous datasets. We developed the system architecture and applied it to the real-world dataset in the digital agriculture area. We showed that it outperforms the existing ML models.

Submitted 20 June, 2023; originally announced June 2023.

Comments: This version has been removed by arXiv administrators because the submitter did not have the authority to grant the license at the time of submission

arXiv:2103.11271 [pdf]

doi 10.1007/s10791-020-09384-y

Structural Textile Pattern Recognition and Processing Based on Hypergraphs

Authors: Vuong M. Ngo, Sven Helmer, Nhien-An Le-Khac, M-Tahar Kechadi

Abstract: The humanities, like many other areas of society, are currently undergoing major changes in the wake of digital transformation. However, in order to make collections of digitised material in this area easily accessible, we often still lack adequate search functionality. For instance, digital archives for textiles offer keyword search, which is fairly well understood, and arrange their content following a certain taxonomy, but search functionality at the level of thread structure is still missing. To facilitate the clustering and search, we introduce an approach for recognising similar weaving patterns based on their structures for textile archives. We first represent textile structures using hypergraphs and extract multisets of k-neighbourhoods describing weaving patterns from these graphs. Then, the resulting multisets are clustered using various distance measures and various clustering algorithms (K-Means for simplicity and hierarchical agglomerative algorithms for precision). We evaluate the different variants of our approach experimentally, showing that this can be implemented efficiently (meaning it has linear complexity), and demonstrate its quality to query and cluster datasets containing large textile samples. As, to the best of our knowledge, this is the first practical approach for explicitly modelling complex and irregular weaving patterns usable for retrieval, we aim at establishing a solid baseline.

Submitted 20 March, 2021; originally announced March 2021.

Comments: 38 pages, 23 figures

Journal ref: Information Retrieval Journal, Springer, 2021

arXiv:2007.05119 [pdf]

Multi-objective Clustering Algorithm with Parallel Games

Authors: Dalila Kessira, Mohand-Tahar Kechadi

Abstract: Data mining and knowledge discovery are two important growing research fields in the last two decades due to the abundance of data collected from various sources. The exponentially growing volumes of generated data urge the development of several mining techniques to feed the needs for automatically derived knowledge. Clustering analysis (finding similar groups of data) is a well-established and widely used approach in data mining and knowledge discovery. In this paper, we introduce a clustering technique that uses game theory models to tackle multi-objective application problems. The main idea is to exploit a specific type of simultaneous move games, called congestion games. Congestion games offer numerous advantages ranging from being succinctly represented to possessing a Nash equilibrium that is reachable in polynomial time. The proposed algorithm has three main steps: 1) it starts by identifying the initial players (or the cluster-heads); 2) it establishes the initial clusters' composition by constructing the game and trying to find the equilibrium of the game; 3) it merges close clusters to obtain the final clusters. The experimental results show that the proposed clustering approach obtains good results and it is very promising in terms of scalability and performance.

Submitted 9 July, 2020; originally announced July 2020.

arXiv:2003.05043 [pdf]

Crop Knowledge Discovery Based on Agricultural Big Data Integration

Authors: Vuong M. Ngo, M-Tahar Kechadi

Abstract: Nowadays, agricultural data can be generated through various sources, such as Internet of Things (IoT), sensors, satellites, weather stations, robots, farm equipment, agricultural laboratories, farmers, government agencies and agribusinesses. The analysis of this big data enables farmers, companies and agronomists to extract high business and scientific knowledge, improving their operational processes and product quality. However, before analysing this data, different data sources need to be normalised, homogenised and integrated into a unified data representation. In this paper, we propose an agricultural data integration method using a constellation schema which is designed to be flexible enough to incorporate other datasets and big data models. We also apply some methods to extract knowledge with a view to improving crop yield; these include finding suitable quantities of soil properties, herbicides and insecticides for both increasing crop yield and protecting the environment.

Submitted 10 March, 2020; originally announced March 2020.

Comments: 5 pages

Journal ref: ICMLSC-2020

arXiv:2003.04470 [pdf, ps, other]

doi 10.1504/IJBPIM.2020.113115

Data Warehouse and Decision Support on Integrated Crop Big Data

Authors: V. M. Ngo, N. A. Le-Khac, M. T. Kechadi

Abstract: In recent years, precision agriculture has become very popular. The introduction of modern information and communication technologies for collecting and processing agricultural data is revolutionising agricultural practices. This started a while ago (early 20th century) and is driven by the low cost of collecting data about everything, from information on fields such as seed, soil, fertiliser, and pests, to weather data and drone and satellite images. In particular, agricultural data mining today is considered a Big Data application in terms of volume, variety, velocity and veracity. Hence it leads to challenges in processing vast amounts of complex and diverse information to extract useful knowledge for the farmer, agronomist, and other businesses. It is a key foundation to establishing a crop intelligence platform, which will enable efficient resource management and high quality agronomy decision making and recommendations. In this paper, we designed and implemented a continental level agricultural data warehouse (ADW). ADW is characterised by its (1) flexible schema; (2) data integration from real agricultural multi datasets; (3) data science and business intelligence support; (4) high performance; (5) high storage; (6) security; (7) governance and monitoring; (8) consistency, availability and partition tolerance; (9) cloud compatibility. We also evaluate the performance of ADW and present some complex queries to extract and return necessary knowledge about crop management.

Submitted 12 April, 2021; v1 submitted 9 March, 2020; originally announced March 2020.

Comments: 13 pages, 11 figures. arXiv admin note: text overlap with arXiv:1905.12411

Journal ref: International Journal of Business Process Integration and Management 2020 Vol.10 No.1

arXiv:1910.10547 [pdf]

doi 10.1109/CONIELECOMP.2007.80

Knowledge Map: Toward a New Approach Supporting the Knowledge Management in Distributed Data Mining

Authors: Nhien-An Le-Khac, Lamine M. Aouad, M-Tahar Kechadi

Abstract: Distributed data mining (DDM) deals with the problem of finding patterns or models, called knowledge, in an environment with distributed data and computations. Today, massive amounts of data, which are often geographically distributed and owned by different organisations, are being mined. As a consequence, a large amount of knowledge is being produced. This causes problems of not only knowledge management but also visualization in data mining. Besides, the main aim of DDM is to exploit fully the benefit of distributed data analysis while minimising the communication. Existing DDM techniques perform partial analysis of local data at individual sites and then generate a global model by aggregating these local results. These two steps are not independent since naive approaches to local analysis may produce an incorrect and ambiguous global data model. Integrating and coordinating these two steps requires effective knowledge management, concretely an efficient map of knowledge, in order to take advantage of mined knowledge to guide mining the data. In this paper, we present "knowledge map", a representation of knowledge about mined knowledge. This new approach aims to manage mined knowledge efficiently on large-scale distributed platforms such as the Grid. This knowledge map is used to facilitate not only the visualization and evaluation of mining results but also the coordination of local mining processes and existing knowledge to increase the accuracy of the final model.

Submitted 23 October, 2019; originally announced October 2019.

Comments: Third International Conference on Autonomic and Autonomous Systems (ICAS'07)

arXiv:1910.09437 [pdf]

doi 10.1016/j.jmsy.2013.02.001

Recurrent neural network approach for cyclic job shop scheduling problem

Authors: M-Tahar Kechadi, Kok Seng Low, G. Goncalves

Abstract: While cyclic scheduling is involved in numerous real-world applications, solving the derived problem is still of exponential complexity. This paper focuses specifically on modelling the manufacturing application as a cyclic job shop problem and we have developed an efficient neural network approach to minimise the cycle time of a schedule. Our approach introduces an interesting model for a manufacturing production, and it is also very efficient, adaptive and flexible enough to work with other techniques. Experimental results validated the approach and confirmed our hypotheses about the system model and the efficiency of neural networks for such a class of problems.

Submitted 21 October, 2019; originally announced October 2019.

Comments: Journal of Manufacturing Systems, Volume 32, Issue 4, October 2013, Pages 689-699

arXiv:1910.08626 [pdf]

Collection of Historical Weather Data: Issues with Missing Values

Authors: Fadoua Rafii, Tahar Kechadi

Abstract: Weather data collected from automated weather stations have become a crucial component for making decisions in agriculture and in forestry. Over time, weather stations may become out-of-order or stopped for maintenance, and therefore, during those periods, the data values will be missing. Unfortunately, this will cause huge problems when analysing the data. The main aim of this study is to create high-quality historical weather datasets by dealing efficiently with missing values. In this paper, we present a set of missing data imputation methods and study their effectiveness. These methods were used based on different types of missing values. The experimental results show that two of the proposed methods are very promising and can be used at a larger scale.

Submitted 15 October, 2019; originally announced October 2019.

Comments: The 4th International Conference on Smart City Applications

arXiv:1908.10229 [pdf, other]

A Security-Aware Access Model for Data-Driven EHR System

Authors: Ngoc Hong Tran, Thien-An Nguyen-Ngoc, Nhien-An Le-Khac, M-Tahar Kechadi

Abstract: Digital healthcare systems are very popular lately, as they provide a variety of helpful means to monitor people's health state as well as to protect people against an unexpected health situation. These systems contain a huge amount of personal information in the form of electronic health records that are not allowed to be disclosed to unauthorized users. Hence, health data and information need to be protected against attacks and thefts. In this paper, we propose a secure distributed architecture for healthcare data storage and analysis. It uses a novel security model to rigorously control permissions of accessing sensitive data in the system, as well as to protect the transmitted data between distributed system servers and nodes. The model also satisfies the NIST security requirements. Thorough experimental results show that the model is very promising.

Submitted 27 August, 2019; originally announced August 2019.

Comments: 13 pages, 12 figures, 3 tables

arXiv:1905.12411 [pdf, other]

Designing and Implementing Data Warehouse for Agricultural Big Data

Authors: Vuong M. Ngo, Nhien-An Le-Khac, M-Tahar Kechadi

Abstract: In recent years, precision agriculture that uses modern information and communication technologies is becoming very popular. Raw and semi-processed agricultural data are usually collected through various sources, such as Internet of Things (IoT), sensors, satellites, weather stations, robots, farm equipment, farmers and agribusinesses. Besides, agricultural datasets are very large, complex, unstructured, heterogeneous, non-standardized, and inconsistent. Hence, agricultural data mining is considered a Big Data application in terms of volume, variety, velocity and veracity. It is a key foundation to establishing a crop intelligence platform, which will enable resource efficient agronomy decision making and recommendations. In this paper, we designed and implemented a continental level agricultural data warehouse by combining Hive, MongoDB and Cassandra. Our data warehouse's capabilities include: (1) flexible schema; (2) data integration from real agricultural multi datasets; (3) data science and business intelligence support; (4) high performance; (5) high storage; (6) security; (7) governance and monitoring; (8) replication and recovery; (9) consistency, availability and partition tolerance; (10) distributed and cloud deployment. We also evaluate the performance of our data warehouse.

Submitted 29 May, 2019; originally announced May 2019.

Comments: Business intelligent, data warehouse, constellation schema, Big Data, precision agriculture

Journal ref: BigData 2019

arXiv:1903.03061 [pdf]

doi 10.1016/j.diin.2009.06.014

DIALOG: A framework for modeling, analysis and reuse of digital forensic knowledge

Authors: Damir Kahvedzic, Tahar Kechadi

Abstract: This paper presents DIALOG (Digital Investigation Ontology), a framework for the management, reuse, and analysis of Digital Investigation knowledge. DIALOG provides a general, application independent vocabulary that can be used to describe an investigation at different levels of detail. DIALOG is defined to encapsulate all concepts of the digital forensics field and the relationships between them. In particular, we concentrate on the Windows Registry, where registry keys are modeled in terms of both their structure and function. Registry analysis software tools are modeled in a similar manner and we illustrate how the interpretation of their results can be done using the reasoning capabilities of ontology.

Submitted 21 February, 2019; originally announced March 2019.

Journal ref: Digital Investigation Volume 6, Supplement, September 2009, Pages S23-S33

arXiv:1903.03008 [pdf]

doi 10.1007/s10115-009-0205-3

Performance study of distributed Apriori-like frequent itemsets mining

Authors: Lamine M. Aouad, Nhien-An Le-Khac, Tahar M. Kechadi

Abstract: In this article, we focus on distributed Apriori-based frequent itemsets mining. We present a new distributed approach which takes into account inherent characteristics of this algorithm. We study the distribution aspect of this algorithm and give a comparison of the proposed approach with a classical Apriori-like distributed algorithm, using both analytical and experimental studies. We find that under a wide range of conditions and datasets, the performance of a distributed Apriori-like algorithm is not related to global strategies of pruning since the performance of the local Apriori generation is usually characterized by relatively high success rates of candidate sets frequency at low levels which switch to very low rates at some stage, and often drop to zero. This means that the intermediate communication steps and remote support counts computation and collection in classical distributed schemes are computationally inefficient locally, and thus constrain the global performance. Our performance evaluation is done on a large cluster of workstations using the Condor system and its workflow manager DAGMan. The results show that the presented approach greatly enhances the performance and achieves good scalability compared to a typical distributed Apriori-based algorithm.

Submitted 21 February, 2019; originally announced March 2019.

Journal ref: Knowledge and Information Systems April 2010, Volume 23, Issue 1

arXiv:1903.01396 [pdf]

doi 10.1016/j.diin.2014.05.009

A complete formalized knowledge representation model for advanced digital forensics timeline analysis

Authors: Yoan Chabot, Aurélie Bertaux, Christophe Nicollea, Tahar Kechadi

Abstract: Having a clear view of events that occurred over time is a difficult objective to achieve in digital investigations (DI). Event reconstruction, which allows investigators to understand the timeline of a crime, is one of the most important steps of a DI process. This complex task requires exploration of a large number of events due to the pervasiveness of new technologies nowadays. Any evidence produced at the end of the investigative process must also meet the requirements of the courts, such as reproducibility, verifiability, validation, etc. For this purpose, we propose a new methodology, supported by theoretical concepts, that can assist investigators through the whole process including the construction and the interpretation of the events describing the case. The proposed approach is based on a model which integrates knowledge of experts from the fields of digital forensics and software development to allow a semantically rich representation of events related to the incident. The main purpose of this model is to allow the analysis of these events in an automatic and efficient way. This paper describes the approach and then focuses on the main conceptual and formal aspects: a formal incident modelization and operators for timeline reconstruction and analysis.

Submitted 21 February, 2019; originally announced March 2019.

Journal ref: Digital Investigation Volume 11, Supplement 2, August 2014, Pages S95-S105

arXiv:1902.08040 [pdf]

doi 10.1109/ISPDC.2004.21

Dynamic task scheduling in computing cluster environments

Authors: I. K. Savvas, M. Tahar Kechadi

Abstract: In this study, a cluster-computing environment is employed as a computational platform. In order to increase the efficiency of the system, a dynamic task scheduling algorithm is proposed, which balances the load among the nodes of the cluster. The technique is dynamic, nonpreemptive, adaptive, and it uses a mix of centralised and decentralised policies. Based on the divide and conquer principle, the algorithm models the cluster as hyper-grids and then balances the load among them. Recursively, the hyper-grids of dimension k are divided into grids of dimensions k - 1, until the dimension is 1. Then, all the nodes of the cluster are almost equally loaded. The optimum dimension of the hyper-grid is chosen in order to achieve the best performance. The simulation results show the effective use of the algorithm. In addition, we determined the critical points (lower bounds) at which the algorithm can be triggered.

Submitted 21 February, 2019; originally announced February 2019.

Comments: Third International Symposium on Parallel and Distributed Computing/Third International Workshop on Algorithms, Models and Tools for Parallel Computing on Heterogeneous Networks

arXiv:1807.00035 [pdf]

An Efficient Data Warehouse for Crop Yield Prediction

Authors: Vuong M. Ngo, Nhien-An Le-Khac, M-Tahar Kechadi

Abstract: Nowadays, precision agriculture combined with modern information and communications technologies is becoming more common in agricultural activities such as automated irrigation systems, precision planting, variable rate applications of nutrients and pesticides, and agricultural decision support systems. In the latter, crop management data analysis, based on machine learning and data mining, focuses mainly on how to efficiently forecast and improve crop yield. In recent years, raw and semi-processed agricultural data are usually collected using sensors, robots, satellites, weather stations, farm equipment, farmers and agribusinesses while the Internet of Things (IoT) should deliver the promise of wirelessly connecting objects and devices in the agricultural ecosystem. Agricultural data typically captures information about farming entities and operations. Every farming entity encapsulates an individual farming concept, such as field, crop, seed, soil, temperature, humidity, pest, and weed. Agricultural datasets are spatial, temporal, complex, heterogeneous, non-standardized, and very large. In particular, agricultural data is considered as Big Data in terms of volume, variety, velocity and veracity. Designing and developing a data warehouse for precision agriculture is a key foundation for establishing a crop intelligence platform, which will enable resource efficient agronomy decision making and recommendations.
Some of the requirements for such an agricultural data warehouse are privacy, security, and real-time access among its stakeholders (e.g., farmers, farm equipment manufacturers, agribusinesses, co-operative societies, customers and possibly Government agencies). However, currently there are very few reports in the literature that focus on the design of efficient data warehouses with the view of enabling Agricultural Big Data analysis and data mining. In this paper ...

Submitted 26 June, 2018; originally announced July 2018.

Comments: 12 pages. Keywords. Data warehouse, constellation schema, crop yield prediction, precision agriculture

Journal ref: Proceedings of the 14th International Conference on Precision Agriculture. June 24 to June 27, 2018, Montreal, Quebec, Canada

arXiv:1804.08653 [pdf]

Forensic Analysis of the exFAT artefacts

Authors: Yves Vandermeer, Nhien-An Le-Khac, Joe Carthy, Tahar Kechadi

Abstract: Although keeping some basic concepts inherited from FAT32, the exFAT file system introduces many differences, such as the new mapping scheme of directory entries. The combination of the exFAT mapping scheme with the allocation of bitmap files and the use of FAT leads to new forensic possibilities. The recovery of deleted files, including fragmented ones, and carving become more accurate compared with former forensic processes. Nowadays, accurate and sound forensic analysis is needed more than ever, as there is a high risk of erroneous interpretation. Indeed, most of the related work in the literature on exFAT structure and forensics is mainly based on reverse engineering research, and only a few studies cover the forensic interpretation. In this paper, we propose a new methodology that uses exFAT file system features to improve the interpretation of inactive entries through bitmap file analysis and to recover the file system metadata information for carved files. Experimental results show how our approach improves the forensic interpretation accuracy.

Submitted 23 April, 2018; originally announced April 2018.

arXiv:1802.00688 [pdf, other]

doi 10.1109/ICDMW.2016.0158

Hierarchical Aggregation Approach for Distributed clustering of spatial datasets

Authors: Malika Bendechache, Nhien-An Le-Khac, M-Tahar Kechadi

Abstract: In this paper, we present a new approach of distributed clustering for spatial datasets, based on an innovative and efficient aggregation technique. This distributed approach consists of two phases: 1) local clustering phase, where each node performs a clustering on its local data, 2) aggregation phase, where the local clusters are aggregated to produce global clusters. This approach is characterised by the fact that the local clusters are represented in a simple and efficient way. The aggregation phase is designed in such a way that the final clusters are compact and accurate while the overall process is efficient in both response time and memory allocation. We evaluated the approach with different datasets and compared it to well-known clustering techniques. The experimental results show that our approach is very promising and outperforms all those algorithms.

Submitted 1 February, 2018; originally announced February 2018.

Comments: 6 pages. arXiv admin note: substantial text overlap with arXiv:1704.03421

arXiv:1802.00304 [pdf]

doi 10.1109/ICSDM.2015.7298026

Distributed Clustering Algorithm for Spatial Data Mining

Authors: Malika Bendechache, M-Tahar Kechadi

Abstract: Distributed data mining techniques, and mainly distributed clustering, have been widely used in the last decade because they deal with very large and heterogeneous datasets which cannot be gathered centrally. Current distributed clustering approaches normally generate global models by aggregating local results that are obtained on each site. While this approach mines the datasets on their locations, the aggregation phase is complex, which may produce incorrect and ambiguous global clusters and therefore incorrect knowledge. In this paper we propose a new clustering approach for very large spatial datasets that are heterogeneous and distributed. The approach is based on the K-means algorithm but it generates the number of global clusters dynamically. Moreover, this approach uses an elaborated aggregation phase. The aggregation phase is designed in such a way that the overall process is efficient in time and memory allocation. Preliminary results show that the proposed approach produces high quality results and scales up well. We also compared it to two popular clustering algorithms and show that this approach is much more efficient.

Submitted 1 February, 2018; originally announced February 2018.

Comments: 6 pages. arXiv admin note: text overlap with arXiv:1704.03421

Journal ref: Spatial Data Mining and Geographical Knowledge Services (ICSDM), 2015 2nd IEEE International Conference on, pages 60--65, 2015

arXiv:1801.10391 [pdf]

Internet of things forensics: Challenges and Case Study

Authors: Saad Alabdulsalam, Kevin Schaefer, Tahar Kechadi, Nhien-An Le-Khac

Abstract: Today is the era of the Internet of Things (IoT): millions of machines such as cars, smoke detectors, watches, glasses, webcams, etc. are being connected to the Internet. The number of machines that possess the ability of remote access to monitor and collect data is continuously increasing. On the one hand, this development makes human life more comfortable and convenient, but on the other hand it raises security and privacy issues. This development also raises challenges for the digital investigator when IoT devices are involved in crime scenes. Indeed, current research in the literature focuses on security and privacy for IoT environments rather than methods or techniques of forensic acquisition and analysis for IoT devices. Therefore, in this paper, we first discuss different aspects related to IoT forensics and then focus on the current challenges. We also describe forensic approaches for an IoT smartwatch device as a case study. We analyze forensic artifacts retrieved from smartwatch devices and discuss the evidence found in relation to the challenges in IoT forensics.

Submitted 31 January, 2018; originally announced January 2018.

arXiv:1710.09593 [pdf, other]

Distributed Spatial Data Clustering as a New Approach for Big Data Analysis

Authors: Malika Bendechache, Nhien-An Le-Khac, M-Tahar Kechadi

Abstract: In this paper we propose a new approach for Big Data mining and analysis. This new approach works well on distributed datasets and deals with the data clustering task of the analysis. The approach consists of two main phases. The first phase executes a clustering algorithm on local data, assuming that the datasets were already distributed among the system processing nodes. The second phase deals with the aggregation of local clusters to generate global clusters. This approach not only generates local clusters on each processing node in parallel, but also facilitates the formation of global clusters without prior knowledge of the number of clusters, which many partitioning clustering algorithms require. In this study, this approach was applied on spatial datasets. The proposed aggregation phase is very efficient and does not involve the exchange of large amounts of data between the processing nodes. The experimental results show that the approach has super linear speed up, scales up very well, and can take advantage of recent programming models, such as the MapReduce model, as its results are not affected by the types of communications.

Submitted 1 March, 2018; v1 submitted 26 October, 2017; originally announced October 2017.

arXiv:1708.09053 [pdf]

Increasing digital investigator availability through efficient workflow management and automation

Authors: Ronald In de Braekt, Nhien-An Le-Khac, Jason Farina, Mark Scanlon, M-Tahar Kechadi

Abstract: The growth of digital storage capacities and the diversity of devices has had a significant time impact on digital forensic laboratories in law enforcement. Backlogs have become commonplace and increasingly more time is spent in the acquisition and preparation steps of an investigation as opposed to detailed evidence analysis and reporting. There is generally little room for increasing digital investigation capacity in law enforcement digital forensic units and the allocated budgets for these units are often decreasing. In the context of developing an efficient investigation process, one of the key challenges amounts to how to achieve more with less. This paper proposes a workflow management automation framework for handling common digital forensic tools. The objective is to streamline the digital investigation workflow - enabling more efficient use of limited hardware and software. The proposed automation framework reduces the time digital forensic experts waste conducting time-consuming, though necessary, tasks. The evidence processing time is decreased through server-side automation resulting in 24/7 evidence preparation. The proposed framework increases efficiency of use of forensic software and hardware, reduces the infrastructure costs and license fees, and simplifies the preparation steps for the digital investigator. The proposed approach is evaluated in a real-world scenario to evaluate its robustness and highlight its benefits.

Submitted 29 August, 2017; originally announced August 2017.

arXiv:1708.09051 [pdf]

Investigation and Automating Extraction of Thumbnails Produced by Image viewers

Authors: Wybren van der Meer, Kim-Kwang Raymond Choo, Nhien-An Le-Khac, M-Tahar Kechadi

Abstract: Today, in digital forensics, images normally provide important information within an investigation. However, not all images may still be available within a forensic digital investigation, for example because they were deleted. Data carving can be used in this case to retrieve deleted images, but the carving time is normally significant and these images may moreover have been overwritten by other data. One of the solutions is to look at thumbnails of images that are no longer available. These thumbnails can often be found within databases created by either operating systems or image viewers. In the literature, most research and practice focus on the extraction of thumbnails from databases created by the operating system. There is little research on the thumbnails created by image viewers, as these thumbnails are application-driven in terms of pre-defined sizes, adjustments and storage location. Nevertheless, thumbnail databases from image viewers are significant forensic artefacts for investigators as these programs deal with large amounts of images. However, investigating these databases is so far still a manual or semi-automatic task that consumes a huge amount of forensic time. Therefore, in this paper we propose a new approach for automating the extraction of thumbnails produced by image viewers. We also test our approach with popular image viewers in different storage structures and locations to show its robustness.

Submitted 29 August, 2017; originally announced August 2017.

arXiv:1704.04302 [pdf]

On a Distributed Approach for Density-based Clustering

Authors: Nhien-An Le-Khac, M-Tahar Kechadi

Abstract: Efficient extraction of useful knowledge from these data is still a challenge, mainly when the data is distributed, heterogeneous and of different quality depending on its corresponding local infrastructure. To reduce the overhead cost, most of the existing distributed clustering approaches generate global models by aggregating local results obtained on each individual node. The complexity and quality of solutions depend highly on the quality of the aggregation. In this respect, we propose an approach for distributed density-based clustering that both reduces the communication overheads due to the data exchange and improves the quality of the global models by considering the shapes of local clusters. From preliminary results we show that this algorithm is very promising.

Submitted 13 April, 2017; originally announced April 2017.

arXiv:1704.04301 [pdf]

A Tree-based Approach for Detecting Redundant Business Rules in very Large Financial Datasets

Authors: Nhien-An Le-Khac, Sammer Markos, M-Tahar Kechadi

Abstract: Net Asset Value (NAV) calculation and validation is the principal task of a fund administrator. If the NAV of a fund is calculated incorrectly then there is a huge impact on the fund administrator, such as monetary compensation, reputational loss, or loss of business. In general, these companies use the same methodology to calculate the NAV of a fund; however, the type of fund in question dictates the set of business rules used to validate this. Today, most fund administrators depend heavily on human resources due to the lack of an automated standardized solution. However, due to the economic climate and the need for efficiency and cost reduction, many banks are now looking for an automated solution with minimal human interaction, i.e., straight through processing (STP). Within the scope of a collaboration project that focuses on building an optimal solution for NAV validation, in this paper, we will present a new approach for detecting correlated business rules. We also show how we evaluate this approach using real-world financial data.

Submitted 13 April, 2017; originally announced April 2017.

arXiv:1704.03538 [pdf]

Toward a Distributed Knowledge Discovery system for Grid systems

Authors: Nhien-An Le-Khac, Lamine Aouad, M-Tahar Kechadi

Abstract: During the last decade or so, we have had a deluge of data from not only science fields but also industry and commerce fields. Although the amount of data available to us is constantly increasing, our ability to process it becomes more and more difficult. Efficient discovery of useful knowledge from these datasets is therefore becoming a challenge and a massive economic need. This led to the need of developing large-scale data mining (DM) techniques to deal with these huge datasets either from science or economic applications. In this chapter, we present a new DDM system combining dataset-driven and architecture-driven strategies. Data-driven strategies will consider the size and heterogeneity of the data, while architecture-driven strategies will focus on the distribution of the datasets. This system is based on Grid middleware tools that integrate appropriate large data manipulation operations. Therefore, this allows more dynamicity and autonomicity during the mining, integrating and processing phases.

Submitted 11 April, 2017; originally announced April 2017.

arXiv:1704.03530 [pdf]

Feature Selection Parallel Technique for Remotely Sensed Imagery Classification

Authors: Nhien-An Le-Khac, M-Tahar Kechadi, Bo Wu, C. Chen

Abstract: Remote sensing research focusing on feature selection has long attracted the attention of the remote sensing community because feature selection is a prerequisite for image processing and various applications. Different feature selection methods have been proposed to improve the classification accuracy. They vary from basic search techniques to clonal selections, and various optimal criteria have been investigated. Recently, methods using dependence-based measures have attracted much attention due to their ability to deal with very high dimensional datasets. However, these methods are based on Cramér's V test, which has performance issues with large datasets. In this paper, we propose a parallel approach to improve their performance. We evaluate our approach on hyper-spectral and high spatial resolution images and compare it to the proposed methods with a centralized version as preliminary results. The results are very promising.
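
For readers unfamiliar with the dependence measure named in this abstract, the following minimal sketch computes the standard (uncorrected) Cramér's V statistic from a contingency table, V = sqrt(chi² / (n · min(r − 1, c − 1))). How the surveyed feature selection methods apply the statistic, and how the paper parallelises it, is not stated in the abstract, so none of that is shown here; the function name `cramers_v` and the list-of-rows input format are assumptions for illustration.

```python
import math


def cramers_v(table):
    """Uncorrected Cramér's V for an r x c contingency table (list of rows of counts)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    k = min(len(row_totals), len(col_totals)) - 1
    if n == 0 or k == 0:
        raise ValueError("need a non-empty table with at least 2 rows and 2 columns")

    # Pearson chi-squared statistic: sum of (observed - expected)^2 / expected.
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:
                chi2 += (observed - expected) ** 2 / expected

    return math.sqrt(chi2 / (n * k))


if __name__ == "__main__":
    print(cramers_v([[10, 0], [0, 10]]))  # perfectly associated features
    print(cramers_v([[5, 5], [5, 5]]))    # independent features
```

The statistic needs a full pass over the data to build each contingency table, which gives some intuition for why it becomes costly on very large, high-dimensional datasets.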

Submitted 11 April, 2017; originally announced April 2017.

arXiv:1704.03527 [pdf]

Toward a new approach for massive LiDAR data processing

Authors: V-H Cao, K-X Chu, Nhien-An Le-Khac, M-T Kechadi, Debra F. Laefer, Linh Truong-Hong

Abstract: Laser scanning (also known as Light Detection And Ranging) has been widely applied in various applications. As part of that, aerial laser scanning (ALS) has been used to collect topographic data points for a large area, which results in millions of points being acquired. Furthermore, today, with the integration of full waveform (FWF) technology during ALS data acquisition, all return information of the laser pulse is stored. Thus, ALS data become massive and complex, since the FWF of each laser pulse can be stored in up to 256 samples and the density of ALS data is also increasing significantly. Processing LiDAR data demands heavy operations and the traditional approaches require significant hardware and running time. On the other hand, researchers have recently proposed parallel approaches for analysing LiDAR data. These approaches are normally based on the parallel architecture of target systems such as multi-core processors, GPUs, etc. However, efficient approaches and tools supporting the analysis of LiDAR data are still missing due to the lack of a deep study on both library tools and algorithms used in processing this data. In this paper, we present a comparative study of software libraries and algorithms to optimise the processing of LiDAR data. We also propose a new method to improve this process with experiments on large LiDAR data. Finally, we discuss a parallel solution of our approach where we integrate parallel computing in processing LiDAR data.

Submitted 11 April, 2017; originally announced April 2017.

arXiv:1704.03524 [pdf]

Forensic Analysis of TomTom Navigation Application

Authors: Nhien-An Le-Khac, Mark Roeloffs, M-Tahar Kechadi

Abstract: In the forensic field of digital technology, there has been a great deal of investigation into the decoding of navigation systems of the brand TomTom. As TomTom is the market leader in navigation systems, a large number of these devices are investigated. These devices can hold an abundance of significant location information. Currently, it is possible with the use of multiple methods to make physical copies of mobile devices running Android. The next great forensic problem is all the various programs that can be installed on these devices. There is now an application available from the company TomTom in the Google Play Store. This application mimics a navigation system on your Android mobile device. Indeed, the TomTom application on Android can hold a great deal of information. In this paper, we present a process of forensic acquisition and analysis of the TomTom Android application. We focus on the following questions: Is there a possibility to find previously driven routes or GPS coordinates with timestamps in the memory of the mobile device? To investigate what is stored in these files, driving tests were performed. During these driving tests a copy was made of the most important file using a self-written program. The significant files were found and the data in these files was decoded. We show and analyse our results with Samsung mobile devices. We also compare these results with forensic acquisition from TomTom GPS devices.

Submitted 11 April, 2017; originally announced April 2017.

arXiv:1704.03421 [pdf, other]

doi 10.1109/DSAA.2016.70

Efficient Large Scale Clustering based on Data Partitioning

Authors: Malika Bendechache, Nhien-An Le-Khac, M-Tahar Kechadi

Abstract: Clustering techniques are very attractive for extracting and identifying patterns in datasets. However, their application to very large spatial datasets presents numerous challenges such as high-dimensionality data, heterogeneity, and high complexity of some algorithms. For instance, some algorithms may have linear complexity but they require the domain knowledge in order to determine their input parameters. Distributed clustering techniques constitute a very good alternative to the big data challenges (e.g., Volume, Variety, Veracity, and Velocity). Usually these techniques consist of two phases. The first phase generates local models or patterns and the second one tends to aggregate the local results to obtain global models. While the first phase can be executed in parallel on each site and is, therefore, efficient, the aggregation phase is complex, time consuming and may produce incorrect and ambiguous global clusters and therefore incorrect models. In this paper we propose a new distributed clustering approach to deal efficiently with both phases, generation of local results and generation of global models by aggregation. For the first phase, our approach is capable of analysing the datasets located in each site using different clustering techniques. The aggregation phase is designed in such a way that the final clusters are compact and accurate while the overall process is efficient in time and memory allocation. For the evaluation, we use two well-known clustering algorithms, K-Means and DBSCAN.
One of the key outputs of this distributed clustering technique is that the number of global clusters is dynamic and need not be fixed in advance. Experimental results show that the approach is scalable and produces high quality results.

Submitted 26 February, 2018; v1 submitted 11 April, 2017; originally announced April 2017.

Comments: 10 pages

Journal ref: Data Science and Advanced Analytics (DSAA), 2016 IEEE International Conference on, 612--621, 2016

arXiv:1703.09823 [pdf]

Variance-based Clustering Technique for Distributed Data Mining Applications

Authors: Lamine M. Aouad, Nhien-An Le-Khac, Tahar Kechadi

Abstract: Nowadays, huge amounts of data are naturally collected in distributed sites due to different factors, and moving these data through the network for extracting useful knowledge is almost unfeasible for either technical reasons or policies. Furthermore, classical parallel algorithms cannot be applied, especially in loosely coupled environments. This requires developing scalable distributed algorithms able to return the global knowledge by aggregating local results in an effective way. In this paper we propose a distributed algorithm based on independent local clustering processes and a global merging step based on minimum variance increases, which requires only a limited communication overhead. We also introduce the notion of distributed sub-clusters perturbation to improve the global generated distribution. We show that this algorithm improves the quality of clustering compared to classical local centralized ones and is able to find real global data nature or distribution.
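
One way to read "global merging based on minimum variance increases" is as Ward's criterion applied to compact summaries of local sub-clusters. The sketch below follows that reading, which is an assumption and not confirmed by the abstract: each site sends only a (size, centroid) pair per sub-cluster, and the pair whose merge adds the least within-cluster sum of squares is merged first. The names `merge_cost`, `merge`, `aggregate`, and the stopping threshold `max_increase` are hypothetical, and the perturbation step mentioned in the abstract is not modelled.

```python
def merge_cost(a, b):
    """Increase in within-cluster sum of squares if summaries a and b are merged."""
    (na, mean_a), (nb, mean_b) = a, b
    dist2 = sum((x - y) ** 2 for x, y in zip(mean_a, mean_b))
    return na * nb / (na + nb) * dist2


def merge(a, b):
    """Combine two (size, centroid) summaries without touching the raw points."""
    (na, mean_a), (nb, mean_b) = a, b
    n = na + nb
    return n, tuple((na * x + nb * y) / n for x, y in zip(mean_a, mean_b))


def aggregate(summaries, max_increase):
    """Greedily merge the cheapest pair until the cost exceeds max_increase (assumed rule)."""
    clusters = list(summaries)
    while len(clusters) > 1:
        pairs = [
            (merge_cost(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        ]
        cost, i, j = min(pairs)
        if cost > max_increase:
            break
        merged = merge(clusters[i], clusters[j])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters


if __name__ == "__main__":
    local = [(10, (0.0, 0.0)), (10, (0.2, 0.0)), (10, (5.0, 5.0))]
    print(aggregate(local, max_increase=1.0))
```

Only sizes and centroids cross the network in this sketch, which illustrates how a merging criterion of this kind can keep communication overhead small.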

Submitted 28 March, 2017; originally announced March 2017.

arXiv:1703.09807 [pdf]

Grid-based Approaches for Distributed Data Mining Applications

Authors: Lamine M. Aouad, Nhien-An Le-Khac, Tahar Kechadi

Abstract: The data mining field is an important source of large-scale applications and datasets which are getting more and more common. In this paper, we present grid-based approaches for two basic data mining applications, and a performance evaluation on an experimental grid environment that provides interesting monitoring capabilities and configuration tools. We propose a new distributed clustering approach and a distributed frequent itemset generation approach well adapted to grid environments. Performance evaluation is done using the Condor system and its workflow manager DAGMan. We also compare this performance analysis to a simple analytical model to evaluate the overheads related to the workflow engine and the underlying grid system. This will specifically show that realistic performance expectations are currently difficult to achieve on the grid.

Submitted 28 March, 2017; originally announced March 2017.

arXiv:1703.09756 [pdf]

Admire framework: Distributed data mining on data grid platforms

Authors: Nhien-An Le-Khac, M-Tahar Kechadi, Joe Carthy

Abstract: In this paper, we present the ADMIRE architecture; a new framework for developing novel and innovative data mining techniques to deal with very large and distributed heterogeneous datasets in both commercial and academic applications. The main ADMIRE components are detailed as well as its interfaces allowing the user to efficiently develop and implement their data mining applications techniques on a Grid platform such as Globus ToolKit, DGET, etc.

Submitted 28 March, 2017; originally announced March 2017.

arXiv:1612.00204 [pdf]

Forensics Acquisition and Analysis of instant messaging and VoIP applications

Authors: Christos Sgaras, M-Tahar Kechadi, Nhien-An Le-Khac

Abstract: The advent of the Internet has significantly transformed the daily activities of millions of people, one of them being the way people communicate, where Instant Messaging (IM) and Voice over IP (VoIP) communications have become prevalent. Although IM applications are ubiquitous communication tools nowadays, it was observed that the relevant research on the topic of evidence collection from IM services was limited. The reason is that an IM can serve as a very useful yet very dangerous platform for the victim and the suspect to communicate. Indeed, the increased use of Instant Messengers on smart phones has turned out to be a goldmine for mobile and computer forensic experts. Traces and evidence left by applications can be held on smart phones, and retrieving this potential evidence with the right forensic technique is strongly required. Recently, most research on IM forensics has focused on applications such as WhatsApp, Viber and Skype. However, in the literature, there are very few forensic analyses and comparisons related to IM applications such as WhatsApp, Viber, Skype and Tango on both iOS and Android platforms, even though the total number of users of these applications has already exceeded 1 billion. Therefore, in this paper we present forensic acquisition and analysis of these four IMs and VoIPs for both iOS and Android platforms. We try to answer how evidence can be collected when IM communications are used. We also define a taxonomy of target artefacts in order to guide and structure the subsequent forensic analysis. Finally, a review of the information that can become available via the IM vendor was conducted. The results of this research provide detailed answers on the types of artefacts that can be identified for these IM and VoIP applications. Moreover, we compare the forensic analysis of these popular applications: WhatsApp, Skype, Viber and Tango.

Submitted 1 December, 2016; originally announced December 2016.

arXiv:1611.09566 [pdf]

The State of the Art Forensic Techniques in Mobile Cloud Environment: A Survey, Challenges and Current Trends

Authors: Muhammad Faheem, M-Tahar Kechadi, Nhien-An Le-Khac

Abstract: Smartphones have become popular in recent days due to the accessibility of a wide range of applications. These sophisticated applications demand more computing resources than a resource-constrained smartphone can offer. Cloud computing is the motivating factor for the progress of these applications. The emerging mobile cloud computing introduces a new architecture to offload the smartphone and utilize cloud computing technology to meet resource requirements. The popularity of mobile cloud computing is an opportunity for misuse and unlawful activities. It is therefore a challenging platform for digital forensic investigations due to the non-availability of methodologies, tools and techniques. The aim of this work is to analyze the forensic tools and methodologies for crime investigation in a mobile cloud platform, as it poses challenges in proving the evidence. The advancement of forensic tools and methodologies is much slower than the current technology development in mobile cloud computing. As a result, the available tools and techniques become increasingly obsolete, which opens the door for new forensic tools and techniques to cope with recent developments. Hence, this work presents a detailed survey of forensic methodology and corresponding issues in mobile devices, cloud environments, and mobile cloud applications. It mainly focuses on digital forensic issues related to mobile cloud applications and also analyzes the scope, challenges and opportunities. Finally, this work reviews the forensic procedures of two cloud storage services used for mobile cloud applications, namely Dropbox and SkyDrive.

Submitted 29 November, 2016; originally announced November 2016.

arXiv:1611.09564 [pdf]

Toward a new mobile cloud forensic framework

Authors: Muhammad Faheem, M-Tahar Kechadi, Nhien-An Le-Khac

Abstract: Smartphones have had a significant impact on the day-to-day activities of every individual. Nowadays a wide range of smartphone applications is available, and these applications require high computing resources. Cloud computing offers enormous resources and extends services to resource-constrained mobile devices. Mobile Cloud Computing is emerging as a key technology to utilize virtually unlimited resources over the Internet using smartphones. Offloading data and computations improves productivity, enhances performance, saves energy, and improves user experience. Social network applications largely utilize Mobile Cloud Computing to reap these benefits. Social networks have witnessed unprecedented growth in recent years, and millions of registered users access them using smartphones. Mobile cloud social network applications introduce not only convenience but also various issues related to criminal and illegal activities. Despite being primarily used to communicate and socialize with contacts, the multifarious and anonymous nature of social networking websites increases susceptibility to cybercrimes. Taking into account the advantages of mobile cloud computing and the popularity of social network applications, it is essential to establish a forensic framework based on the mobile cloud platform that meets today's forensic requirements. In this paper we present a mobile cloud forensic framework that allows the forensic investigator to collect the automated synchronized copies of data on both mobile and cloud servers to prove the evidence of cloud usage. We also show our preliminary results of this study.

Submitted 29 November, 2016; originally announced November 2016.

arXiv:1611.01754 [pdf]

Forensics in Industrial Control System: A Case Study

Authors: Pieter Van Vliet, M-T. Kechadi, Nhien-An Le-Khac

Abstract: Industrial Control Systems (ICS) are used worldwide in critical infrastructures. An ICS can be a single embedded system working stand-alone to control a simple process, or it can be a very complex Distributed Control System (DCS) connected to Supervisory Control And Data Acquisition (SCADA) system(s) in a nuclear power plant. Although ICS are widely used today, there is very little research on the forensic acquisition and analysis of ICS artefacts. In this paper we present a case study of forensics in ICS where we describe a method of safeguarding important volatile artefacts from an embedded industrial control system and several other sources.

Submitted 6 November, 2016; originally announced November 2016.

arXiv:1609.08520 [pdf]

Clustering Approaches for Financial Data Analysis: a Survey

Authors: Fan Cai, Nhien-An Le-Khac, Tahar Kechadi

Abstract: Nowadays, financial data analysis is becoming increasingly important in the business market. As companies collect more and more data from daily operations, they expect to extract useful knowledge from existing collected data to help make reasonable decisions for new customer requests, e.g. user credit category, confidence of expected return, etc. Banking and financial institutes have applied different data mining techniques to enhance their business performance. Among these techniques, clustering has been considered as a significant method to capture the natural structure of data. However, there are not many studies on clustering approaches for financial data analysis. In this paper, we evaluate different clustering algorithms for analysing different financial datasets varied from time series to transactions. We also discuss the advantages and disadvantages of each method to enhance the understanding of inner structure of financial datasets as well as the capability of each clustering method in this context.

Submitted 4 September, 2016; originally announced September 2016.

arXiv:1609.02976 [pdf]

An Integrated Classification Model for Financial Data Mining

Authors: Fan Cai, Nhien-An Le-Khac, M-T. Kechadi

Abstract: Nowadays, financial data analysis is becoming increasingly important in the business market. As companies collect more and more data from daily operations, they expect to extract useful knowledge from existing collected data to help make reasonable decisions for new customer requests, e.g. user credit category, churn analysis, real estate analysis, etc. Financial institutes have applied different data mining techniques to enhance their business performance. However, simple approach of these techniques could raise a performance issue. Besides, there are very few general models for both understanding and forecasting different financial fields. We present in this paper a new classification model for analyzing financial data. We also evaluate this model with different real-world data to show its performance.

Submitted 9 September, 2016; originally announced September 2016.

arXiv:1609.02966 [pdf]

Android and Wireless data-extraction using Wi-Fi

Authors: Bert Busstra, Nhien-An Le-Khac, M-Tahar Kechadi

Abstract: Today, mobile phones are a very popular, fast-growing technology. Present-day mobile phones are more and more like small computers. These so-called "smartphones" each contain a wealth of information. This information has proven to be very useful in crime investigations, because relevant evidence can be found in data retrieved from mobile phones used by criminals. In traditional methods, the data from mobile phones is extracted using a USB cable. However, when for some reason this USB cable connection cannot be made, the data must be extracted in an alternative way. In this paper, we study the possibility of extracting data from mobile devices using a Wi-Fi connection. We describe our approach on mobile devices running the Android Operating System. Through our experiments, we also give recommendations on which application and protocol are best suited to retrieving data.

Submitted 9 September, 2016; originally announced September 2016.

arXiv:1609.02954 [pdf]

Forensics Acquisition of IMVU: A Case Study

Authors: Robert van Voorst, M-Tahar Kechadi, Nhien-An Le-Khac

Abstract: There are many applications available for personal computers and mobile devices that help users meet potential partners. There is, however, a risk associated with the level of anonymity when using instant message applications, because of the potential for predators to attract and lure vulnerable users. Today Instant Messaging within a Virtual Universe (IMVU) combines custom avatars, chat or instant message (IM), community, content creation, commerce, and anonymity. IMVU is also being exploited by criminals to commit a wide variety of offenses. However, there is very little research on digital forensic acquisition of IMVU applications. In this paper, we first discuss the challenges of IMVU forensics. We present a forensic acquisition of an IMVU 3D application as a case study. We also describe and analyse our experiments with this application.

Submitted 9 September, 2016; originally announced September 2016.

arXiv:1609.02953 [pdf]

Toward a new tool to extract the Evidence from a Memory Card of Mobile phones

Authors: Rob Witteman, Arjen Meijer, M-T. Kechadi, Nhien-An Le-Khac

Abstract: Today, a mobile phone is not just a phone but a computer that you can also use for calling someone. In criminal investigations the importance of evidence from the mobile phone is increasing as more and more phones are seized at the Digital Forensic Department of the police. The number of memory cards from these mobile phones that need to be investigated separately is also increasing. Possible reasons are that the mobile phone investigation software does not support the specific mobile phone or the specific artefacts relevant to that investigation. Sometimes the software investigates just the internal memory of the mobile phone and not the data written on the memory card. Moreover, even when the mobile phone has been investigated by the dedicated software, the associated memory card may evidently contain additional important information. The current procedure to get all of the usable information from a memory card of a mobile phone is a very time-consuming process and not user-friendly. In this paper, we present a new single tool to simplify the investigation of a memory card from a mobile phone. We also test our tool with the WhatsApp application installed on the memory card from different mobile phones.

Submitted 9 September, 2016; originally announced September 2016.

arXiv:1609.02031 [pdf]

An efficient Search Tool for an Anti-Money Laundering Application of an Multi-national Bank's Dataset

Authors: Nhien-An Le-Khac, Sammer Markos, Michael O'Neill, Anthony Brabazon, Tahar Kechadi

Abstract: Today, money laundering (ML) poses a serious threat not only to financial institutions but also to nations. This criminal activity is becoming more and more sophisticated and seems to have moved from the cliché of drug trafficking to financing terrorism and surely not forgetting personal gain. Most financial institutions internationally have been implementing anti-money laundering (AML) solutions to fight investment fraud activities. In AML, customer identification is an important task which helps AML experts to monitor customer habits, such as customer domicile and the transactions they are involved in. However, simple query tools provided by current DBMS as well as naive approaches to customer searching may produce incorrect and ambiguous results, and their processing time is also very high due to the complexity of the database system architecture. In this paper, we present a new approach for identifying customers registered in an investment bank. This approach is developed as a tool that allows AML experts to quickly identify customers who are managed independently across separate databases. It is tested on large real-world financial datasets. Some preliminary experimental results show that this new approach is efficient and effective.

Submitted 28 March, 2017; v1 submitted 4 September, 2016; originally announced September 2016.

arXiv:1609.00992 [pdf]

Performance Evaluation of a Natural Language Processing approach applied in White Collar crime investigation

Authors: Maarten Banerveld, Nhien-An Le-Khac, Tahar Kechadi

Abstract: In today's world we are confronted with increasing amounts of information every day coming from a large variety of sources. People and corporations are producing data on a large scale, and since the rise of the internet, e-mail and social media the amount of produced data has grown exponentially. From a law enforcement perspective we have to deal with these huge amounts of data when a criminal investigation is launched against an individual or company. Relevant questions need to be answered, such as who committed the crime, who was involved, what happened and at what time, and who was communicating and about what. Not only has the amount of available data to investigate increased enormously, but the complexity of this data has also increased. When these communication patterns need to be combined with, for instance, a seized financial administration or corporate document shares, a complex investigation problem arises. Recently, criminal investigators have faced a huge challenge when evidence of a crime needs to be found in the Big Data environment, where they have to deal with large and complex datasets, especially in financial and fraud investigations. To tackle this problem, a financial and fraud investigation unit of a European country has developed a new tool named LES that uses Natural Language Processing (NLP) techniques to help criminal investigators handle large amounts of textual information in a more efficient and faster way. In this paper, we briefly present this tool and focus on the evaluation of its performance in terms of the requirements of forensic investigation: speed, smarter and easier for investigators. In order to evaluate this LES tool, we use different performance metrics. We also show experimental results of our evaluation with large and complex datasets from a real-world application.

Submitted 4 September, 2016; originally announced September 2016.

arXiv:1609.00990 [pdf]

A data mining-based solution for detecting suspicious money laundering cases in an investment bank

Authors: Nhien-An Le-Khac, Sammer Markos, Tahar Kechadi

Abstract: Today, money laundering poses a serious threat not only to financial institutions but also to the nation. This criminal activity is becoming more and more sophisticated and seems to have moved from the cliché of drug trafficking to financing terrorism and surely not forgetting personal gain. Most international financial institutions have been implementing anti-money laundering solutions to fight investment fraud. However, traditional investigative techniques consume numerous man-hours. Recently, data mining approaches have been developed and are considered as well-suited techniques for detecting money laundering activities. Within the scope of a collaboration project for the purpose of developing a new solution for the anti-money laundering Units in an international investment bank, we proposed a simple and efficient data mining-based solution for anti-money laundering. In this paper, we present this solution developed as a tool and show some preliminary experiment results with real transaction datasets.

Submitted 4 September, 2016; originally announced September 2016.

arXiv:1609.00988 [pdf]

A clustering-based data reduction for very large spatio-temporal datasets

Authors: Nhien-An Le-Khac, Martin Bue, Michael Whelan, Tahar Kechadi

Abstract: Today, huge amounts of data are being collected with spatial and temporal components from sources such as meteorological, satellite imagery etc. Efficient visualisation as well as discovery of useful knowledge from these datasets is therefore very challenging and becoming a massive economic need. Data Mining has emerged as the technology to discover hidden knowledge in very large amounts of data. Furthermore, data mining techniques could be applied to decrease the large size of raw data by retrieving its useful knowledge as representatives. As a consequence, instead of dealing with a large size of raw data, we can use these representatives to visualise or to analyse without losing important information. This paper presents a new approach based on different clustering techniques for data reduction to help analyse very large spatio-temporal data. We also present and discuss preliminary results of this approach.

Submitted 29 March, 2017; v1 submitted 4 September, 2016; originally announced September 2016.

arXiv:1510.00661 [pdf, other]

HTML5 Zero Configuration Covert Channels: Security Risks and Challenges

Authors: Jason Farina, Mark Scanlon, Stephen Kohlmann, Nhien-An Le Khac, M-Tahar Kechadi

Abstract: In recent months there has been an increase in the popularity and public awareness of secure, cloudless file transfer systems. The aim of these services is to facilitate the secure transfer of files in a peer-to-peer (P2P) fashion over the Internet without the need for centralised authentication or storage. These services can take the form of client installed applications or entirely web browser based interfaces. Due to their P2P nature, there is generally no limit to the file sizes involved or to the volume of data transmitted - and where these limitations do exist they will be purely reliant on the capacities of the systems at either end of the transfer. By default, many of these services provide seamless, end-to-end encryption to their users. The cybersecurity and cyberforensic consequences of the potential criminal use of such services are significant. The ability to easily transfer encrypted data over the Internet opens up a range of opportunities for illegal use to cybercriminals requiring minimal technical know-how. This paper explores a number of these services and provides an analysis of the risks they pose to corporate and governmental security. A number of methods for the forensic investigation of such transfers are discussed.

Submitted 2 October, 2015; originally announced October 2015.

Comments: 15 pages; Proc. of Tenth ADFSL Conference on Digital Forensics, Security and Law (CDFSL 2015)

Showing 1–50 of 60 results for author: Kechadi, M