Search | arXiv e-print repository

arXiv:2403.20063 [pdf, other]

Dataversifying Natural Sciences: Pioneering a Data Lake Architecture for Curated Data-Centric Experiments in Life & Earth Sciences

Authors: Genoveva Vargas-Solar, Jérôme Darmont, Alejandro Adorjan, Javier A. Espinosa-Oviedo, Carmem Hara, Sabine Loudcher, Regina Motz, Martin Musicante, José-Luis Zechinelli-Martini

Abstract: This vision paper introduces a pioneering data lake architecture designed to meet Life & Earth sciences' burgeoning data management needs. As the data landscape evolves, the imperative to navigate and maximize scientific opportunities has never been greater. Our vision paper outlines a strategic approach to unify and integrate diverse datasets, aiming to cultivate a collaborative space conducive to scientific discovery. The core of the design and construction of a data lake is the development of formal and semi-automatic tools, enabling the meticulous curation of quantitative and qualitative data from experiments. Our unique "research-in-the-loop" methodology ensures that scientists across various disciplines are integrally involved in the curation process, combining automated, mathematical, and manual tasks to address complex problems, from seismic detection to biodiversity studies. By fostering reproducibility and applicability of research, our approach enhances the integrity and impact of scientific experiments. This initiative is set to improve data management practices, strengthening the capacity of Life & Earth sciences to solve some of our time's most critical environmental and biological challenges.

Submitted 29 March, 2024; originally announced March 2024.

Journal ref: 8th International workshop on Data Analytics solutions for Real-LIfe APplications (DARLI-AP@EDBT/ICDT 2024), Mar 2024, Paestum, Italy

arXiv:2307.01568 [pdf]

An Ontology-based Collaborative Business Intelligence Framework

Authors: Muhammad Fahad, Jérôme Darmont

Abstract: Business Intelligence constitutes a set of methodologies and tools aiming at querying, reporting, on-line analytic processing (OLAP), generating alerts, performing business analytics, etc. When different collaborators need to perform these tasks collectively, a Collaborative Business Intelligence (CBI) platform is required. CBI plays a significant role in targeting a common goal among various companies, but it requires them to connect, organize and coordinate with each other to share opportunities, respecting their own autonomy and heterogeneity. This paper presents a CBI platform that democratizes data by allowing BI users to easily connect, share and visualize data among collaborators, obtain actionable answers by collaborative analysis, investigate and make collaborative decisions, and also store the analyses along with graphical diagrams and charts in a collaborative ontology knowledge base. Our CBI framework supports and assists information sharing, collaborative decision-making and annotation management beyond the boundaries of individuals and enterprises.

Submitted 4 July, 2023; originally announced July 2023.

Journal ref: 12th International Conference on Data Science, Technology and Applications (DATA 2023), INSTICC, Jul 2023, Rome, Italy

arXiv:2304.10556 [pdf]

A Reference Model for Collaborative Business Intelligence Virtual Assistants

Authors: Olga Cherednichenko, Fahad Muhammad, Jérôme Darmont, Cécile Favre

Abstract: Collaborative Business Analysis (CBA) is a methodology that involves bringing together different stakeholders, including business users, analysts, and technical specialists, to collaboratively analyze data and gain insights into business operations. The primary objective of CBA is to encourage knowledge sharing and collaboration between the different groups involved in business analysis, as this can lead to a more comprehensive understanding of the data and better decision-making. CBA typically involves a range of activities, including data gathering and analysis, brainstorming, problem-solving, decision-making and knowledge sharing. These activities may take place through various channels, such as in-person meetings, virtual collaboration tools or online forums. This paper deals with virtual collaboration tools as an important part of a Business Intelligence (BI) platform. Collaborative Business Intelligence (CBI) tools are becoming more user-friendly, accessible, and flexible, allowing users to customize their experience and adapt to their specific needs. The goal of a virtual assistant is to make data exploration more accessible to a wider range of users and to reduce the time and effort required for data analysis. This paper describes the unified business intelligence semantic model, coupled with a data warehouse and collaborative unit to employ data mining technology. Moreover, we propose a virtual assistant for CBI and a reference model of virtual tools for CBI, which consists of three components: conversational, data exploration and recommendation agents.
We believe that the allocation of these three functional tasks allows you to structure the CBI issue and apply relevant and productive models for human-like dialogue, text-to-command transferring, and recommendations simultaneously. The complex approach based on these three points gives the basis for a virtual tool for collaboration. CBI encourages people, processes, and technology to enable everyone to share and leverage collective expertise, knowledge and data to gain valuable insights for making better decisions. This makes it possible to respond more quickly and effectively to changes in the market or internal operations and improve the progress.

Submitted 20 April, 2023; originally announced April 2023.

Journal ref: 6th International Conference on Computational Linguistics and Intelligent Systems (CoLInS 2022), Apr 2023, Kharkiv, Ukraine

arXiv:2302.05289 [pdf, other]

doi 10.1007/s10796-022-10315-z

Rumor Classification through a Multimodal Fusion Framework and Ensemble Learning

Authors: Abderrazek Azri, Cécile Favre, Nouria Harbi, Jérôme Darmont, Camille Noûs

Abstract: The proliferation of rumors on social media has become a major concern due to its ability to create a devastating impact. Manually assessing the veracity of social media messages is a very time-consuming task that can be much helped by machine learning. Most message veracity verification methods only exploit textual contents and metadata. Very few take both textual and visual contents, and more particularly images, into account. Moreover, prior works have used many classical machine learning models to detect rumors. However, although recent studies have proven the effectiveness of ensemble machine learning approaches, such models have seldom been applied. Thus, in this paper, we propose a set of advanced image features that are inspired from the field of image quality assessment, and introduce the Multimodal fusiON framework to assess message veracIty in social neTwORks (MONITOR), which exploits all message features by exploring various machine learning models. Moreover, we demonstrate the effectiveness of ensemble learning algorithms for rumor detection by using five metalearning models. Eventually, we conduct extensive experiments on two real-world datasets. Results show that MONITOR outperforms state-of-the-art machine learning baselines and that ensemble models significantly increase MONITOR's performance.

Submitted 4 January, 2023; originally announced February 2023.

Comments: Information Systems Frontiers, 2022

arXiv:2211.00995 [pdf, other]

The Collaborative Business Intelligence Ontology (CBIOnt)

Authors: Muhammad Fahad, Jérôme Darmont, Cécile Favre

Abstract: In the current era, many disciplines are seen devoted towards ontology development for their domains with the intention of creating, disseminating and managing resource descriptions of their domain knowledge in a machine-understandable and processable manner. Ontology construction is a difficult group activity that involves many people with different expertise. Generally, domain experts are not familiar with the ontology implementation environments and implementation experts do not have all the domain knowledge. We have designed the Collaborative Business Intelligence Ontology (CBIOnt) for the BI4People project. In this paper, we present CBIOnt, an OWL 2 DL ontology for the description of collaborative sessions between different collaborators working together on the business intelligence platform. As a collaborative session between various collaborators belongs to some collaborative form, phase and research aspect, CBIOnt captures this knowledge along with the collaborative session content (comments, questions, answers, etc.) so that one can infer various types of information stored in ontologies when required. In addition, it stores the location and temporal-spatial information about the collaboration held between collaborators. We believe CBIOnt serves as a formal framework for dealing with the collaborative sessions taking place among collaborators on the semantic Web.

Submitted 2 November, 2022; originally announced November 2022.

Journal ref: 18e journées Business Intelligence et Big Data (EDA 2022), Oct 2022, Clermont-Ferrand, France

arXiv:2210.02237 [pdf, other]

doi 10.1007/978-3-031-15740-0_23

Dimensional Data KNN-Based Imputation

Authors: Yuzhao Yang, Jérôme Darmont, Franck Ravat, Olivier Teste

Abstract: Data Warehouses (DWs) are core components of Business Intelligence (BI). Missing data in DWs have a great impact on data analyses. Therefore, missing data need to be completed. Unlike other existing data imputation methods mainly adapted for facts, we propose a new imputation method for dimensions. This method contains two steps: 1) a hierarchical imputation and 2) a k-nearest neighbors (KNN) based imputation. Our solution has the advantage of taking into account the DW structure and dependency constraints. Experimental assessments validate our method in terms of effectiveness and efficiency.

Submitted 5 October, 2022; originally announced October 2022.

Journal ref: 26th European Conference on Advances in Databases and Information Systems (ADBIS 2022), Sep 2022, Turin, Italy. pp.315-329
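
Illustration: the two-step idea named in the abstract above (hierarchical imputation, then KNN-based imputation) can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' method: the function names, the city/region hierarchy, the attribute names, and the similarity measure (number of matching attribute values) are all hypothetical choices made for the example.

```python
from collections import Counter

def hierarchical_impute(rows, child, parent):
    """Step 1 (assumed form): fill a missing parent level from any other
    row that has the same child value and a known parent value."""
    known = {r[child]: r[parent] for r in rows
             if r[child] is not None and r[parent] is not None}
    for r in rows:
        if r[parent] is None and r[child] in known:
            r[parent] = known[r[child]]

def knn_impute(rows, target, features, k=3):
    """Step 2 (assumed form): fill remaining gaps in `target` by majority
    vote among the k rows sharing the most feature values."""
    donors = [r for r in rows if r[target] is not None]
    for r in rows:
        if r[target] is not None:
            continue
        def similarity(d):
            # Count attributes on which the incomplete row and the donor agree.
            return sum(1 for f in features if r[f] is not None and r[f] == d[f])
        nearest = sorted(donors, key=similarity, reverse=True)[:k]
        if nearest:
            r[target] = Counter(d[target] for d in nearest).most_common(1)[0][0]

# Hypothetical dimension table; None marks a missing value.
rows = [
    {"city": "Lyon", "region": "ARA", "country": "France", "segment": "retail"},
    {"city": "Lyon", "region": None, "country": "France", "segment": "retail"},
    {"city": "Turin", "region": "Piedmont", "country": "Italy", "segment": "wholesale"},
    {"city": "Grenoble", "region": None, "country": "France", "segment": "retail"},
]
hierarchical_impute(rows, child="city", parent="region")
knn_impute(rows, target="region", features=["country", "segment"], k=2)
```

In this toy table, the second row is completed from the hierarchy (another row with the same city), while the fourth row has no such match and falls back to its nearest neighbors.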

arXiv:2110.15727 [pdf, other]

doi 10.1007/978-3-030-86517-7_31

Calling to CNN-LSTM for Rumor Detection: A Deep Multi-channel Model for Message Veracity Classification in Microblogs

Authors: Abderrazek Azri, Cécile Favre, Nouria Harbi, Jérôme Darmont, Camille Noûs

Abstract: Reputed for their low-cost, easy-access, real-time and valuable information, social media also wildly spread unverified or fake news. Rumors can notably cause severe damage on individuals and the society. Therefore, rumor detection on social media has recently attracted tremendous attention. Most rumor detection approaches focus on rumor feature analysis and social features, i.e., metadata in social media. Unfortunately, these features are data-specific and may not always be available, e.g., when the rumor has just popped up and not yet propagated. In contrast, post contents (including images or videos) play an important role and can indicate the diffusion purpose of a rumor. Furthermore, rumor classification is also closely related to opinion mining and sentiment analysis. Yet, to the best of our knowledge, exploiting images and sentiments is little investigated. Considering the available multimodal features from microblogs, notably, we propose in this paper an end-to-end model called deepMONITOR that is based on deep neural networks and allows quite accurate automated rumor verification, by utilizing all three characteristics: post textual and image contents, as well as sentiment. deepMONITOR concatenates image features with the joint text and sentiment features to produce a reliable, fused classification. We conduct extensive experiments on two large-scale, real-world datasets. The results show that deepMONITOR achieves a higher accuracy than state-of-the-art methods.

Submitted 11 October, 2021; originally announced October 2021.

Journal ref: Joint European Conference on Machine Learning and Knowledge Discovery in Databases (ECML PKDD 2021), Sep 2021, Bilbao, Spain. pp.497-513

arXiv:2110.01228 [pdf, ps, other]

doi 10.1007/978-3-030-86472-9_22

Internal Data Imputation in Data Warehouse Dimensions

Authors: Yuzhao Yang, Fatma Abdelhedi, Jérôme Darmont, Franck Ravat, Olivier Teste

Abstract: Missing values occur commonly in multidimensional data warehouses. They may generate problems of usefulness of data since the analysis performed on a multidimensional data warehouse is through different dimensions with hierarchies where we can roll up or drill down to the different parameters of analysis. Therefore, it's essential to complete these missing values in order to carry out a better analysis. There are existing data imputation methods which are suitable for numeric data, so they can be applied for fact tables but not for dimension tables. Some other data imputation methods need extra time and effort costs. As a consequence, we propose in this article an internal data imputation method for multidimensional data warehouses based on the existing data and considering the intra-dimension and inter-dimension relationships.

Submitted 4 October, 2021; originally announced October 2021.

Journal ref: 32nd International Conference on Database and Expert Systems Applications (DEXA 2021), Sep 2021, Linz, Austria. pp.237-244

arXiv:2110.01227 [pdf, other]

doi 10.1007/978-3-030-86534-4_2

Benchmarking Data Lakes Featuring Structured and Unstructured Data with DLBench

Authors: Pegdwendé Sawadogo, Jérôme Darmont

Abstract: In the last few years, the concept of data lake has become trendy for data storage and analysis. Thus, several design alternatives have been proposed to build data lake systems. However, these proposals are difficult to evaluate as there are no commonly shared criteria for comparing data lake systems. Thus, we introduce DLBench, a benchmark to evaluate and compare data lake implementations that support textual and/or tabular contents. More concretely, we propose a data model made of both textual and raw tabular documents, a workload model composed of a set of various tasks, as well as a set of performance-based metrics, all relevant to the context of data lakes. As a proof of concept, we use DLBench to evaluate an open source data lake system we previously developed.

Submitted 4 October, 2021; originally announced October 2021.

Journal ref: 23rd International Conference on Big Data Analytics and Knowledge Discovery (DaWaK 2021), Sep 2021, Linz, Austria. pp.15-26

arXiv:2109.02271 [pdf, other]

doi 10.1007/978-3-030-82472-3

MONITOR: A Multimodal Fusion Framework to Assess Message Veracity in Social Networks

Authors: Abderrazek Azri, Cécile Favre, Nouria Harbi, Jérôme Darmont, Camille Noûs

Abstract: Users of social networks tend to post and share content with little restraint. Hence, rumors and fake news can quickly spread on a huge scale. This may pose a threat to the credibility of social media and can cause serious consequences in real life. Therefore, the task of rumor detection and verification has become extremely important. Assessing the veracity of a social media message (e.g., by fact checkers) involves analyzing the text of the message, its context and any multimedia attachment. This is a very time-consuming task that can be much helped by machine learning. In the literature, most message veracity verification methods only exploit textual contents and metadata. Very few take both textual and visual contents, and more particularly images, into account. In this paper, we second the hypothesis that exploiting all of the components of a social media post enhances the accuracy of veracity detection. To further the state of the art, we first propose using a set of advanced image features that are inspired from the field of image quality assessment, which effectively contributes to rumor detection. These metrics are good indicators for the detection of fake images, even for those generated by advanced techniques like generative adversarial networks (GANs). Then, we introduce the Multimodal fusiON framework to assess message veracIty in social neTwORks (MONITOR), which exploits all message features (i.e., text, social context, and image features) by supervised machine learning.
Such algorithms provide interpretability and explainability in the decisions taken, which we believe is particularly important in the context of rumor verification. Experimental results show that MONITOR can detect rumors with an accuracy of 96% and 89% on the MediaEval benchmark and the FakeNewsNet dataset, respectively. These results are significantly better than those of state-of-the-art machine learning baselines.

Submitted 6 September, 2021; originally announced September 2021.

Comments: 25th European Conference on Advances in Databases and Information Systems (ADBIS 2021), Aug 2021, Tartu, Estonia

arXiv:2109.01374 [pdf, other]

Joint Management and Analysis of Textual Documents and Tabular Data within the AUDAL Data Lake

Authors: Pegdwendé Sawadogo, Jérôme Darmont, Camille Noûs

Abstract: In 2010, the concept of data lake emerged as an alternative to data warehouses for big data management. Data lakes follow a schema-on-read approach to provide rich and flexible analyses. However, although trendy in both the industry and academia, the concept of data lake is still maturing, and there are still few methodological approaches to data lake design. Thus, we introduce a new approach to design a data lake and propose an extensive metadata system to activate richer features than those usually supported in data lake approaches. We implement our approach in the AUDAL data lake, where we jointly exploit both textual documents and tabular data, in contrast with structured and/or semi-structured data typically processed in data lakes from the literature. Furthermore, we also innovate by leveraging metadata to activate both data retrieval and content analysis, including Text-OLAP and SQL querying. Finally, we show the feasibility of our approach using a real-world use case on the one hand, and a benchmark on the other hand.

Submitted 3 September, 2021; originally announced September 2021.

Journal ref: 25th European Conference on Advances in Databases and Information Systems (ADBIS 2021), Aug 2021, Tartu, Estonia. pp.88-101

arXiv:2108.05689 [pdf, other]

doi 10.1007/s10796-020-09999-y

TextBenDS: a generic Textual data Benchmark for Distributed Systems

Authors: Ciprian-Octavian Truica, Elena Apostol, Jérôme Darmont, Ira Assent

Abstract: Extracting top-k keywords and documents using weighting schemes are popular techniques employed in text mining and machine learning for different analysis and retrieval tasks. The weights are usually computed in the data preprocessing step, as they are costly to update and keep track of all the modifications performed on the dataset. Furthermore, computation errors are introduced when analyzing only subsets of the dataset. Therefore, in a Big Data context, it is crucial to lower the runtime of computing weighting schemes, without hindering the analysis process and the accuracy of the machine learning algorithms. To address this requirement for the task of top-k keywords and documents, it is customary to design benchmarks that compare weighting schemes within various configurations of distributed frameworks and database management systems. Thus, we propose a generic document-oriented benchmark for storing textual data and constructing weighting schemes (TextBenDS). Our benchmark offers a generic data model designed with a multidimensional approach for storing text documents. We also propose using aggregation queries with various complexities and selectivities for constructing term weighting schemes, that are utilized in extracting top-k keywords and documents. We evaluate the computing performance of the queries on several distributed environments set within the Apache Hadoop ecosystem. Our experimental results provide interesting insights.
As an example, MongoDB proves to have the best overall performance, while Spark's execution time remains almost the same, regardless of the weighting schemes.

Submitted 12 August, 2021; originally announced August 2021.

Journal ref: Information Systems Frontiers, Springer Verlag, 2021, Breakthroughs on Cross-Cutting Data Management, Data Analytics and Applied Data Science, 23, pp.81-100
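
Illustration: the abstract above refers to extracting top-k keywords with term weighting schemes. The sketch below shows what such a computation might look like in memory, assuming TF-IDF as the weighting scheme (the abstract does not name a specific scheme). The function name, the whitespace tokenizer, and the sample documents are hypothetical; the benchmark itself expresses this kind of computation as aggregation queries over distributed systems, which this sketch does not attempt to reproduce.

```python
import math
from collections import Counter

def top_k_keywords(docs, k=3):
    """Rank terms by their TF-IDF weight summed over all documents
    (TF-IDF is an assumed weighting scheme for this illustration)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = Counter()
    for tokens in tokenized:
        tf = Counter(tokens)
        for term, count in tf.items():
            # Normalized term frequency times inverse document frequency.
            scores[term] += (count / len(tokens)) * math.log(n_docs / df[term])
    return [term for term, _ in scores.most_common(k)]

# Hypothetical corpus; "data" occurs in every document, so its IDF is zero.
docs = ["data lake metadata", "data warehouse olap", "data lake benchmark"]
keywords = top_k_keywords(docs, k=3)
```

Because the weights depend on corpus-wide document frequencies, restricting the computation to a subset of documents changes the scores, which is the kind of subset effect the abstract alludes to.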

arXiv:2107.12055 [pdf, other]

doi 10.1145/3472163.3472268

An Automatic Schema-Instance Approach for Merging Multidimensional Data Warehouses

Authors: Yuzhao Yang, Jérôme Darmont, Franck Ravat, Olivier Teste

Abstract: Using data warehouses to analyse multidimensional data is a significant task in company decision-making. The data warehouse merging process is composed of two steps: matching multidimensional components and then merging them. Current approaches do not take all the particularities of multidimensional data warehouses into account, e.g., only merging schemata, but not instances; or not exploiting hierarchies nor fact tables. Thus, in this paper, we propose an automatic merging approach for star schema-modeled data warehouses that works at both the schema and instance levels. We also provide algorithms for merging hierarchies, dimensions and facts. Eventually, we implement our merging algorithms and validate them with the use of both synthetic and benchmark datasets.

Submitted 26 July, 2021; originally announced July 2021.

Comments: 25th International Database Engineering & Applications Symposium (IDEAS 2021), Jul 2021, Montreal, Canada

arXiv:2107.11157 [pdf, other]

doi 10.1145/3472163.3472266

ArchaeoDAL: A Data Lake for Archaeological Data Management and Analytics

Authors: Pengfei Liu, Sabine Loudcher, Jérôme Darmont, Camille Noûs

Abstract: With new emerging technologies, such as satellites and drones, archaeologists collect data over large areas. However, it becomes difficult to process such data in time. Archaeological data also have many different formats (images, texts, sensor data) and can be structured, semi-structured and unstructured. Such variety makes data difficult to collect, store, manage, search and analyze effectively. A few approaches have been proposed, but none of them covers the full data lifecycle nor provides an efficient data management system. Hence, we propose the use of a data lake to provide centralized data stores to host heterogeneous data, as well as tools for data quality checking, cleaning, transformation, and analysis. In this paper, we propose a generic, flexible and complete data lake architecture. Our metadata management system exploits goldMEDAL, which is the most complete metadata model currently available. Finally, we detail the concrete implementation of this architecture dedicated to an archaeological project.

Submitted 23 July, 2021; originally announced July 2021.

Comments: 25th International Database Engineering & Applications Symposium (IDEAS 2021), Jul 2021, Montréal, Canada

arXiv:2107.11152 [pdf, other]

doi 10.1007/s10844-020-00608-7

On data lake architectures and metadata management

Authors: Pegdwendé Sawadogo, Jérôme Darmont

Abstract: Over the past two decades, we have witnessed an exponential increase of data production in the world. So-called big data generally come from transactional systems, and even more so from the Internet of Things and social media. They are mainly characterized by volume, velocity, variety and veracity issues. Big data-related issues strongly challenge traditional data management and analysis systems. The concept of data lake was introduced to address them. A data lake is a large, raw data repository that stores and manages all company data bearing any format. However, the data lake concept remains ambiguous or fuzzy for many researchers and practitioners, who often confuse it with the Hadoop technology. Thus, we provide in this paper a comprehensive state of the art of the different approaches to data lake design. We particularly focus on data lake architectures and metadata management, which are key issues in successful data lakes. We also discuss the pros and cons of data lakes and their design alternatives.

Submitted 23 July, 2021; originally announced July 2021.

Journal ref: Journal of Intelligent Information Systems, Springer Verlag, 2021, 56 (1), pp.97-120

arXiv:2107.04027 [pdf, other]

goldMEDAL : une nouvelle contribution à la modélisation générique des métadonnées des lacs de données

Authors: Etienne Scholly, Pegdwendé Sawadogo, Pengfei Liu, Javier Espinosa-Oviedo, Cécile Favre, Sabine Loudcher, Jérôme Darmont, Camille Noûs

Abstract: We summarize here a paper published in 2021 in the DOLAP international workshop associated with the EDBT and ICDT conferences. We propose goldMEDAL, a generic metadata model for data lakes based on four concepts and a three-level modeling: conceptual, logical and physical.

Submitted 5 July, 2021; originally announced July 2021.

Comments: in French. 17e journées Business Intelligence et Big Data (EDA 2021), Jul 2021, Toulouse, France

arXiv:2103.13155 [pdf, other]

Coining goldMEDAL: A New Contribution to Data Lake Generic Metadata Modeling

Authors: Etienne Scholly, Pegdwendé Sawadogo, Pengfei Liu, Javier Alfonso Espinosa-Oviedo, Cécile Favre, Sabine Loudcher, Jérôme Darmont, Camille Noûs

Abstract: The rise of big data has revolutionized data exploitation practices and led to the emergence of new concepts. Among them, data lakes have emerged as large heterogeneous data repositories that can be analyzed by various methods. An efficient data lake requires a metadata system that addresses the many problems arising when dealing with big data. In consequence, the study of data lake metadata models is currently an active research topic and many proposals have been made in this regard. However, existing metadata models are either tailored for a specific use case or insufficiently generic to manage different types of data lakes, including our previous model MEDAL. In this paper, we generalize MEDAL's concepts in a new metadata model called goldMEDAL. Moreover, we compare goldMEDAL with the most recent state-of-the-art metadata models aiming at genericity and show that we can reproduce these metadata models with goldMEDAL's concepts. As a proof of concept, we also illustrate that goldMEDAL allows the design of various data lakes by presenting three different use cases.

Submitted 24 March, 2021; originally announced March 2021.

Journal ref: 23rd International Workshop on Design, Optimization, Languages and Analytical Processing of Big Data (DOLAP@EDBT/ICDT 2021), Mar 2021, Nicosia, Cyprus

arXiv:2102.02246 [pdf, other]

doi 10.1016/j.bdr.2021.100205

The Forgotten Document-Oriented Database Management Systems: An Overview and Benchmark of Native XML DODBMSes in Comparison with JSON DODBMSes

Authors: Ciprian-Octavian Truică, Elena-Simona Apostol, Jérôme Darmont, Torben Bach Pedersen

Abstract: In the current context of Big Data, a multitude of new NoSQL solutions for storing, managing, and extracting information and patterns from semi-structured data have been proposed and implemented. These solutions were developed to relieve the issue of rigid data structures present in relational databases, by introducing semi-structured and flexible schema design. As current data generated by different sources and devices, especially from IoT sensors and actuators, use either XML or JSON format, depending on the application, database technologies that store and query semi-structured data in XML format are needed. Thus, Native XML Databases, which were initially designed to manipulate XML data using standardized querying languages, i.e., XQuery and XPath, were rebranded as NoSQL Document-Oriented Databases Systems. Currently, the majority of these solutions have been replaced with the more modern JSON based Database Management Systems. However, we believe that XML-based solutions can still deliver performance in executing complex queries on heterogeneous collections. Unfortunately nowadays, research lacks a clear comparison of the scalability and performance for database technologies that store and query documents in XML versus the more modern JSON format. Moreover, to the best of our knowledge, there are no Big Data-compliant benchmarks for such database technologies.
In this paper, we present a comparison for selected Document-Oriented Database Systems that either use the XML format to encode documents, i.e., BaseX, eXist-db, and Sedna, or the JSON format, i.e., MongoDB, CouchDB, and Couchbase. To underline the performance differences we also propose a benchmark that uses a heterogeneous complex schema on a large DBLP corpus.

Submitted 3 February, 2021; originally announced February 2021.

Comments: 28 pages, 6 figures, 7 tables

ACM Class: H.2

Journal ref: Big Data Research, Vol. 25, July 2021

arXiv:2012.02454 [pdf, other]

doi 10.1145/3423603.3424004

Data Lakes for Digital Humanities

Authors: Jérôme Darmont, Cécile Favre, Sabine Loudcher, Camille Noûs

Abstract: Traditional data in Digital Humanities projects bear various formats (structured, semi-structured, textual) and need substantial transformations (encoding and tagging, stemming, lemmatization, etc.) to be managed and analyzed. To fully master this process, we propose the use of data lakes as a solution to data siloing and big data variety problems. We describe data lake projects we currently run in close collaboration with researchers in humanities and social sciences and discuss the lessons learned running these projects.

Submitted 4 December, 2020; originally announced December 2020.

Comments: Data and Digital Humanities Track

Journal ref: 2nd International Digital Tools & Uses Congress (DTUC 2020), Oct 2020, Hammamet, Tunisia. pp.38-41

arXiv:2012.01184 [pdf, other]

Feedback from the participants of the ADBIS, TPDL and EDA 2020 joint conferences

Authors: Pegdwendé Sawadogo, Jérôme Darmont, Fabien Duchateau

Abstract: This paper presents the way the joint ADBIS, TPDL and EDA 2020 conferences were organized online and the results of the participant survey conducted thereafter. We present the lessons learned from the participants' feedback.

Submitted 19 December, 2020; v1 submitted 27 November, 2020; originally announced December 2020.

Comments: 7 pages, 16 figures

ACM Class: E.0; H.0; I.0

arXiv:2008.11409 [pdf, other]

Automatic Integration Issues of Tabular Data for On-Line Analysis Processing

Authors: Yuzhao Yang, Jérôme Darmont, Franck Ravat, Olivier Teste

Abstract: Companies and individuals produce numerous tabular data. The objective of this position paper is to draw up the challenges posed by the automatic integration of data in the form of tables so that they can be cross-analyzed. We provide a first automatic solution for the integration of such tabular data to allow On-Line Analysis Processing. To fulfil this task, features of tabular data should be analyzed and the challenge of automatic multidimensional schema generation should be addressed. Hence, we propose a typology of tabular data and discuss our idea of an automatic solution.

Submitted 1 September, 2020; v1 submitted 26 August, 2020; originally announced August 2020.

Journal ref: 16e journées EDA Business Intelligence & Big Data (EDA 2020), Aug 2020, Lyon, France. pp.5-18

arXiv:2008.01196 [pdf, other]

Including Images into Message Veracity Assessment in Social Media

Authors: Abderrazek Azri, Cécile Favre, Nouria Harbi, Jérôme Darmont

Abstract: The extensive use of social media in the diffusion of information has also laid a fertile ground for the spread of rumors, which could significantly affect the credibility of social media. An ever-increasing number of users post news including, in addition to text, multimedia data such as images and videos. Yet, such multimedia content is easily editable due to the broad availability of simple and effective image and video processing tools. The problem of assessing the veracity of social network posts has attracted a lot of attention from researchers in recent years. However, almost all previous works have focused on analyzing textual contents to determine veracity, while visual contents, and more particularly images, remain ignored or little exploited in the literature. In this position paper, we propose a framework that explores two novel ways to assess the veracity of messages published on social networks by analyzing the credibility of both their textual and visual contents.

Submitted 20 July, 2020; originally announced August 2020.

Journal ref: 8th International Conference on Innovation and New Trends in Information Technology (INTIS 2019), Dec 2019, Tangier, Morocco

arXiv:1909.09377 [pdf, other]

doi 10.1007/978-3-030-30278-8

Metadata Systems for Data Lakes: Models and Features

Authors: Pegdwendé Sawadogo, Etienne Scholly, Cécile Favre, Eric Ferey, Sabine Loudcher, Jérôme Darmont

Abstract: Over the past decade, the data lake concept has emerged as an alternative to data warehouses for storing and analyzing big data. A data lake allows storing data without any predefined schema. Therefore, data querying and analysis depend on a metadata system that must be efficient and comprehensive. However, metadata management in data lakes remains a current issue and the criteria for evaluating its effectiveness are more or less nonexistent. In this paper, we introduce MEDAL, a generic, graph-based model for metadata management in data lakes. We also propose evaluation criteria for data lake metadata systems through a list of expected features. Eventually, we show that our approach is more comprehensive than existing metadata systems.

Submitted 20 September, 2019; originally announced September 2019.

Journal ref: 1st International Workshop on BI and Big Data Applications (BBIGAP@ADBIS 2019), Sep 2019, Bled, Slovenia. pp.440-451

arXiv:1905.04037 [pdf, other]

Metadata Management for Textual Documents in Data Lakes

Authors: Pegdwendé Sawadogo, Tokio Kibata, Jérôme Darmont

Abstract: Data lakes have emerged as an alternative to data warehouses for the storage, exploration and analysis of big data. In a data lake, data are stored in a raw state and bear no explicit schema. Thence, an efficient metadata system is essential to avoid the data lake turning to a so-called data swamp. Existing works about managing data lake metadata mostly focus on structured and semi-structured data, with little research on unstructured data. Thus, we propose in this paper a methodological approach to build and manage a metadata system that is specific to textual documents in data lakes. First, we make an inventory of usual and meaningful metadata to extract. Then, we apply some specific techniques from the text mining and information retrieval domains to extract, store and reuse these metadata within the COREL research project, in order to validate our proposals.

Submitted 10 May, 2019; originally announced May 2019.

Journal ref: 21st International Conference on Enterprise Information Systems (ICEIS 2019), May 2019, Heraklion, Greece. pp.72-83

arXiv:1808.00197 [pdf, ps, other]

MaxMin Linear Initialization for Fuzzy C-Means

Authors: Aybükë Oztürk, Stéphane Lallich, Jérôme Darmont, Sylvie Yona Waksman

Abstract: Clustering is an extensive research area in data science. The aim of clustering is to discover groups and to identify interesting patterns in datasets. Crisp (hard) clustering considers that each data point belongs to one and only one cluster. However, it is inadequate as some data points may belong to several clusters, as is the case in text categorization. Thus, we need more flexible clustering. Fuzzy clustering methods, where each data point can belong to several clusters, are an interesting alternative. Yet, seeding iterative fuzzy algorithms to achieve high quality clustering is an issue. In this paper, we propose a new linear and efficient initialization algorithm MaxMin Linear to deal with this problem. Then, we validate our theoretical results through extensive experiments on a variety of numerical real-world and artificial datasets. We also test several validity indices, including a new validity index that we propose, Transformed Standardized Fuzzy Difference (TSFD).

Submitted 1 August, 2018; originally announced August 2018.

Journal ref: IBaI. 14th International Conference on Machine Learning and Data Mining (MLDM 2018), Jul 2018, New York, United States. Springer, Lecture Notes in Artificial Intelligence, 10934-10935, 2018, Machine Learning and Data Mining in Pattern Recognition. http://www.mldm.de

arXiv:1807.04035 [pdf, other]

Modeling Data Lake Metadata with a Data Vault

Authors: Iuri Nogueira, Maram Romdhane, Jérôme Darmont

Abstract: With the rise of big data, business intelligence had to find solutions for managing even greater data volumes and variety than in data warehouses, which proved ill-adapted. Data lakes answer these needs from a storage point of view, but require managing adequate metadata to guarantee an efficient access to data. Starting from a multidimensional metadata model designed for an industrial heritage data lake presenting a lack of schema evolutivity, we propose in this paper to use ensemble modeling, and more precisely a data vault, to address this issue. To illustrate the feasibility of this approach, we instantiate our metadata conceptual model into relational and document-oriented logical and physical models, respectively. We also compare the physical models in terms of metadata storage and query response time.

Submitted 11 July, 2018; originally announced July 2018.

Journal ref: 22nd International Database Engineering & Applications Symposium (IDEAS 2018), Jun 2018, Villa San Giovanni, Italy. ACM, pp.253-261, 2018, http://confsys.encs.concordia.ca/IDEAS/ideas18/ideas18.php

arXiv:1806.01552 [pdf, other]

doi 10.1007/978-3-319-92007-8

A Visual Quality Index for Fuzzy C-Means

Authors: Aybükë Oztürk, Stéphane Lallich, Jérôme Darmont

Abstract: Cluster analysis is widely used in the areas of machine learning and data mining. Fuzzy clustering is a particular method that considers that a data point can belong to more than one cluster. Fuzzy clustering helps obtain flexible clusters, as needed in such applications as text categorization. The performance of a clustering algorithm critically depends on the number of clusters, and estimating the optimal number of clusters is a challenging task. Quality indices help estimate the optimal number of clusters. However, there is no quality index that can obtain an accurate number of clusters for different datasets. Thence, in this paper, we propose a new cluster quality index associated with a visual, graph-based solution that helps choose the optimal number of clusters in fuzzy partitions. Moreover, we validate our theoretical results through extensive comparison experiments against state-of-the-art quality indices on a variety of numerical real-world and artificial datasets.

Submitted 5 June, 2018; originally announced June 2018.

Journal ref: 14th International Conference on Artificial Intelligence Applications and Innovations (AIAI 2018), May 2018, Rhodes, Greece. Springer, IFIP Advances in Information and Communication Technology, 519, pp.546-555, 2018, http://easyconferences.eu/aiai2018/

arXiv:1804.07525 [pdf, other]

doi 10.1016/j.future.2018.02.037

Benchmarking Top-K Keyword and Top-K Document Processing with T${}^2$K${}^2$ and T${}^2$K${}^2$D${}^2$

Authors: Ciprian-Octavian Truica, Jérôme Darmont, Alexandru Boicea, Florin Radulescu

Abstract: Top-k keyword and top-k document extraction are very popular text analysis techniques. Top-k keywords and documents are often computed on-the-fly, but they exploit weighted vocabularies that are costly to build. To compare competing weighting schemes and database implementations, benchmarking is customary. To the best of our knowledge, no benchmark currently addresses these problems. Hence, in this paper, we present T${}^2$K${}^2$, a top-k keywords and documents benchmark, and its decision support-oriented evolution T${}^2$K${}^2$D${}^2$. Both benchmarks feature a real tweet dataset and queries with various complexities and selectivities. They help evaluate weighting schemes and database implementations in terms of computing performance. To illustrate our benchmarks' relevance and genericity, we successfully ran performance tests on the TF-IDF and Okapi BM25 weighting schemes, on one hand, and on different relational (Oracle, PostgreSQL) and document-oriented (MongoDB) database implementations, on the other hand.

Submitted 20 April, 2018; originally announced April 2018.

Journal ref: Future Generation Computer Systems, Elsevier, 2018, 85, pp.60-75. https://www.sciencedirect.com/science/article/pii/S0167739X17323580

arXiv:1712.10155 [pdf, other]

doi 10.1007/s00778-017-0470-9

Secret Sharing for Cloud Data Security

Authors: Varunya Attasena, Jérôme Darmont, Nouria Harbi

Abstract: Cloud computing helps reduce costs, increase business agility and deploy solutions with a high return on investment for many types of applications. However, data security is of premium importance to many users and often restrains their adoption of cloud technologies. Various approaches, i.e., data encryption, anonymization, replication and verification, help enforce different facets of data security. Secret sharing is a particularly interesting cryptographic technique. Its most advanced variants indeed simultaneously enforce data privacy, availability and integrity, while allowing computation on encrypted data. The aim of this paper is thus to wholly survey secret sharing schemes with respect to data security, data access and costs in the pay-as-you-go paradigm.

Submitted 29 December, 2017; originally announced December 2017.

Journal ref: The International Journal on Very Large Databases, Springer-Verlag, 2017, 26 (5), pp.657-681

arXiv:1709.04747 [pdf, other]

doi 10.1007/978-3-319-67162-8

T${}^2$K${}^2$: The Twitter Top-K Keywords Benchmark

Authors: Ciprian-Octavian Truică, Jérôme Darmont

Abstract: Information retrieval from textual data focuses on the construction of vocabularies that contain weighted term tuples. Such vocabularies can then be exploited by various text analysis algorithms to extract new knowledge, e.g., top-k keywords, top-k documents, etc. Top-k keywords are casually used for various purposes, are often computed on-the-fly, and thus must be efficiently computed. To compare competing weighting schemes and database implementations, benchmarking is customary. To the best of our knowledge, no benchmark currently addresses these problems. Hence, in this paper, we present a top-k keywords benchmark, T${}^2$K${}^2$, which features a real tweet dataset and queries with various complexities and selectivities. T${}^2$K${}^2$ helps evaluate weighting schemes and database implementations in terms of computing performance. To illustrate T${}^2$K${}^2$'s relevance and genericity, we successfully performed tests on the TF-IDF and Okapi BM25 weighting schemes, on one hand, and on different relational (Oracle, PostgreSQL) and document-oriented (MongoDB) database implementations, on the other hand.

Submitted 14 September, 2017; originally announced September 2017.

Journal ref: 21st European Conference on Advances in Databases and Information Systems (ADBIS 2017), Sep 2017, Nicosia, Cyprus. Springer, Communications in Computer and Information Science, 767, pp.21-28, 2017, New Trends in Databases and Information Systems

arXiv:1708.09171 [pdf, other]

Enforcing Privacy in Cloud Databases

Authors: Somayeh Sobati Moghadam, Jérôme Darmont, Gérald Gavin

Abstract: Outsourcing databases, i.e., resorting to Database-as-a-Service (DBaaS), is nowadays a popular choice due to the elasticity, availability, scalability and pay-as-you-go features of cloud computing. However, most data are sensitive to some extent, and data privacy remains one of the top concerns to DBaaS users, for obvious legal and competitive reasons. In this paper, we survey the mechanisms that aim at making databases secure in a cloud environment, and discuss current pitfalls and related research challenges.

Submitted 30 August, 2017; originally announced August 2017.

Journal ref: 19th International Conference on Big Data Analytics and Knowledge Discovery (DaWaK 2017), Aug 2017, Lyon, France. Springer, Lecture Notes in Computer Science, 10440, pp.53-73, 2017

arXiv:1708.06574 [pdf, other]

S4: A New Secure Scheme for Enforcing Privacy in Cloud Data Warehouses

Authors: Somayeh Moghadam, Jérôme Darmont, Gérald Gavin

Abstract: Outsourcing data into the cloud becomes popular thanks to the pay-as-you-go paradigm. However, such practice raises privacy concerns. The conventional way to achieve data privacy is to encrypt sensitive data before outsourcing. When data are encrypted, a trade-off must be achieved between security and efficient query processing. Existing solutions that adopt multiple encryption schemes induce a heavy overhead in terms of data storage and query performance, and are not suited for cloud data warehouses. In this paper, we propose an efficient additive encryption scheme (S4) based on Shamir's secret sharing for securing data warehouses in the cloud. S4 addresses the shortcomings of existing approaches by reducing overhead while still enforcing good data privacy. Experimental results show the efficiency of S4 in terms of computation and storage overhead with respect to existing solutions.

Submitted 22 August, 2017; originally announced August 2017.

Journal ref: 7th International Conference on Information Systems and Technologies (ICIST 2017), Mar 2017, Dubai, United Arab Emirates. pp.9-16, 2017, Proceedings of the 7th International Conference on Information Systems and Technologies (ICIST 2017)

arXiv:1701.08643 [pdf]

Innovative Approaches for efficiently Warehousing Complex Data from the Web

Authors: Fadila Bentayeb, Nora Maïz, Hadj Mahboubi, Cécile Favre, Sabine Loudcher, Nouria Harbi, Omar Boussaïd, Jérôme Darmont

Abstract: Research in data warehousing and OLAP has produced important technologies for the design, management and use of information systems for decision support. With the development of the Internet, the availability of various types of data has increased. Thus, users require applications to help them obtain knowledge from the Web. One possible solution to facilitate this task is to extract information from the Web, transform and load it to a Web Warehouse, which provides uniform access methods for automatic processing of the data. In this chapter, we present three innovative researches recently introduced to extend the capabilities of decision support systems, namely (1) the use of XML as a logical and physical model for complex data warehouses, (2) associating data mining to OLAP to allow elaborated analysis tasks for complex data and (3) schema evolution in complex data warehouses for personalized analyses. Our contributions cover the main phases of the data warehouse design process: data integration and modeling and user driven-OLAP analysis.

Submitted 30 January, 2017; originally announced January 2017.

Comments: Business Intelligence Applications and the Web: Models, Systems and Technologies, Business Science Reference, 2011

arXiv:1701.08634 [pdf]

Data Processing Benchmarks

Authors: Jérôme Darmont

Abstract: The aim of this article is to present an overview of the major families of state-of-the-art data processing benchmarks, namely transaction processing benchmarks and decision support benchmarks. We also address the newer trends in cloud benchmarking. Finally, we discuss the issues, tradeoffs and future trends for data processing benchmarks.

Submitted 30 January, 2017; originally announced January 2017.

Comments: arXiv admin note: substantial text overlap with arXiv:1701.08052

Journal ref: Encyclopedia of Information Science and Technology, Third Edition, pp.146-152, 2014

arXiv:1701.08612 [pdf]

XML Warehousing and OLAP

Authors: Hadj Mahboubi, Marouane Hachicha, Jérôme Darmont

Abstract: The aim of this article is to present an overview of the major XML warehousing approaches from the literature, as well as the existing approaches for performing OLAP analyses over XML data (which is termed XML-OLAP or XOLAP; Wang et al., 2005). We also discuss the issues and future trends in this area and illustrate this topic by presenting the design of a unified, XML data warehouse architecture and a set of XOLAP operators expressed in an XML algebra.

Submitted 30 January, 2017; originally announced January 2017.

Comments: arXiv admin note: substantial text overlap with arXiv:1701.08033

Journal ref: Encyclopedia of Data Warehousing and Mining, Second Edition, IV, IGI Publishing, pp.2109-2116, 2009

arXiv:1701.08088 [pdf]

doi 10.4018/978-1-61692-016-6.ch014

Query Performance Optimization in XML Data Warehouses

Authors: Hadj Mahboubi, Jérôme Darmont

Abstract: XML data warehouses form an interesting basis for decision-support applications that exploit complex data. However, native-XML database management systems (DBMSs) currently bear limited performances and it is necessary to research for ways to optimize them. In this chapter, we present two such techniques. First, we propose a join index that is specifically adapted to the multidimensional architecture of XML warehouses. It eliminates join operations while preserving the information contained in the original warehouse. Second, we present a strategy for selecting XML materialized views by clustering the query workload. To validate these proposals, we measure the response time of a set of decision-support XQueries over an XML data warehouse, with and without using our optimization techniques. Our experimental results demonstrate their efficiency, even when queries are complex and data are voluminous.

Submitted 27 January, 2017; originally announced January 2017.

Comments: arXiv admin note: substantial text overlap with arXiv:0809.1981, arXiv:0809.1963

Journal ref: E-Strategies for Resource Management Systems: Planning and Implementation, IGI Global, pp.232-253, 2010

arXiv:1701.08054 [pdf]

doi 10.4018/978-1-60566-242-8.ch072

Indices in XML Databases

Authors: Hadj Mahboubi, Jérôme Darmont

Abstract: With XML becoming a standard for business information representation and exchange, storing, indexing, and querying XML documents have rapidly become major issues in database research. In this context, query processing and optimization are primordial, native-XML databases not being mature yet. Data structures such as indices, which help enhance performance substantially, are extensively researched, especially since XML data bear numerous specificities with respect to relational data. In this paper, we survey state-of-the-art XML indices and discuss the main issues, tradeoffs and future trends in XML indexing. We also present an index that we specifically designed for the particular architecture of XML data warehouses.

Submitted 27 January, 2017; originally announced January 2017.

Journal ref: Handbook of Research on Innovations in Database Technologies and Applications, II, IGI Global, pp.674-681, 2009

arXiv:1701.08053 [pdf]

doi 10.4018/978-1-60566-232-9.ch015

Data Warehouse Benchmarking with DWEB

Authors: Jérôme Darmont

Abstract: Performance evaluation is a key issue for designers and users of Database Management Systems (DBMSs). Performance is generally assessed with software benchmarks that help, e.g., test architectural choices, compare different technologies or tune a system. In the particular context of data warehousing and On-Line Analytical Processing (OLAP), although the Transaction Processing Performance Council (TPC) aims at issuing standard decision-support benchmarks, few benchmarks actually exist. We present in this chapter the Data Warehouse Engineering Benchmark (DWEB), which allows generating various ad-hoc synthetic data warehouses and workloads. DWEB is fully parameterized to fulfill various data warehouse design needs. However, two levels of parameterization keep it relatively easy to tune. We also expand on our previous work on DWEB by presenting its new Extract, Transform, and Load (ETL) feature as well as its new execution protocol. A Java implementation of DWEB is freely available on-line, which can be interfaced with most existing relational DBMSs. To the best of our knowledge, DWEB is the only easily available, up-to-date benchmark for data warehouses.

Submitted 27 January, 2017; originally announced January 2017.

Comments: arXiv admin note: substantial text overlap with arXiv:1701.00399; text overlap with arXiv:0705.1453

Journal ref: Advances in Data Warehousing and Mining, 3, IGI Publishing, pp.302-323, 2009, Progressive Methods in Data Warehousing and Business Intelligence: Concepts and Competitive Analytics

arXiv:1701.08052 [pdf]

Database Benchmarks

Authors: Jérôme Darmont

Abstract: The aim of this article is to present an overview of the major families of state-of-the-art database benchmarks, namely: relational benchmarks, object and object-relational benchmarks, XML benchmarks, and decision-support benchmarks, and to discuss the issues, tradeoffs and future trends in database benchmarking. We particularly focus on XML and decision-support benchmarks, which are currently the most innovative tools that are developed in this area.

Submitted 27 January, 2017; originally announced January 2017.

Journal ref: Encyclopedia of Information Science and Technology, Second Edition, IGI Publishing, pp.950-954, 2009

arXiv:1701.08033 [pdf]

X-WACoDa: An XML-based approach for Warehousing and Analyzing Complex Data

Authors: Hadj Mahboubi, Jean-Christian Ralaivao, Sabine Loudcher, Omar Boussaïd, Fadila Bentayeb, Jérôme Darmont

Abstract: Data warehousing and OLAP applications must nowadays handle complex data that are not only numerical or symbolic. The XML language is well-suited to logically and physically represent complex data. However, its usage induces new theoretical and practical challenges at the modeling, storage and analysis levels, and a new trend toward XML warehousing has been emerging for a couple of years. Unfortunately, no standard XML data warehouse architecture emerges. In this paper, we propose a unified XML warehouse reference model that synthesizes and enhances related work, and fits into a global XML warehousing and analysis approach we have developed. We also present a software platform that is based on this model, as well as a case study that illustrates its usage.

Submitted 27 January, 2017; originally announced January 2017.

Journal ref: Advances in Data Warehousing and Mining, 3, IGI Publishing, pp.38-54, 2009, Data Warehousing Design and Advanced Engineering Applications: Methods for Complex Construction

arXiv:1701.08029 [pdf]

Index and Materialized View Selection in Data Warehouses

Authors: Kamel Aouiche, Jérôme Darmont

Abstract: The aim of this article is to present an overview of the major families of state-of-the-art index and materialized view selection methods, and to discuss the issues and future trends in data warehouse performance optimization. We particularly focus on data mining-based heuristics we developed to reduce the selection problem complexity and target the most pertinent candidate indexes and materialized views.

Submitted 27 January, 2017; originally announced January 2017.

Journal ref: Handbook of Research on Innovations in Database Technologies and Applications, II, pp.693-700, 2009

arXiv:1701.08028 [pdf]

Biomedical Data Warehouses

Authors: Jérôme Darmont, Emerson Olivier

Abstract: The aim of this article is to present an overview of the existing biomedical data warehouses and to discuss the issues and future trends in this area. We illustrate this topic by presenting the design of an innovative, complex data warehouse for personal, anticipative medicine.

Submitted 27 January, 2017; originally announced January 2017.

Comments: arXiv admin note: substantial text overlap with arXiv:0809.2688

Journal ref: Encyclopaedia of Healthcare Information Systems, IGI Publishing, pp.149-156, 2008

arXiv:1701.07739 [pdf]

Object Database Benchmarks

Authors: Jerome Darmont

Abstract: The need for performance measurement tools appeared soon after the emergence of the first Object-Oriented Database Management Systems (OODBMSs), and proved important for both designers and users (Atkinson & Maier, 1990). Performance evaluation is useful to designers to determine elements of architecture and more generally to validate or refute hypotheses regarding the actual behavior of an OODBMS. Thus, performance evaluation is an essential component in the development process of well-designed and efficient systems. Users may also employ performance evaluation, either to compare the efficiency of different technologies before selecting an OODBMS or to tune a system. Performance evaluation by experimentation on a real system is generally referred to as benchmarking. It consists in performing a series of tests on a given OODBMS to estimate its performance in a given setting. Benchmarks are generally used to compare the global performance of OODBMSs, but they can also be exploited to illustrate the advantages of one system or another in a given situation, or to determine an optimal hardware configuration. Typically, a benchmark is constituted of two main elements: a workload model constituted of a database and a set of read and write operations to apply on this database, and a set of performance metrics.

Submitted 26 January, 2017; originally announced January 2017.

Journal ref: Encyclopedia of Information Science and Technology, I-III, Idea Group Publishing, pp.2146-2149, 2005

arXiv:1701.05449 [pdf]

doi 10.4018/ijdwm.2015040102

A Novel Multi-Secret Sharing Approach for Secure Data Warehousing and On-Line Analysis Processing in the Cloud

Authors: Varunya Attasena, Nouria Harbi, Jérôme Darmont

Abstract: Cloud computing helps reduce costs, increase business agility and deploy solutions with a high return on investment for many types of applications, including data warehouses and on-line analytical processing. However, storing and transferring sensitive data into the cloud raises legitimate security concerns. In this paper, we propose a new multi-secret sharing approach for deploying data warehouses in the cloud and allowing on-line analysis processing, while enforcing data privacy, integrity and availability. We first validate the relevance of our approach theoretically and then experimentally with both a simple random dataset and the Star Schema Benchmark. We also demonstrate its superiority to related methods.

Submitted 19 January, 2017; originally announced January 2017.

Journal ref: International Journal of Data Warehousing and Mining, 11 (2), pp.22 - 43 (2015)

arXiv:1701.05099 [pdf]

Cost Models for Selecting Materialized Views in Public Clouds

Authors: Romain Perriot, Jérémy Pfeifer, Laurent D'Orazio, Bruno Bachelet, Sandro Bimonte, Jérôme Darmont

Abstract: Data warehouse performance is usually achieved through physical data structures such as indexes or materialized views. In this context, cost models can help select a relevant set of such performance optimization structures. Nevertheless, selection becomes more complex in the cloud. The criterion to optimize is indeed at least two-dimensional, with monetary cost balancing overall query response time. This paper introduces new cost models that fit into the pay-as-you-go paradigm of cloud computing. Based on these cost models, an optimization problem is defined to discover, among candidate views, those to be materialized to minimize both the overall cost of using and maintaining the database in a public cloud and the total response time of a given query workload. We experimentally show that maintaining materialized views is always advantageous, both in terms of performance and cost.

Submitted 18 January, 2017; originally announced January 2017.

Journal ref: International Journal of Data Warehousing and Mining (JDWM), IGI Global, 2014, 10 (4), pp.1-25

arXiv:1701.04652 [pdf, ps, other]

doi 10.1109/TKDE.2011.209

A Survey of XML Tree Patterns

Authors: Marouane Hachicha, Jérôme Darmont

Abstract: With XML becoming a ubiquitous language for data interoperability purposes in various domains, efficiently querying XML data is a critical issue. This has led to the design of algebraic frameworks based on tree-shaped patterns akin to the tree-structured data model of XML. Tree patterns are graphic representations of queries over data trees. They are actually matched against an input data tree to answer a query. Since the turn of the twenty-first century, an astounding research effort has been focusing on tree pattern models and matching optimization (a primordial issue). This paper is a comprehensive survey of these topics, in which we outline and compare the various features of tree patterns. We also review and discuss the two main families of approaches for optimizing tree pattern matching, namely pattern tree minimization and holistic matching. We finally present actual tree pattern-based developments, to provide a global overview of this significant research topic.

Submitted 17 January, 2017; originally announced January 2017.

Journal ref: IEEE Transactions on Knowledge and Data Engineering, Institute of Electrical and Electronics Engineers, 2013, 25 (1), pp.29 - 46

arXiv:1701.02190 [pdf, ps, other]

doi 10.1504/IJBIDM.2009.029076

Fragmenting very large XML data warehouses via K-means clustering algorithm

Authors: Alfredo Cuzzocrea, Jérôme Darmont, Hadj Mahboubi

Abstract: XML data sources are increasingly gaining popularity in the context of a wide family of Business Intelligence (BI) and On-Line Analytical Processing (OLAP) applications, due to the amenities of XML in representing and managing semi-structured and complex multidimensional data. As a consequence, many XML data warehouse models have been proposed in recent years in order to handle heterogeneity and complexity of multidimensional data in a way traditional relational data warehouse approaches fail to achieve. However, XML-native database systems currently suffer from limited performance, both in terms of volumes of manageable data and query response time. Therefore, recent research efforts are focusing on fragmentation techniques, which are able to overcome the limitations above. Derived horizontal fragmentation is already used in relational data warehouses, and can definitely be adapted to the XML context. However, classical fragmentation algorithms are not suitable to control the number of originated fragments, which instead plays a critical role in data warehouses, and, with more emphasis, distributed data warehouse architectures. Inspired by this research challenge, in this paper we propose the use of K-means clustering algorithm for effectively and efficiently supporting the fragmentation of very large XML data warehouses, and, at the same time, completely controlling and determining the number of originated fragments via adequately setting the parameter K. We complete our analytical contribution by means of a comprehensive experimental assessment where we compare the efficiency of our proposed XML data warehouse fragmentation technique against those of classical derived horizontal fragmentation algorithms adapted to XML data warehouses.

Submitted 9 January, 2017; originally announced January 2017.

Journal ref: International Journal of Business Intelligence and Data Mining, Inderscience, 2009, 4 (3/4), pp.301-328

arXiv:1701.00400 [pdf]

doi 10.4018/jdm.2005040102

Evaluating the Dynamic Behavior of Database Applications

Authors: Zhen He, Jérôme Darmont

Abstract: This paper explores the effect that changing access patterns has on the performance of database management systems. Changes in access patterns play an important role in determining the efficiency of key performance optimization techniques, such as dynamic clustering, prefetching, and buffer replacement. However, all existing benchmarks or evaluation frameworks produce static access patterns in which objects are always accessed in the same order repeatedly. Hence, we have proposed the Dynamic Evaluation Framework (DEF) that simulates access pattern changes using configurable styles of change. DEF has been designed to be open and fully extensible (e.g., new access pattern change models can be added easily). In this paper, we instantiate DEF into the Dynamic object Evaluation Framework (DoEF) which is designed for object databases, i.e., object-oriented or object-relational databases such as multi-media databases or most XML databases. The capabilities of DoEF have been evaluated by simulating the execution of four different dynamic clustering algorithms. The results confirm our analysis that flexible conservative re-clustering is the key in determining a clustering algorithm's ability to adapt to changes in access pattern. These results show the effectiveness of DoEF at determining the adaptability of each dynamic clustering algorithm to changes in access pattern in a simulation environment. In a second set of experiments, we have used DoEF to compare the performance of two real-life object stores: Platypus and SHORE. DoEF has helped to reveal the poor swapping performance of Platypus.

Submitted 2 January, 2017; originally announced January 2017.

Comments: arXiv admin note: text overlap with arXiv:0705.1454

Journal ref: Journal of Database Management, IGI Global, 2005, 16 (2), pp.21 - 45

arXiv:1701.00399 [pdf]

doi 10.1504/IJBIDM.2007.012947

Benchmarking data warehouses

Authors: Jérôme Darmont, Fadila Bentayeb, Omar Boussaïd

Abstract: Data warehouse architectural choices and optimization techniques are critical to decision support query performance. To facilitate these choices, the performance of the designed data warehouse must be assessed, usually with benchmarks. These tools can either help system users compare the performance of different systems, or help system engineers test the effect of various design choices. While the Transaction Processing Performance Council's standard benchmarks address the first point, they are not tunable enough to address the second one and fail to model different data warehouse schemas. By contrast, our Data Warehouse Engineering Benchmark (DWEB) allows generating various ad-hoc synthetic data warehouses and workloads. DWEB is implemented as free Java software that can be interfaced with most existing relational database management systems. The full specifications of DWEB, as well as experiments we performed to illustrate how our benchmark may be used, are provided in this paper.

Submitted 2 January, 2017; originally announced January 2017.

Comments: arXiv admin note: text overlap with arXiv:0705.1453

Journal ref: International Journal of Business Intelligence and Data Mining, Inderscience, 2007, 2 (1), pp.79-104

arXiv:1701.00398 [pdf, ps, other]

Warehousing complex data from the Web

Authors: Omar Boussaid, Jerome Darmont, Fadila Bentayeb, Sabine Loudcher

Abstract: The data warehousing and OLAP technologies are now moving on to handling complex data that mostly originate from the Web. However, integrating such data into a decision-support process requires their representation in a form processable by OLAP and/or data mining techniques. We present in this paper a complex data warehousing methodology that exploits XML as a pivot language. Our approach includes the integration of complex data in an ODS, in the form of XML documents; their dimensional modeling and storage in an XML data warehouse; and their analysis with combined OLAP and data mining techniques. We also address the crucial issue of performance in XML warehouses.

Submitted 2 January, 2017; originally announced January 2017.

Journal ref: Int. J. Web Engineering and Technology, 2008, 4 (4), pp.408-433

Showing 1–50 of 93 results for author: Darmont, J