
AutoMLBench: A Comprehensive Experimental Evaluation of Automated Machine Learning Frameworks

Authors: Hassan Eldeeb, Mohamed Maher, Radwa Elshawi, Sherif Sakr

Abstract: With the booming demand for machine learning applications, it has been recognized that the number of knowledgeable data scientists cannot scale with the growing data volumes and application needs in our digital world. In response to this demand, several automated machine learning (AutoML) frameworks have been developed to fill the gap of human expertise by automating the process of building machine learning pipelines. Each framework comes with different heuristics-based design decisions. In this study, we present a comprehensive evaluation and comparison of the performance characteristics of six popular AutoML frameworks, namely, AutoWeka, AutoSKlearn, TPOT, Recipe, ATM, and SmartML, across 100 data sets from established AutoML benchmark suites. Our experimental evaluation considers different aspects for its comparison, including the performance impact of several design decisions, including time budget, size of search space, meta-learning, and ensemble construction. The results of our study reveal various interesting insights that can significantly guide and impact the design of AutoML frameworks.

Submitted 12 April, 2023; v1 submitted 18 April, 2022; originally announced April 2022.

arXiv:2108.13066 [pdf, other]

To tune or not to tune? An Approach for Recommending Important Hyperparameters

Authors: Mohamadjavad Bahmani, Radwa El Shawi, Nshan Potikyan, Sherif Sakr

Abstract: Novel technologies in automated machine learning ease the complexity of algorithm selection and hyperparameter optimization. Hyperparameters are important for machine learning models as they significantly influence the performance of machine learning models. Many optimization techniques have achieved notable success in hyperparameter tuning and surpassed the performance of human experts. However, depending on such techniques as blackbox algorithms can leave machine learning practitioners without insight into the relative importance of different hyperparameters. In this paper, we consider building the relationship between the performance of the machine learning models and their hyperparameters to discover the trend and gain insights, with empirical results based on six classifiers and 200 datasets. Our results enable users to decide whether it is worth conducting a possibly time-consuming tuning strategy, to focus on the most important hyperparameters, and to choose adequate hyperparameter spaces for tuning. The results of our experiments show that gradient boosting and Adaboost outperform other classifiers across 200 problems. However, they need tuning to boost their performance. Overall, the results obtained from this study provide a quantitative basis to focus efforts toward guided automated hyperparameter optimization and contribute toward the development of better-automated machine learning frameworks.

Submitted 30 August, 2021; originally announced August 2021.

Comments: Presented at the Fifth International Workshop on Automation in Machine Learning, a workshop held in conjunction with the KDD 2021 Conference

arXiv:2012.06171 [pdf, other]

doi 10.1145/3434642

The Future is Big Graphs! A Community View on Graph Processing Systems

Authors: Sherif Sakr, Angela Bonifati, Hannes Voigt, Alexandru Iosup, Khaled Ammar, Renzo Angles, Walid Aref, Marcelo Arenas, Maciej Besta, Peter A. Boncz, Khuzaima Daudjee, Emanuele Della Valle, Stefania Dumbrava, Olaf Hartig, Bernhard Haslhofer, Tim Hegeman, Jan Hidders, Katja Hose, Adriana Iamnitchi, Vasiliki Kalavri, Hugo Kapp, Wim Martens, M. Tamer Özsu, Eric Peukert, Stefan Plantikow , et al. (16 additional authors not shown)

Abstract: Graphs are by nature unifying abstractions that can leverage interconnectedness to represent, explore, predict, and explain real- and digital-world phenomena. Although real users and consumers of graph instances and graph workloads understand these abstractions, future problems will require new abstractions and systems. What needs to happen in the next decade for big graph processing to continue to succeed?

Submitted 11 December, 2020; originally announced December 2020.

Comments: 12 pages, 3 figures, collaboration between the large-scale systems and data management communities, work started at the Dagstuhl Seminar 19491 on Big Graph Processing Systems, to be published in the Communications of the ACM

ACM Class: C.3; E.0; H.2; J.0

arXiv:2001.07906 [pdf, ps, other]

Graph Generators: State of the Art and Open Challenges

Authors: Angela Bonifati, Irena Holubová, Arnau Prat-Pérez, Sherif Sakr

Abstract: The abundance of interconnected data has fueled the design and implementation of graph generators reproducing real-world linking properties, or gauging the effectiveness of graph algorithms, techniques and applications manipulating these data. We consider graph generation across multiple subfields, such as Semantic Web, graph databases, social networks, and community detection, along with general graphs. Despite the disparate requirements of modern graph generators throughout these communities, we analyze them under a common umbrella, reaching out the functionalities, the practical usage, and their supported operations. We argue that this classification is serving the need of providing scientists, researchers and practitioners with the right data generator at hand for their work. This survey provides a comprehensive overview of the state-of-the-art graph generators by focusing on those that are pertinent and suitable for several data-intensive tasks. Finally, we discuss open challenges and missing requirements of current graph generators along with their future extensions to new emerging fields.

Submitted 22 January, 2020; originally announced January 2020.

Comments: ACM Computing Surveys, 32 pages

arXiv:1906.02287 [pdf, other]

Automated Machine Learning: State-of-The-Art and Open Challenges

Authors: Radwa Elshawi, Mohamed Maher, Sherif Sakr

Abstract: With the continuous and vast increase in the amount of data in our digital world, it has been acknowledged that the number of knowledgeable data scientists cannot scale to address these challenges. Thus, there was a crucial need for automating the process of building good machine learning models. In the last few years, several techniques and frameworks have been introduced to tackle the challenge of automating the process of Combined Algorithm Selection and Hyper-parameter tuning (CASH) in the machine learning domain. The main aim of these techniques is to reduce the role of the human in the loop and fill the gap for non-expert machine learning users by playing the role of the domain expert. In this paper, we present a comprehensive survey for the state-of-the-art efforts in tackling the CASH problem. In addition, we highlight the research work of automating the other steps of the full complex machine learning pipeline (AutoML) from data understanding till model deployment. Furthermore, we provide comprehensive coverage for the various tools and frameworks that have been introduced in this domain. Finally, we discuss some of the research directions and open challenges that need to be addressed in order to achieve the vision and goals of the AutoML process.

Submitted 11 June, 2019; v1 submitted 5 June, 2019; originally announced June 2019.

arXiv:1709.07493 [pdf, other]

Big Data Systems Meet Machine Learning Challenges: Towards Big Data Science as a Service

Authors: Radwa Elshawi, Sherif Sakr

Abstract: Recently, we have been witnessing huge advancements in the scale of data we routinely generate and collect in pretty much everything we do, as well as our ability to exploit modern technologies to process, analyze and understand this data. The intersection of these trends is what is nowadays called Big Data Science. Cloud computing represents a practical and cost-effective solution for supporting Big Data storage, processing and for sophisticated analytics applications. We analyze in detail the building blocks of the software stack for supporting big data science as a commodity service for data scientists. We provide various insights about the latest ongoing developments and open challenges in this domain.

Submitted 21 September, 2017; originally announced September 2017.

arXiv:1702.08153 [pdf, other]

HPDedup: A Hybrid Prioritized Data Deduplication Mechanism for Primary Storage in the Cloud

Authors: Huijun Wu, Chen Wang, Yinjie Fu, Sherif Sakr, Liming Zhu, Kai Lu

Abstract: Eliminating duplicate data in primary storage of clouds increases the cost-efficiency of cloud service providers as well as reduces the cost of users for using cloud services. Existing primary deduplication techniques either use inline caching to exploit locality in primary workloads or use post-processing deduplication running in system idle time to avoid the negative impact on I/O performance. However, neither of them works well in the cloud servers running multiple services or applications for the following two reasons: Firstly, the temporal locality of duplicate data writes may not exist in some primary storage workloads thus inline caching often fails to achieve good deduplication ratio. Secondly, the post-processing deduplication allows duplicate data to be written into disks, therefore does not provide the benefit of I/O deduplication and requires high peak storage capacity. This paper presents HPDedup, a Hybrid Prioritized data Deduplication mechanism to deal with the storage system shared by applications running in co-located virtual machines or containers by fusing an inline and a post-processing process for exact deduplication. In the inline deduplication phase, HPDedup gives a fingerprint caching mechanism that estimates the temporal locality of duplicates in data streams from different VMs or applications and prioritizes the cache allocation for these streams based on the estimation. HPDedup also allows different deduplication threshold for streams based on their spatial locality to reduce the disk fragmentation. The post-processing phase removes duplicates whose fingerprints are not able to be cached due to the weak temporal locality from disks. Our experimental results show that HPDedup clearly outperforms the state-of-the-art primary storage deduplication techniques in terms of inline cache efficiency and primary deduplication efficiency.

Submitted 16 April, 2017; v1 submitted 27 February, 2017; originally announced February 2017.

Comments: 14 pages, 11 figures, submitted to MSST2017

arXiv:1302.2966 [pdf, other]

The Family of MapReduce and Large Scale Data Processing Systems

Authors: Sherif Sakr, Anna Liu, Ayman G. Fayoumi

Abstract: In the last two decades, the continuous increase of computational power has produced an overwhelming flow of data which has called for a paradigm shift in the computing architecture and large scale data processing mechanisms. MapReduce is a simple and powerful programming model that enables easy development of scalable parallel applications to process vast amounts of data on large clusters of commodity machines. It isolates the application from the details of running a distributed program such as issues on data distribution, scheduling and fault tolerance. However, the original implementation of the MapReduce framework had some limitations that have been tackled by many research efforts in several followup works after its introduction. This article provides a comprehensive survey for a family of approaches and mechanisms of large scale data processing mechanisms that have been implemented based on the original idea of the MapReduce framework and are currently gaining a lot of momentum in both research and industrial communities. We also cover a set of introduced systems that have been implemented to provide declarative programming interfaces on top of the MapReduce framework. In addition, we review several large scale data processing systems that resemble some of the ideas of the MapReduce framework for different purposes and application scenarios. Finally, we discuss some of the future research directions for implementing the next generation of MapReduce-like solutions.

Submitted 12 February, 2013; originally announced February 2013.

Comments: arXiv admin note: text overlap with arXiv:1105.4252 by other authors

arXiv:1211.5817 [pdf, ps, other]

Extending SPARQL to Support Entity Grouping and Path Queries

Authors: Seyed-Mehdi-Reza Beheshti, Sherif Sakr, Boualem Benatallah, Hamid Reza Motahari-Nezhad

Abstract: The ability to efficiently find relevant subgraphs and paths in a large graph to a given query is important in many applications including scientific data analysis, social networks, and business intelligence. Currently, there is little support and no efficient approaches for expressing and executing such queries. This paper proposes a data model and a query language to address this problem. The contributions include supporting the construction and selection of: (i) folder nodes, representing a set of related entities, and (ii) path nodes, representing a set of paths in which a path is the transitive relationship of two or more entities in the graph. Folders and paths can be stored and used for future queries. We introduce FPSPARQL, which is an extension of SPARQL supporting folder and path nodes. We have implemented a query engine that supports FPSPARQL and the evaluation results show its viability and efficiency for querying large graph datasets.

Submitted 21 November, 2012; originally announced November 2012.

Comments: 23 pages. arXiv admin note: text overlap with arXiv:1211.5009

Report number: UNSW-CSE-TR-1019

arXiv:1102.1064 [pdf, other]

A Decade of Database Research Publications

Authors: Sherif Sakr, Mohammad Alomari

Abstract: We analyze the database research publications of four major core database technology conferences (SIGMOD, VLDB, ICDE, EDBT), two main theoretical database conferences (PODS, ICDT) and three database journals (TODS, VLDB Journal, TKDE) over a period of 10 years (2001 - 2010). Our analysis considers only regular papers as we do not include short papers, demo papers, posters, tutorials or panels in our statistics. We rank the research scholars according to their number of publications in each conference/journal separately and combined. We also report on the growth in the number of research publications and the size of the research community in the last decade.

Submitted 5 February, 2011; originally announced February 2011.

arXiv:0806.0075 [pdf, other]

An Experimental Investigation of XML Compression Tools

Authors: Sherif Sakr

Abstract: This paper presents an extensive experimental study of the state-of-the-art of XML compression tools. The study reports the behavior of nine XML compressors using a large corpus of XML documents which covers the different natures and scales of XML documents. In addition to assessing and comparing the performance characteristics of the evaluated XML compression tools, the study tries to assess the effectiveness and practicality of using these tools in the real world. Finally, we provide some guidelines and recommendations which are useful for helping developers and users for making an effective decision for selecting the most suitable XML compression tool for their needs.

Submitted 31 May, 2008; originally announced June 2008.

Comments: http://xmlcompbench.sourceforge.net/

Showing 1–11 of 11 results for author: Sakr, S