arXiv:2403.19546

doi 10.1145/3650203.3663326

Croissant: A Metadata Format for ML-Ready Datasets

Authors: Mubashara Akhtar, Omar Benjelloun, Costanza Conforti, Pieter Gijsbers, Joan Giner-Miguelez, Nitisha Jain, Michael Kuchnik, Quentin Lhoest, Pierre Marcenac, Manil Maskey, Peter Mattson, Luis Oala, Pierre Ruyssen, Rajat Shinde, Elena Simperl, Goeffry Thomas, Slava Tykhonov, Joaquin Vanschoren, Jos van der Velde, Steffen Vogler, Carole-Jean Wu

Abstract: Data is a critical resource for Machine Learning (ML), yet working with data remains a key friction point. This paper introduces Croissant, a metadata format for datasets that simplifies how data is used by ML tools and frameworks. Croissant makes datasets more discoverable, portable and interoperable, thereby addressing significant challenges in ML data management and responsible AI. Croissant is already supported by several popular dataset repositories, spanning hundreds of thousands of datasets, ready to be loaded into the most popular ML frameworks.

Submitted 30 May, 2024; v1 submitted 28 March, 2024; originally announced March 2024.

Comments: Published in Proceedings of ACM SIGMOD/PODS'24 Data Management for End-to-End Machine Learning (DEEM) Workshop https://dl.acm.org/doi/10.1145/3650203.3663326

arXiv:2207.12560

AMLB: an AutoML Benchmark

Authors: Pieter Gijsbers, Marcos L. P. Bueno, Stefan Coors, Erin LeDell, Sébastien Poirier, Janek Thomas, Bernd Bischl, Joaquin Vanschoren

Abstract: Comparing different AutoML frameworks is notoriously challenging and often done incorrectly. We introduce an open and extensible benchmark that follows best practices and avoids common mistakes when comparing AutoML frameworks. We conduct a thorough comparison of 9 well-known AutoML frameworks across 71 classification and 33 regression tasks. The differences between the AutoML frameworks are explored with a multi-faceted analysis, evaluating model accuracy, its trade-offs with inference time, and framework failures. We also use Bradley-Terry trees to discover subsets of tasks where the relative AutoML framework rankings differ. The benchmark comes with an open-source tool that integrates with many AutoML frameworks and automates the empirical evaluation process end-to-end: from framework installation and resource allocation to in-depth evaluation. The benchmark uses public data sets, can be easily extended with other AutoML frameworks and tasks, and has a website with up-to-date results.

Submitted 16 November, 2023; v1 submitted 25 July, 2022; originally announced July 2022.

Comments: UNDER REVIEW: Revised submission to JMLR, with updated results from June 2023

arXiv:2106.05767

doi 10.1145/3449726.3459532

Meta-Learning for Symbolic Hyperparameter Defaults

Authors: Pieter Gijsbers, Florian Pfisterer, Jan N. van Rijn, Bernd Bischl, Joaquin Vanschoren

Abstract: Hyperparameter optimization in machine learning (ML) deals with the problem of empirically learning an optimal algorithm configuration from data, usually formulated as a black-box optimization problem. In this work, we propose a zero-shot method to meta-learn symbolic default hyperparameter configurations that are expressed in terms of the properties of the dataset. This enables a much faster, but still data-dependent, configuration of the ML algorithm, compared to standard hyperparameter optimization approaches. In the past, symbolic and static default values have usually been obtained as hand-crafted heuristics. We propose an approach of learning such symbolic configurations as formulas of dataset properties from a large set of prior evaluations on multiple datasets by optimizing over a grammar of expressions using an evolutionary algorithm. We evaluate our method on surrogate empirical performance models as well as on real data across 6 ML algorithms on more than 100 datasets and demonstrate that our method indeed finds viable symbolic defaults.

Submitted 11 June, 2021; v1 submitted 10 June, 2021; originally announced June 2021.

Comments: Pieter Gijsbers and Florian Pfisterer contributed equally to the paper. V1: Two page GECCO poster paper accepted at GECCO 2021. V2: The original full length paper (8 pages) with appendix

arXiv:2007.04911

doi 10.1007/978-3-030-67670-4_39

GAMA: a General Automated Machine learning Assistant

Authors: Pieter Gijsbers, Joaquin Vanschoren

Abstract: The General Automated Machine learning Assistant (GAMA) is a modular AutoML system developed to empower users to track and control how AutoML algorithms search for optimal machine learning pipelines, and facilitate AutoML research itself. In contrast to current, often black-box systems, GAMA allows users to plug in different AutoML and post-processing techniques, logs and visualizes the search process, and supports easy benchmarking. It currently features three AutoML search algorithms, two model post-processing steps, and is designed to allow for more components to be added.

Submitted 7 October, 2021; v1 submitted 9 July, 2020; originally announced July 2020.

Comments: Accepted at ECML-PKDD 2020 Demo Track

Journal ref: Lecture Notes in Computer Science, vol 12461 (2021). p560-564

arXiv:1911.02490

OpenML-Python: an extensible Python API for OpenML

Authors: Matthias Feurer, Jan N. van Rijn, Arlind Kadra, Pieter Gijsbers, Neeratyoy Mallik, Sahithya Ravi, Andreas Müller, Joaquin Vanschoren, Frank Hutter

Abstract: OpenML is an online platform for open science collaboration in machine learning, used to share datasets and results of machine learning experiments. In this paper we introduce OpenML-Python, a client API for Python, opening up the OpenML platform for a wide range of Python-based tools. It provides easy access to all datasets, tasks and experiments on OpenML from within Python. It also provides functionality to conduct machine learning experiments, upload the results to OpenML, and reproduce results which are stored on OpenML. Furthermore, it comes with a scikit-learn plugin and a plugin mechanism to easily integrate other machine learning libraries written in Python into the OpenML ecosystem. Source code and documentation is available at https://github.com/openml/openml-python/.

Submitted 23 June, 2021; v1 submitted 6 November, 2019; originally announced November 2019.

Journal ref: Journal of Machine Learning Research 22(100), 2021

arXiv:1907.00909

An Open Source AutoML Benchmark

Authors: Pieter Gijsbers, Erin LeDell, Janek Thomas, Sébastien Poirier, Bernd Bischl, Joaquin Vanschoren

Abstract: In recent years, an active field of research has developed around automated machine learning (AutoML). Unfortunately, comparing different AutoML systems is hard and often done incorrectly. We introduce an open, ongoing, and extensible benchmark framework which follows best practices and avoids common mistakes. The framework is open-source, uses public datasets and has a website with up-to-date results. We use the framework to conduct a thorough comparison of 4 AutoML systems across 39 datasets and analyze the results.

Submitted 1 July, 2019; originally announced July 2019.

Comments: Accepted paper at the AutoML Workshop at ICML 2019. Code: https://github.com/openml/automlbenchmark/ Accompanying website: https://openml.github.io/automlbenchmark/

arXiv:1801.06007

Layered TPOT: Speeding up Tree-based Pipeline Optimization

Authors: Pieter Gijsbers, Joaquin Vanschoren, Randal S. Olson

Abstract: With the demand for machine learning increasing, so does the demand for tools which make it easier to use. Automated machine learning (AutoML) tools have been developed to address this need, such as the Tree-Based Pipeline Optimization Tool (TPOT) which uses genetic programming to build optimal pipelines. We introduce Layered TPOT, a modification to TPOT which aims to create pipelines equally good as the original, but in significantly less time. This approach evaluates candidate pipelines on increasingly large subsets of the data according to their fitness, using a modified evolutionary algorithm to allow for separate competition between pipelines trained on different sample sizes. Empirical evaluation shows that, on sufficiently large datasets, Layered TPOT indeed finds better models faster.

Submitted 12 March, 2018; v1 submitted 18 January, 2018; originally announced January 2018.

Comments: Update to include a reference to Zutty et al. after it was brought to our attention

arXiv:1708.03731

OpenML Benchmarking Suites

Authors: Bernd Bischl, Giuseppe Casalicchio, Matthias Feurer, Pieter Gijsbers, Frank Hutter, Michel Lang, Rafael G. Mantovani, Jan N. van Rijn, Joaquin Vanschoren

Abstract: Machine learning research depends on objectively interpretable, comparable, and reproducible algorithm benchmarks. We advocate the use of curated, comprehensive suites of machine learning tasks to standardize the setup, execution, and reporting of benchmarks. We enable this through software tools that help to create and leverage these benchmarking suites. These are seamlessly integrated into the OpenML platform, and accessible through interfaces in Python, Java, and R. OpenML benchmarking suites (a) are easy to use through standardized data formats, APIs, and client libraries; (b) come with extensive meta-information on the included datasets; and (c) allow benchmarks to be shared and reused in future studies. We then present a first, carefully curated and practical benchmarking suite for classification: the OpenML Curated Classification benchmarking suite 2018 (OpenML-CC18). Finally, we discuss use cases and applications which demonstrate the usefulness of OpenML benchmarking suites and the OpenML-CC18 in particular.

Submitted 22 November, 2021; v1 submitted 11 August, 2017; originally announced August 2017.

Comments: Accepted for publication in the Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks (NeurIPS 2021)

Journal ref: Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (2021)

Showing 1–8 of 8 results for author: Gijsbers, P