Search | arXiv e-print repository

The Future of Fundamental Science Led by Generative Closed-Loop Artificial Intelligence

Authors: Hector Zenil, Jesper Tegnér, Felipe S. Abrahão, Alexander Lavin, Vipin Kumar, Jeremy G. Frey, Adrian Weller, Larisa Soldatova, Alan R. Bundy, Nicholas R. Jennings, Koichi Takahashi, Lawrence Hunter, Saso Dzeroski, Andrew Briggs, Frederick D. Gregory, Carla P. Gomes, Jon Rowe, James Evans, Hiroaki Kitano, Ross King

Abstract: Recent advances in machine learning and AI, including Generative AI and LLMs, are disrupting technological innovation, product development, and society as a whole. AI's contribution to technology can come from multiple approaches that require access to large training data sets and clear performance evaluation criteria, ranging from pattern recognition and classification to generative models. Yet,… ▽ More Recent advances in machine learning and AI, including Generative AI and LLMs, are disrupting technological innovation, product development, and society as a whole. AI's contribution to technology can come from multiple approaches that require access to large training data sets and clear performance evaluation criteria, ranging from pattern recognition and classification to generative models. Yet, AI has contributed less to fundamental science in part because large data sets of high-quality data for scientific practice and model discovery are more difficult to access. Generative AI, in general, and Large Language Models in particular, may represent an opportunity to augment and accelerate the scientific discovery of fundamental deep science with quantitative models. Here we explore and investigate aspects of an AI-driven, automated, closed-loop approach to scientific discovery, including self-driven hypothesis generation and open-ended autonomous exploration of the hypothesis space. Integrating AI-driven automation into the practice of science would mitigate current problems, including the replication of findings, systematic production of data, and ultimately democratisation of the scientific process. Realising these possibilities requires a vision for augmented AI coupled with a diversity of AI approaches able to deal with fundamental aspects of causality analysis and model discovery while enabling unbiased search across the space of putative explanations. These advances hold the promise to unleash AI's potential for searching and discovering the fundamental structure of our world beyond what human scientists have been able to achieve. Such a vision would push the boundaries of new fundamental science rather than automatize current workflows and instead open doors for technological innovation to tackle some of the greatest challenges facing humanity today. △ Less

Submitted 29 August, 2023; v1 submitted 9 July, 2023; originally announced July 2023.

Comments: 35 pages, first draft of the final report from the Alan Turing Institute on AI for Scientific Discovery

arXiv:2306.17585 [pdf, other]

doi 10.1145/3583133.3590697

Comparing Algorithm Selection Approaches on Black-Box Optimization Problems

Authors: Ana Kostovska, Anja Jankovic, Diederick Vermetten, Sašo Džeroski, Tome Eftimov, Carola Doerr

Abstract: Performance complementarity of solvers available to tackle black-box optimization problems gives rise to the important task of algorithm selection (AS). Automated AS approaches can help replace tedious and labor-intensive manual selection, and have already shown promising performance in various optimization domains. Automated AS relies on machine learning (ML) techniques to recommend the best algo… ▽ More Performance complementarity of solvers available to tackle black-box optimization problems gives rise to the important task of algorithm selection (AS). Automated AS approaches can help replace tedious and labor-intensive manual selection, and have already shown promising performance in various optimization domains. Automated AS relies on machine learning (ML) techniques to recommend the best algorithm given the information about the problem instance. Unfortunately, there are no clear guidelines for choosing the most appropriate one from a variety of ML techniques. Tree-based models such as Random Forest or XGBoost have consistently demonstrated outstanding performance for automated AS. Transformers and other tabular deep learning models have also been increasingly applied in this context. We investigate in this work the impact of the choice of the ML technique on AS performance. We compare four ML models on the task of predicting the best solver for the BBOB problems for 7 different runtime budgets in 2 dimensions. While our results confirm that a per-instance AS has indeed impressive potential, we also show that the particular choice of the ML technique is of much minor importance. △ Less

Submitted 30 June, 2023; originally announced June 2023.

Comments: To appear in the Companion Proceedings of GECCO 2023 as poster paper

arXiv:2306.00479 [pdf, other]

doi 10.1145/3583131.3590424

Algorithm Instance Footprint: Separating Easily Solvable and Challenging Problem Instances

Authors: Ana Nikolikj, Sašo Džeroski, Mario Andrés Muñoz, Carola Doerr, Peter Korošec, Tome Eftimov

Abstract: In black-box optimization, it is essential to understand why an algorithm instance works on a set of problem instances while failing on others and provide explanations of its behavior. We propose a methodology for formulating an algorithm instance footprint that consists of a set of problem instances that are easy to be solved and a set of problem instances that are difficult to be solved, for an… ▽ More In black-box optimization, it is essential to understand why an algorithm instance works on a set of problem instances while failing on others and provide explanations of its behavior. We propose a methodology for formulating an algorithm instance footprint that consists of a set of problem instances that are easy to be solved and a set of problem instances that are difficult to be solved, for an algorithm instance. This behavior of the algorithm instance is further linked to the landscape properties of the problem instances to provide explanations of which properties make some problem instances easy or challenging. The proposed methodology uses meta-representations that embed the landscape properties of the problem instances and the performance of the algorithm into the same vector space. These meta-representations are obtained by training a supervised machine learning regression model for algorithm performance prediction and applying model explainability techniques to assess the importance of the landscape features to the performance predictions. Next, deterministic clustering of the meta-representations demonstrates that using them captures algorithm performance across the space and detects regions of poor and good algorithm performance, together with an explanation of which landscape properties are leading to it. △ Less

Submitted 1 June, 2023; originally announced June 2023.

Comments: To appear at GECCO 2023

arXiv:2305.08413 [pdf, other]

Artificial intelligence to advance Earth observation: a perspective

Authors: Devis Tuia, Konrad Schindler, Begüm Demir, Gustau Camps-Valls, Xiao Xiang Zhu, Mrinalini Kochupillai, Sašo Džeroski, Jan N. van Rijn, Holger H. Hoos, Fabio Del Frate, Mihai Datcu, Jorge-Arnulfo Quiané-Ruiz, Volker Markl, Bertrand Le Saux, Rochelle Schneider

Abstract: Earth observation (EO) is a prime instrument for monitoring land and ocean processes, studying the dynamics at work, and taking the pulse of our planet. This article gives a bird's eye view of the essential scientific tools and approaches informing and supporting the transition from raw EO data to usable EO-based information. The promises, as well as the current challenges of these developments, a… ▽ More Earth observation (EO) is a prime instrument for monitoring land and ocean processes, studying the dynamics at work, and taking the pulse of our planet. This article gives a bird's eye view of the essential scientific tools and approaches informing and supporting the transition from raw EO data to usable EO-based information. The promises, as well as the current challenges of these developments, are highlighted under dedicated sections. Specifically, we cover the impact of (i) Computer vision; (ii) Machine learning; (iii) Advanced processing and computing; (iv) Knowledge-based AI; (v) Explainable AI and causal inference; (vi) Physics-aware models; (vii) User-centric approaches; and (viii) the much-needed discussion of ethical and societal issues related to the massive use of ML technologies in EO. △ Less

Submitted 15 May, 2023; originally announced May 2023.

arXiv:2305.02251 [pdf]

Automated Scientific Discovery: From Equation Discovery to Autonomous Discovery Systems

Authors: Stefan Kramer, Mattia Cerrato, Sašo Džeroski, Ross King

Abstract: The paper surveys automated scientific discovery, from equation discovery and symbolic regression to autonomous discovery systems and agents. It discusses the individual approaches from a "big picture" perspective and in context, but also discusses open issues and recent topics like the various roles of deep neural networks in this area, aiding in the discovery of human-interpretable knowledge. Fu… ▽ More The paper surveys automated scientific discovery, from equation discovery and symbolic regression to autonomous discovery systems and agents. It discusses the individual approaches from a "big picture" perspective and in context, but also discusses open issues and recent topics like the various roles of deep neural networks in this area, aiding in the discovery of human-interpretable knowledge. Further, we will present closed-loop scientific discovery systems, starting with the pioneering work on the Adam system up to current efforts in fields from material science to astronomy. Finally, we will elaborate on autonomy from a machine learning perspective, but also in analogy to the autonomy levels in autonomous driving. The maximal level, level five, is defined to require no human intervention at all in the production of scientific knowledge. Achieving this is one step towards solving the Nobel Turing Grand Challenge to develop AI Scientists: AI systems capable of making Nobel-quality scientific discoveries highly autonomously at a level comparable, and possibly superior, to the best human scientists by 2050. △ Less

Submitted 3 May, 2023; originally announced May 2023.

arXiv:2302.09893 [pdf, other]

doi 10.1007/s10994-023-06400-2

Efficient Generator of Mathematical Expressions for Symbolic Regression

Authors: Sebastian Mežnar, Sašo Džeroski, Ljupčo Todorovski

Abstract: We propose an approach to symbolic regression based on a novel variational autoencoder for generating hierarchical structures, HVAE. It combines simple atomic units with shared weights to recursively encode and decode the individual nodes in the hierarchy. Encoding is performed bottom-up and decoding top-down. We empirically show that HVAE can be trained efficiently with small corpora of mathemati… ▽ More We propose an approach to symbolic regression based on a novel variational autoencoder for generating hierarchical structures, HVAE. It combines simple atomic units with shared weights to recursively encode and decode the individual nodes in the hierarchy. Encoding is performed bottom-up and decoding top-down. We empirically show that HVAE can be trained efficiently with small corpora of mathematical expressions and can accurately encode expressions into a smooth low-dimensional latent space. The latter can be efficiently explored with various optimization methods to address the task of symbolic regression. Indeed, random search through the latent space of HVAE performs better than random search through expressions generated by manually crafted probabilistic grammars for mathematical expressions. Finally, EDHiE system for symbolic regression, which applies an evolutionary algorithm to the latent space of HVAE, reconstructs equations from a standard symbolic regression benchmark better than a state-of-the-art system based on a similar combination of deep learning and evolutionary algorithms.ž △ Less

Submitted 10 September, 2023; v1 submitted 20 February, 2023; originally announced February 2023.

Comments: 35 pages, 11 tables, 7 multi-part figures, Machine learning (Springer) and journal track of ECML/PKDD 2023

ACM Class: I.2.0; I.2.6

Journal ref: Mach Learn (2023)

arXiv:2301.09876 [pdf, other]

Using Knowledge Graphs for Performance Prediction of Modular Optimization Algorithms

Authors: Ana Kostovska, Diederick Vermetten, Sašo Džeroski, Panče Panov, Tome Eftimov, Carola Doerr

Abstract: Empirical data plays an important role in evolutionary computation research. To make better use of the available data, ontologies have been proposed in the literature to organize their storage in a structured way. However, the full potential of these formal methods to capture our domain knowledge has yet to be demonstrated. In this work, we evaluate a performance prediction model built on top of t… ▽ More Empirical data plays an important role in evolutionary computation research. To make better use of the available data, ontologies have been proposed in the literature to organize their storage in a structured way. However, the full potential of these formal methods to capture our domain knowledge has yet to be demonstrated. In this work, we evaluate a performance prediction model built on top of the extension of the recently proposed OPTION ontology. More specifically, we first extend the OPTION ontology with the vocabulary needed to represent modular black-box optimization algorithms. Then, we use the extended OPTION ontology, to create knowledge graphs with fixed-budget performance data for two modular algorithm frameworks, modCMA, and modDE, for the 24 noiseless BBOB benchmark functions. We build the performance prediction model using a knowledge graph embedding-based methodology. Using a number of different evaluation scenarios, we show that a triple classification approach, a fairly standard predictive modeling task in the context of knowledge graphs, can correctly predict whether a given algorithm instance will be able to achieve a certain target precision for a given problem instance. This approach requires feature representation of algorithms and problems. While the latter is already well developed, we hope that our work will motivate the community to collaborate on appropriate algorithm representations. △ Less

Submitted 24 January, 2023; originally announced January 2023.

Comments: To appear at EvoApps 2023

arXiv:2211.12757 [pdf, other]

FAIRification of MLC data

Authors: Ana Kostovska, Jasmin Bogatinovski, Andrej Treven, Sašo Džeroski, Dragi Kocev, Panče Panov

Abstract: The multi-label classification (MLC) task has increasingly been receiving interest from the machine learning (ML) community, as evidenced by the growing number of papers and methods that appear in the literature. Hence, ensuring proper, correct, robust, and trustworthy benchmarking is of utmost importance for the further development of the field. We believe that this can be achieved by adhering to… ▽ More The multi-label classification (MLC) task has increasingly been receiving interest from the machine learning (ML) community, as evidenced by the growing number of papers and methods that appear in the literature. Hence, ensuring proper, correct, robust, and trustworthy benchmarking is of utmost importance for the further development of the field. We believe that this can be achieved by adhering to the recently emerged data management standards, such as the FAIR (Findable, Accessible, Interoperable, and Reusable) and TRUST (Transparency, Responsibility, User focus, Sustainability, and Technology) principles. To FAIRify the MLC datasets, we introduce an ontology-based online catalogue of MLC datasets that follow these principles. The catalogue extensively describes many MLC datasets with comprehensible meta-features, MLC-specific semantic descriptions, and different data provenance information. The MLC data catalogue is extensively described in our recent publication in Nature Scientific Reports, Kostovska & Bogatinovski et al., and available at: http://semantichub.ijs.si/MLCdatasets. In addition, we provide an ontology-based system for easy access and querying of performance/benchmark data obtained from a comprehensive MLC benchmark study. The system is available at: http://semantichub.ijs.si/MLCbenchmark. △ Less

Submitted 23 November, 2022; originally announced November 2022.

Comments: This paper was accepted ECML PKDD 2022

arXiv:2211.11332 [pdf, other]

OPTION: OPTImization Algorithm Benchmarking ONtology

Authors: Ana Kostovska, Diederick Vermetten, Carola Doerr, Saso Džeroski, Panče Panov, Tome Eftimov

Abstract: Many optimization algorithm benchmarking platforms allow users to share their experimental data to promote reproducible and reusable research. However, different platforms use different data models and formats, which drastically complicates the identification of relevant datasets, their interpretation, and their interoperability. Therefore, a semantically rich, ontology-based, machine-readable dat… ▽ More Many optimization algorithm benchmarking platforms allow users to share their experimental data to promote reproducible and reusable research. However, different platforms use different data models and formats, which drastically complicates the identification of relevant datasets, their interpretation, and their interoperability. Therefore, a semantically rich, ontology-based, machine-readable data model that can be used by different platforms is highly desirable. In this paper, we report on the development of such an ontology, which we call OPTION (OPTImization algorithm benchmarking ONtology). Our ontology provides the vocabulary needed for semantic annotation of the core entities involved in the benchmarking process, such as algorithms, problems, and evaluation measures. It also provides means for automatic data integration, improved interoperability, and powerful querying capabilities, thereby increasing the value of the benchmarking data. We demonstrate the utility of OPTION, by annotating and querying a corpus of benchmark performance data from the BBOB collection of the COCO framework and from the Yet Another Black-Box Optimization Benchmark (YABBOB) family of the Nevergrad environment. In addition, we integrate features of the BBOB functional performance landscape into the OPTION knowledge base using publicly available datasets with exploratory landscape analysis. Finally, we integrate the OPTION knowledge base into the IOHprofiler environment and provide users with the ability to perform meta-analysis of performance data. △ Less

Submitted 21 November, 2022; originally announced November 2022.

arXiv:2211.11227 [pdf, other]

Explainable Model-specific Algorithm Selection for Multi-Label Classification

Authors: Ana Kostovska, Carola Doerr, Sašo Džeroski, Dragi Kocev, Panče Panov, Tome Eftimov

Abstract: Multi-label classification (MLC) is an ML task of predictive modeling in which a data instance can simultaneously belong to multiple classes. MLC is increasingly gaining interest in different application domains such as text mining, computer vision, and bioinformatics. Several MLC algorithms have been proposed in the literature, resulting in a meta-optimization problem that the user needs to addre… ▽ More Multi-label classification (MLC) is an ML task of predictive modeling in which a data instance can simultaneously belong to multiple classes. MLC is increasingly gaining interest in different application domains such as text mining, computer vision, and bioinformatics. Several MLC algorithms have been proposed in the literature, resulting in a meta-optimization problem that the user needs to address: which MLC approach to select for a given dataset? To address this algorithm selection problem, we investigate in this work the quality of an automated approach that uses characteristics of the datasets - so-called features - and a trained algorithm selector to choose which algorithm to apply for a given task. For our empirical evaluation, we use a portfolio of 38 datasets. We consider eight MLC algorithms, whose quality we evaluate using six different performance metrics. We show that our automated algorithm selector outperforms any of the single MLC algorithms, and this is for all evaluated performance measures. Our selection approach is explainable, a characteristic that we exploit to investigate which meta-features have the largest influence on the decisions made by the algorithm selector. Finally, we also quantify the importance of the most significant meta-features for various domains. △ Less

Submitted 21 November, 2022; originally announced November 2022.

arXiv:2207.09237 [pdf, other]

Semi-supervised Predictive Clustering Trees for (Hierarchical) Multi-label Classification

Authors: Jurica Levatić, Michelangelo Ceci, Dragi Kocev, Sašo Džeroski

Abstract: Semi-supervised learning (SSL) is a common approach to learning predictive models using not only labeled examples, but also unlabeled examples. While SSL for the simple tasks of classification and regression has received a lot of attention from the research community, this is not properly investigated for complex prediction tasks with structurally dependent variables. This is the case of multi-lab… ▽ More Semi-supervised learning (SSL) is a common approach to learning predictive models using not only labeled examples, but also unlabeled examples. While SSL for the simple tasks of classification and regression has received a lot of attention from the research community, this is not properly investigated for complex prediction tasks with structurally dependent variables. This is the case of multi-label classification and hierarchical multi-label classification tasks, which may require additional information, possibly coming from the underlying distribution in the descriptive space provided by unlabeled examples, to better face the challenging task of predicting simultaneously multiple class labels. In this paper, we investigate this aspect and propose a (hierarchical) multi-label classification method based on semi-supervised learning of predictive clustering trees. We also extend the method towards ensemble learning and propose a method based on the random forest approach. Extensive experimental evaluation conducted on 23 datasets shows significant advantages of the proposed method and its extension with respect to their supervised counterparts. Moreover, the method preserves interpretability and reduces the time complexity of classical tree-based models. △ Less

Submitted 30 March, 2024; v1 submitted 19 July, 2022; originally announced July 2022.

arXiv:2204.07431 [pdf, other]

The Importance of Landscape Features for Performance Prediction of Modular CMA-ES Variants

Authors: Ana Kostovska, Diederick Vermetten, Sašo Džeroski, Carola Doerr, Peter Korošec, Tome Eftimov

Abstract: Selecting the most suitable algorithm and determining its hyperparameters for a given optimization problem is a challenging task. Accurately predicting how well a certain algorithm could solve the problem is hence desirable. Recent studies in single-objective numerical optimization show that supervised machine learning methods can predict algorithm performance using landscape features extracted fr… ▽ More Selecting the most suitable algorithm and determining its hyperparameters for a given optimization problem is a challenging task. Accurately predicting how well a certain algorithm could solve the problem is hence desirable. Recent studies in single-objective numerical optimization show that supervised machine learning methods can predict algorithm performance using landscape features extracted from the problem instances. Existing approaches typically treat the algorithms as black-boxes, without consideration of their characteristics. To investigate in this work if a selection of landscape features that depends on algorithms properties could further improve regression accuracy, we regard the modular CMA-ES framework and estimate how much each landscape feature contributes to the best algorithm performance regression models. Exploratory data analysis performed on this data indicate that the set of most relevant features does not depend on the configuration of individual modules, but the influence that these features have on regression accuracy does. In addition, we have shown that by using classifiers that take the features relevance on the model accuracy, we are able to predict the status of individual modules in the CMA-ES configurations. △ Less

Submitted 15 April, 2022; originally announced April 2022.

arXiv:2203.02360 [pdf, other]

doi 10.48550/arXiv.2203.02360

Boosting the Performance of Quantum Annealers using Machine Learning

Authors: Jure Brence, Dragan Mihailović, Viktor Kabanov, Ljupčo Todorovski, Sašo Džeroski, Jaka Vodeb

Abstract: Noisy intermediate-scale quantum (NISQ) devices are spearheading the second quantum revolution. Of these, quantum annealers are the only ones currently offering real world, commercial applications on as many as 5000 qubits. The size of problems that can be solved by quantum annealers is limited mainly by errors caused by environmental noise and intrinsic imperfections of the processor. We address… ▽ More Noisy intermediate-scale quantum (NISQ) devices are spearheading the second quantum revolution. Of these, quantum annealers are the only ones currently offering real world, commercial applications on as many as 5000 qubits. The size of problems that can be solved by quantum annealers is limited mainly by errors caused by environmental noise and intrinsic imperfections of the processor. We address the issue of intrinsic imperfections with a novel error correction approach, based on machine learning methods. Our approach adjusts the input Hamiltonian to maximize the probability of finding the solution. In our experiments, the proposed error correction method improved the performance of annealing by up to three orders of magnitude and enabled the solving of a previously intractable, maximally complex problem. △ Less

Submitted 7 March, 2022; v1 submitted 4 March, 2022; originally announced March 2022.

arXiv:2111.13273 [pdf, other]

doi 10.1007/978-3-030-88942-5_26

Unsupervised Feature Ranking via Attribute Networks

Authors: Urh Primožič, Blaž Škrlj, Sašo Džeroski, Matej Petković

Abstract: The need for learning from unlabeled data is increasing in contemporary machine learning. Methods for unsupervised feature ranking, which identify the most important features in such data are thus gaining attention, and so are their applications in studying high throughput biological experiments or user bases for recommender systems. We propose FRANe (Feature Ranking via Attribute Networks), an un… ▽ More The need for learning from unlabeled data is increasing in contemporary machine learning. Methods for unsupervised feature ranking, which identify the most important features in such data are thus gaining attention, and so are their applications in studying high throughput biological experiments or user bases for recommender systems. We propose FRANe (Feature Ranking via Attribute Networks), an unsupervised algorithm capable of finding key features in given unlabeled data set. FRANe is based on ideas from network reconstruction and network analysis. FRANe performs better than state-of-the-art competitors, as we empirically demonstrate on a large collection of benchmarks. Moreover, we provide the time complexity analysis of FRANe further demonstrating its scalability. Finally, FRANe offers as the result the interpretable relational structures used to derive the feature importances. △ Less

Submitted 25 November, 2021; originally announced November 2021.

Comments: for online material and python package see https://github.com/FRANe-team/FRANe

Journal ref: Soares C., Torgo L. (eds) Discovery Science. DS 2021. Lecture Notes in Computer Science, vol 12986. Springer, Cham

arXiv:2108.01407 [pdf, other]

GalaxAI: Machine learning toolbox for interpretable analysis of spacecraft telemetry data

Authors: Ana Kostovska, Matej Petković, Tomaž Stepišnik, Luke Lucas, Timothy Finn, José Martínez-Heras, Panče Panov, Sašo Džeroski, Alessandro Donati, Nikola Simidjievski, Dragi Kocev

Abstract: We present GalaxAI - a versatile machine learning toolbox for efficient and interpretable end-to-end analysis of spacecraft telemetry data. GalaxAI employs various machine learning algorithms for multivariate time series analyses, classification, regression and structured output prediction, capable of handling high-throughput heterogeneous data. These methods allow for the construction of robust a… ▽ More We present GalaxAI - a versatile machine learning toolbox for efficient and interpretable end-to-end analysis of spacecraft telemetry data. GalaxAI employs various machine learning algorithms for multivariate time series analyses, classification, regression and structured output prediction, capable of handling high-throughput heterogeneous data. These methods allow for the construction of robust and accurate predictive models, that are in turn applied to different tasks of spacecraft monitoring and operations planning. More importantly, besides the accurate building of models, GalaxAI implements a visualisation layer, providing mission specialists and operators with a full, detailed and interpretable view of the data analysis process. We show the utility and versatility of GalaxAI on two use-cases concerning two different spacecraft: i) analysis and planning of Mars Express thermal power consumption and ii) predicting of INTEGRAL's crossings through Van Allen belts. △ Less

Submitted 9 August, 2021; v1 submitted 3 August, 2021; originally announced August 2021.

Journal ref: 8th IEEE International Conference on Space Mission Challenges for Information Technology (SMC-IT 2021)

arXiv:2106.15411 [pdf, other]

Explaining the Performance of Multi-label Classification Methods with Data Set Properties

Authors: Jasmin Bogatinovski, Ljupčo Todorovski, Sašo Džeroski, Dragi Kocev

Abstract: Meta learning generalizes the empirical experience with different learning tasks and holds promise for providing important empirical insight into the behaviour of machine learning algorithms. In this paper, we present a comprehensive meta-learning study of data sets and methods for multi-label classification (MLC). MLC is a practically relevant machine learning task where each example is labelled… ▽ More Meta learning generalizes the empirical experience with different learning tasks and holds promise for providing important empirical insight into the behaviour of machine learning algorithms. In this paper, we present a comprehensive meta-learning study of data sets and methods for multi-label classification (MLC). MLC is a practically relevant machine learning task where each example is labelled with multiple labels simultaneously. Here, we analyze 40 MLC data sets by using 50 meta features describing different properties of the data. The main findings of this study are as follows. First, the most prominent meta features that describe the space of MLC data sets are the ones assessing different aspects of the label space. Second, the meta models show that the most important meta features describe the label space, and, the meta features describing the relationships among the labels tend to occur a bit more often than the meta features describing the distributions between and within the individual labels. Third, the optimization of the hyperparameters can improve the predictive performance, however, quite often the extent of the improvements does not always justify the resource utilization. △ Less

Submitted 28 June, 2021; originally announced June 2021.

arXiv:2104.11889 [pdf, other]

doi 10.1145/3449726.3459579

OPTION: OPTImization Algorithm Benchmarking ONtology

Authors: Ana Kostovska, Diederick Vermetten, Carola Doerr, Sašo Džeroski, Panče Panov, Tome Eftimov

Abstract: Many platforms for benchmarking optimization algorithms offer users the possibility of sharing their experimental data with the purpose of promoting reproducible and reusable research. However, different platforms use different data models and formats, which drastically inhibits identification of relevant data sets, their interpretation, and their interoperability. Consequently, a semantically ric… ▽ More Many platforms for benchmarking optimization algorithms offer users the possibility of sharing their experimental data with the purpose of promoting reproducible and reusable research. However, different platforms use different data models and formats, which drastically inhibits identification of relevant data sets, their interpretation, and their interoperability. Consequently, a semantically rich, ontology-based, machine-readable data model is highly desired. We report in this paper on the development of such an ontology, which we name OPTION (OPTImization algorithm benchmarking ONtology). Our ontology provides the vocabulary needed for semantic annotation of the core entities involved in the benchmarking process, such as algorithms, problems, and evaluation measures. It also provides means for automated data integration, improved interoperability, powerful querying capabilities and reasoning, thereby enriching the value of the benchmark data. We demonstrate the utility of OPTION by annotating and querying a corpus of benchmark performance data from the BBOB workshop data - a use case which can be easily extended to cover other benchmarking data collections. △ Less

Submitted 24 April, 2021; originally announced April 2021.

Comments: To appear in the Proceedings of Genetic and Evolutionary Computation Conference Companion (GECCO 2021), ACM

arXiv:2102.07113 [pdf, other]

Comprehensive Comparative Study of Multi-Label Classification Methods

Authors: Jasmin Bogatinovski, Ljupčo Todorovski, Sašo Džeroski, Dragi Kocev

Abstract: Multi-label classification (MLC) has recently received increasing interest from the machine learning community. Several studies provide reviews of methods and datasets for MLC and a few provide empirical comparisons of MLC methods. However, they are limited in the number of methods and datasets considered. This work provides a comprehensive empirical study of a wide range of MLC methods on a pleth… ▽ More Multi-label classification (MLC) has recently received increasing interest from the machine learning community. Several studies provide reviews of methods and datasets for MLC and a few provide empirical comparisons of MLC methods. However, they are limited in the number of methods and datasets considered. This work provides a comprehensive empirical study of a wide range of MLC methods on a plethora of datasets from various domains. More specifically, our study evaluates 26 methods on 42 benchmark datasets using 20 evaluation measures. The adopted evaluation methodology adheres to the highest literature standards for designing and executing large scale, time-budgeted experimental studies. First, the methods are selected based on their usage by the community, assuring representation of methods across the MLC taxonomy of methods and different base learners. Second, the datasets cover a wide range of complexity and domains of application. The selected evaluation measures assess the predictive performance and the efficiency of the methods. The results of the analysis identify RFPCT, RFDTBR, ECCJ48, EBRJ48 and AdaBoostMH as best performing methods across the spectrum of performance measures. Whenever a new method is introduced, it should be compared to different subsets of MLC methods, determined on the basis of the different evaluation criteria. △ Less

Submitted 16 February, 2021; v1 submitted 14 February, 2021; originally announced February 2021.

arXiv:2101.09577 [pdf, other]

ReliefE: Feature Ranking in High-dimensional Spaces via Manifold Embeddings

Authors: Blaž Škrlj, Sašo Džeroski, Nada Lavrač, Matej Petković

Abstract: Feature ranking has been widely adopted in machine learning applications such as high-throughput biology and social sciences. The approaches of the popular Relief family of algorithms assign importances to features by iteratively accounting for nearest relevant and irrelevant instances. Despite their high utility, these algorithms can be computationally expensive and not-well suited for high-dimen… ▽ More Feature ranking has been widely adopted in machine learning applications such as high-throughput biology and social sciences. The approaches of the popular Relief family of algorithms assign importances to features by iteratively accounting for nearest relevant and irrelevant instances. Despite their high utility, these algorithms can be computationally expensive and not-well suited for high-dimensional sparse input spaces. In contrast, recent embedding-based methods learn compact, low-dimensional representations, potentially facilitating down-stream learning capabilities of conventional learners. This paper explores how the Relief branch of algorithms can be adapted to benefit from (Riemannian) manifold-based embeddings of instance and target spaces, where a given embedding's dimensionality is intrinsic to the dimensionality of the considered data set. The developed ReliefE algorithm is faster and can result in better feature rankings, as shown by our evaluation on 20 real-life data sets for multi-class and multi-label classification tasks. The utility of ReliefE for high-dimensional data sets is ensured by its implementation that utilizes sparse matrix algebraic operations. Finally, the relation of ReliefE to other ranking algorithms is studied via the Fuzzy Jaccard Index. △ Less

Submitted 23 January, 2021; originally announced January 2021.

arXiv:2012.00428 [pdf, other]

doi 10.1016/j.knosys.2021.107077

Probabilistic Grammars for Equation Discovery

Authors: Jure Brence, Ljupčo Todorovski, Sašo Džeroski

Abstract: Equation discovery, also known as symbolic regression, is a type of automated modeling that discovers scientific laws, expressed in the form of equations, from observed data and expert knowledge. Deterministic grammars, such as context-free grammars, have been used to limit the search spaces in equation discovery by providing hard constraints that specify which equations to consider and which not.… ▽ More Equation discovery, also known as symbolic regression, is a type of automated modeling that discovers scientific laws, expressed in the form of equations, from observed data and expert knowledge. Deterministic grammars, such as context-free grammars, have been used to limit the search spaces in equation discovery by providing hard constraints that specify which equations to consider and which not. In this paper, we propose the use of probabilistic context-free grammars in equation discovery. Such grammars encode soft constraints, specifying a prior probability distribution on the space of possible equations. We show that probabilistic grammars can be used to elegantly and flexibly formulate the parsimony principle, that favors simpler equations, through probabilities attached to the rules in the grammars. We demonstrate that the use of probabilistic, rather than deterministic grammars, in the context of a Monte-Carlo algorithm for grammar-based equation discovery, leads to more efficient equation discovery. Finally, by specifying prior probability distributions over equation spaces, the foundations are laid for Bayesian approaches to equation discovery. △ Less

Submitted 22 March, 2021; v1 submitted 1 December, 2020; originally announced December 2020.

Comments: Submitted to Knowledge-Based Systems, Elsevier. 28 pages + 13 pages appendix. 7 figures

ACM Class: I.2.4; I.2.6; I.1.1; I.1.3; G.3

arXiv:2011.11679 [pdf, other]

Ensemble- and Distance-Based Feature Ranking for Unsupervised Learning

Authors: Matej Petković, Dragi Kocev, Blaž Škrlj, Sašo Džeroski

Abstract: In this work, we propose two novel (groups of) methods for unsupervised feature ranking and selection. The first group includes feature ranking scores (Genie3 score, RandomForest score) that are computed from ensembles of predictive clustering trees. The second method is URelief, the unsupervised extension of the Relief family of feature ranking algorithms. Using 26 benchmark data sets and 5 basel… ▽ More In this work, we propose two novel (groups of) methods for unsupervised feature ranking and selection. The first group includes feature ranking scores (Genie3 score, RandomForest score) that are computed from ensembles of predictive clustering trees. The second method is URelief, the unsupervised extension of the Relief family of feature ranking algorithms. Using 26 benchmark data sets and 5 baselines, we show that both the Genie3 score (computed from the ensemble of extra trees) and the URelief method outperform the existing methods and that Genie3 performs best overall, in terms of predictive power of the top-ranked features. Additionally, we analyze the influence of the hyper-parameters of the proposed methods on their performance, and show that for the Genie3 score the highest quality is achieved by the most efficient parameter configuration. Finally, we propose a way of discovering the location of the features in the ranking, which are the most relevant in reality. △ Less

Submitted 23 November, 2020; originally announced November 2020.

Comments: 20 pages, 5 figures

arXiv:2008.03937 [pdf, other]

Feature Ranking for Semi-supervised Learning

Authors: Matej Petković, Sašo Džeroski, Dragi Kocev

Abstract: The data made available for analysis are becoming more and more complex along several directions: high dimensionality, number of examples and the amount of labels per example. This poses a variety of challenges for the existing machine learning methods: co** with dataset with a large number of examples that are described in a high-dimensional space and not all examples have labels provided. For… ▽ More The data made available for analysis are becoming more and more complex along several directions: high dimensionality, number of examples and the amount of labels per example. This poses a variety of challenges for the existing machine learning methods: co** with dataset with a large number of examples that are described in a high-dimensional space and not all examples have labels provided. For example, when investigating the toxicity of chemical compounds there are a lot of compounds available, that can be described with information rich high-dimensional representations, but not all of the compounds have information on their toxicity. To address these challenges, we propose semi-supervised learning of feature ranking. The feature rankings are learned in the context of classification and regression as well as in the context of structured output prediction (multi-label classification, hierarchical multi-label classification and multi-target regression). To the best of our knowledge, this is the first work that treats the task of feature ranking within the semi-supervised structured output prediction context. More specifically, we propose two approaches that are based on tree ensembles and the Relief family of algorithms. The extensive evaluation across 38 benchmark datasets reveals the following: Random Forests perform the best for the classification-like tasks, while for the regression-like tasks Extra-PCTs perform the best, Random Forests are the most efficient method considering induction times across all tasks, and semi-supervised feature rankings outperform their supervised counterpart across a majority of the datasets from the different tasks. △ Less

Submitted 10 August, 2020; originally announced August 2020.

Comments: Submitted to: Special Issue on Discovery Science of the Machine Learning Journal

arXiv:2002.04464 [pdf, other]

doi 10.3233/FAIA200256

Feature Importance Estimation with Self-Attention Networks

Authors: Blaž Škrlj, Sašo Džeroski, Nada Lavrač, Matej Petkovič

Abstract: Black-box neural network models are widely used in industry and science, yet are hard to understand and interpret. Recently, the attention mechanism was introduced, offering insights into the inner workings of neural language models. This paper explores the use of attention-based neural networks mechanism for estimating feature importance, as means for explaining the models learned from propositio… ▽ More Black-box neural network models are widely used in industry and science, yet are hard to understand and interpret. Recently, the attention mechanism was introduced, offering insights into the inner workings of neural language models. This paper explores the use of attention-based neural networks mechanism for estimating feature importance, as means for explaining the models learned from propositional (tabular) data. Feature importance estimates, assessed by the proposed Self-Attention Network (SAN) architecture, are compared with the established ReliefF, Mutual Information and Random Forest-based estimates, which are widely used in practice for model interpretation. For the first time we conduct scale-free comparisons of feature importance estimates across algorithms on ten real and synthetic data sets to study the similarities and differences of the resulting feature importance estimates, showing that SANs identify similar high-ranked features as the other methods. We demonstrate that SANs identify feature interactions which in some cases yield better predictive performance than the baselines, suggesting that attention extends beyond interactions of just a few key features and detects larger feature subsets relevant for the considered learning task. △ Less

Submitted 11 February, 2020; originally announced February 2020.

Comments: Accepted for publication in ECAI 2020

arXiv:1907.00821 [pdf, other]

Equation Discovery for Nonlinear System Identification

Authors: Nikola Simidjievski, Ljupčo Todorovski, Juš Kocijan, Sašo Džeroski

Abstract: Equation discovery methods enable modelers to combine domain-specific knowledge and system identification to construct models most suitable for a selected modeling task. The method described and evaluated in this paper can be used as a nonlinear system identification method for gray-box modeling. It consists of two interlaced parts of modeling that are computer-aided. The first performs computer-a… ▽ More Equation discovery methods enable modelers to combine domain-specific knowledge and system identification to construct models most suitable for a selected modeling task. The method described and evaluated in this paper can be used as a nonlinear system identification method for gray-box modeling. It consists of two interlaced parts of modeling that are computer-aided. The first performs computer-aided identification of a model structure composed of elements selected from user-specified domain-specific modeling knowledge, while the second part performs parameter estimation. In this paper, recent developments of the equation discovery method called process-based modeling, suited for nonlinear system identification, are elaborated and illustrated on two continuous-time case studies. The first case study illustrates the use of the process-based modeling on synthetic data while the second case-study evaluates on measured data for a standard system-identification benchmark. The experimental results clearly demonstrate the ability of process-based modeling to reconstruct both model structure and parameters from measured data. △ Less

Submitted 1 July, 2019; originally announced July 2019.

arXiv:1906.09088 [pdf, other]

doi 10.1109/ACCESS.2019.2959846

Meta-Model Framework for Surrogate-Based Parameter Estimation in Dynamical Systems

Authors: Žiga Lukšič, Jovan Tanevski, Sašo Džeroski, Ljupčo Todorovski

Abstract: The central task in modeling complex dynamical systems is parameter estimation. This task involves numerous evaluations of a computationally expensive objective function. Surrogate-based optimization introduces a computationally efficient predictive model that approximates the value of the objective function. The standard approach involves learning a surrogate from training examples that correspon… ▽ More The central task in modeling complex dynamical systems is parameter estimation. This task involves numerous evaluations of a computationally expensive objective function. Surrogate-based optimization introduces a computationally efficient predictive model that approximates the value of the objective function. The standard approach involves learning a surrogate from training examples that correspond to past evaluations of the objective function. Current surrogate-based optimization methods use static, predefined substitution strategies that decide when to use the surrogate and when the true objective. We introduce a meta-model framework where the substitution strategy is dynamically adapted to the solution space of the given optimization problem. The meta model encapsulates the objective function, the surrogate model and the model of the substitution strategy, as well as components for learning them. The framework can be seamlessly coupled with an arbitrary optimization algorithm without any modification: it replaces the objective function and autonomously decides how to evaluate a given candidate solution. We test the utility of the framework on three tasks of estimating parameters of real-world models of dynamical systems. The results show that the meta model significantly improves the efficiency of optimization, reducing the total number of evaluations of the objective function up to an average of 77%. △ Less

Submitted 18 December, 2019; v1 submitted 21 June, 2019; originally announced June 2019.

arXiv:1809.00542 [pdf, other]

doi 10.1109/MAES.2019.2915456

Machine learning for predicting thermal power consumption of the Mars Express Spacecraft

Authors: Matej Petković, Redouane Boumghar, Martin Breskvar, Sašo Džeroski, Dragi Kocev, Jurica Levatić, Luke Lucas, Aljaž Osojnik, Bernard Ženko, Nikola Simidjievski

Abstract: The thermal subsystem of the Mars Express (MEX) spacecraft keeps the on-board equipment within its pre-defined operating temperatures range. To plan and optimize the scientific operations of MEX, its operators need to estimate in advance, as accurately as possible, the power consumption of the thermal subsystem. The remaining power can then be allocated for scientific purposes. We present a machin… ▽ More The thermal subsystem of the Mars Express (MEX) spacecraft keeps the on-board equipment within its pre-defined operating temperatures range. To plan and optimize the scientific operations of MEX, its operators need to estimate in advance, as accurately as possible, the power consumption of the thermal subsystem. The remaining power can then be allocated for scientific purposes. We present a machine learning pipeline for efficiently constructing accurate predictive models for predicting the power of the thermal subsystem on board MEX. In particular, we employ state-of-the-art feature engineering approaches for transforming raw telemetry data, in turn used for constructing accurate models with different state-of-the-art machine learning methods. We show that the proposed pipeline considerably improve our previous (competition-winning) work in terms of time efficiency and predictive performance. Moreover, while achieving superior predictive performance, the constructed models also provide important insight into the spacecraft's behavior, allowing for further analyses and optimal planning of MEX's operation. △ Less

Submitted 16 January, 2019; v1 submitted 3 September, 2018; originally announced September 2018.

Journal ref: IEEE Aerospace and Electronic Systems Magazine. 34(7), 46-60. (2019)

arXiv:1804.06207 [pdf, other]

MetaBags: Bagged Meta-Decision Trees for Regression

Authors: Jihed Khiari, Luis Moreira-Matias, Ammar Shaker, Bernard Zenko, Saso Dzeroski

Abstract: Ensembles are popular methods for solving practical supervised learning problems. They reduce the risk of having underperforming models in production-grade software. Although critical, methods for learning heterogeneous regression ensembles have not been proposed at large scale, whereas in classical ML literature, stacking, cascading and voting are mostly restricted to classification problems. Reg… ▽ More Ensembles are popular methods for solving practical supervised learning problems. They reduce the risk of having underperforming models in production-grade software. Although critical, methods for learning heterogeneous regression ensembles have not been proposed at large scale, whereas in classical ML literature, stacking, cascading and voting are mostly restricted to classification problems. Regression poses distinct learning challenges that may result in poor performance, even when using well established homogeneous ensemble schemas such as bagging or boosting. In this paper, we introduce MetaBags, a novel, practically useful stacking framework for regression. MetaBags is a meta-learning algorithm that learns a set of meta-decision trees designed to select one base model (i.e. expert) for each query, and focuses on inductive bias reduction. A set of meta-decision trees are learned using different types of meta-features, specially created for this purpose - to then be bagged at meta-level. This procedure is designed to learn a model with a fair bias-variance trade-off, and its improvement over base model performance is correlated with the prediction diversity of different experts on specific input space subregions. The proposed method and meta-features are designed in such a way that they enable good predictive performance even in subregions of space which are not adequately represented in the available training data. An exhaustive empirical testing of the method was performed, evaluating both generalization error and scalability of the approach on synthetic, open and real-world application datasets. The obtained results show that our method significantly outperforms existing state-of-the-art approaches. △ Less

Submitted 17 April, 2018; originally announced April 2018.

arXiv:1712.03100 [pdf, other]

doi 10.1088/1367-2630/aae941

Decoupling approximation robustly reconstructs directed dynamical networks

Authors: Nikola Simidjievski, Jovan Tanevski, Bernard Zenko, Zoran Levnajic, Ljupco Todorovski, Saso Dzeroski

Abstract: Methods for reconstructing the topology of complex networks from time-resolved observations of node dynamics are gaining relevance across scientific disciplines. Of biggest practical interest are methods that make no assumptions about properties of the dynamics, and can cope with noisy, short and incomplete trajectories. Ideal reconstruction in such scenario requires and exhaustive approach of sim… ▽ More Methods for reconstructing the topology of complex networks from time-resolved observations of node dynamics are gaining relevance across scientific disciplines. Of biggest practical interest are methods that make no assumptions about properties of the dynamics, and can cope with noisy, short and incomplete trajectories. Ideal reconstruction in such scenario requires and exhaustive approach of simulating the dynamics for all possible network configurations and matching the simulated against the actual trajectories, which of course is computationally too costly for any realistic application. Relying on insights from equation discovery and machine learning, we here introduce \textit{decoupling approximation} of dynamical networks and propose a new reconstruction method based on it. Decoupling approximation consists of matching the simulated against the actual trajectories for each node individually rather than for the entire network at once. Despite drastic reduction of the computational cost that this approximation entails, we find our method's performance to be very close to that of the ideal method. In particular, we not only make no assumptions about properties of the trajectories, but provide strong evidence that our methods' performance is largely independent of the dynamical regime at hand. Of crucial relevance for practical applications, we also find our method to be extremely robust to both length and resolution of the trajectories and relatively insensitive to noise. △ Less

Submitted 7 November, 2018; v1 submitted 8 December, 2017; originally announced December 2017.

Journal ref: New J. Phys. 20 (11), 113003 (2018)

arXiv:1702.06831 [pdf]

doi 10.1371/journal.pone.0187364

Using Redescription Mining to Relate Clinical and Biological Characteristics of Cognitively Impaired and Alzheimer's Disease Patients

Authors: Matej Mihelčić, Goran Šimić, Mirjana Babić Leko, Nada Lavrač, Sašo Džeroski, Tomislav Šmuc

Abstract: We used redescription mining to find interpretable rules revealing associations between those determinants that provide insights about the Alzheimer's disease (AD). We extended the CLUS-RM redescription mining algorithm to a constraint-based redescription mining (CBRM) setting, which enables several modes of targeted exploration of specific, user-constrained associations. Redescription mining enab… ▽ More We used redescription mining to find interpretable rules revealing associations between those determinants that provide insights about the Alzheimer's disease (AD). We extended the CLUS-RM redescription mining algorithm to a constraint-based redescription mining (CBRM) setting, which enables several modes of targeted exploration of specific, user-constrained associations. Redescription mining enabled finding specific constructs of clinical and biological attributes that describe many groups of subjects of different size, homogeneity and levels of cognitive impairment. We confirmed some previously known findings. However, in some instances, as with the attributes: testosterone, the imaging attribute Spatial Pattern of Abnormalities for Recognition of Early AD, as well as the levels of leptin and angiopoietin-2 in plasma, we corroborated previously debatable findings or provided additional information about these variables and their association with AD pathogenesis. Applying redescription mining on ADNI data resulted with the discovery of one largely unknown attribute: the Pregnancy-Associated Protein-A (PAPP-A), which we found highly associated with cognitive impairment in AD. Statistically significant correlations (p <= 0.01) were found between PAPP-A and various different clinical tests. The high importance of this finding lies in the fact that PAPP-A is a metalloproteinase, known to cleave insulin-like growth factor binding proteins. Since it also shares similar substrates with A Disintegrin and the Metalloproteinase family of enzymes that act as α-secretase to physiologically cleave amyloid precursor protein (APP) in the non-amyloidogenic pathway, it could be directly involved in the metabolism of APP very early during the disease course. Therefore, further studies should investigate the role of PAPP-A in the development of AD more thoroughly. △ Less

Submitted 14 November, 2017; v1 submitted 20 February, 2017; originally announced February 2017.

arXiv:1606.03935 [pdf, other]

doi 10.1016/j.eswa.2016.10.012

A framework for redescription set construction

Authors: Matej Mihelčić, Sašo Džeroski, Nada Lavrač, Tomislav Šmuc

Abstract: Redescription mining is a field of knowledge discovery that aims at finding different descriptions of similar subsets of instances in the data. These descriptions are represented as rules inferred from one or more disjoint sets of attributes, called views. As such, they support knowledge discovery process and help domain experts in formulating new hypotheses or constructing new knowledge bases and… ▽ More Redescription mining is a field of knowledge discovery that aims at finding different descriptions of similar subsets of instances in the data. These descriptions are represented as rules inferred from one or more disjoint sets of attributes, called views. As such, they support knowledge discovery process and help domain experts in formulating new hypotheses or constructing new knowledge bases and decision support systems. In contrast to previous approaches that typically create one smaller set of redescriptions satisfying a pre-defined set of constraints, we introduce a framework that creates large and heterogeneous redescription set from which user/expert can extract compact sets of differing properties, according to its own preferences. Construction of large and heterogeneous redescription set relies on CLUS-RM algorithm and a novel, conjunctive refinement procedure that facilitates generation of larger and more accurate redescription sets. The work also introduces the variability of redescription accuracy when missing values are present in the data, which significantly extends applicability of the method. Crucial part of the framework is the redescription set extraction based on heuristic multi-objective optimization procedure that allows user to define importance levels towards one or more redescription quality criteria. We provide both theoretical and empirical comparison of the novel framework against current state of the art redescription mining algorithms and show that it represents more efficient and versatile approach for mining redescriptions from data. △ Less

Submitted 19 December, 2016; v1 submitted 13 June, 2016; originally announced June 2016.

Showing 1–30 of 30 results for author: Džeroski, S