-
A review of feature selection strategies utilizing graph data structures and knowledge graphs
Authors:
Sisi Shao,
Pedro Henrique Ribeiro,
Christina Ramirez,
Jason H. Moore
Abstract:
Feature selection in Knowledge Graphs (KGs) is increasingly utilized in diverse domains, including biomedical research, Natural Language Processing (NLP), and personalized recommendation systems. This paper delves into the methodologies for feature selection within KGs, emphasizing their roles in enhancing machine learning (ML) model efficacy, hypothesis generation, and interpretability. Through this comprehensive review, we aim to catalyze further innovation in feature selection for KGs, paving the way for more insightful, efficient, and interpretable analytical models across various domains. Our exploration reveals the critical importance of scalability, accuracy, and interpretability in feature selection techniques, advocating for the integration of domain knowledge to refine the selection process. We highlight the burgeoning potential of multi-objective optimization and interdisciplinary collaboration in advancing KG feature selection, underscoring the transformative impact of such methodologies on precision medicine, among other fields. The paper concludes by charting future directions, including the development of scalable, dynamic feature selection algorithms and the integration of explainable AI principles to foster transparency and trust in KG-driven models.
Submitted 21 June, 2024;
originally announced June 2024.
-
Lexidate: Model Evaluation and Selection with Lexicase
Authors:
Jose Guadalupe Hernandez,
Anil Kumar Saini,
Jason H. Moore
Abstract:
Automated machine learning streamlines the task of finding effective machine learning pipelines by automating model training, evaluation, and selection. Traditional evaluation strategies, like cross-validation (CV), generate one value that averages the accuracy of a pipeline's predictions. This single value, however, may not fully describe the generalizability of the pipeline. Here, we present Lexicase-based Validation (lexidate), a method that uses multiple, independent prediction values for selection. Lexidate splits training data into a learning set and a selection set. Pipelines are trained on the learning set and make predictions on the selection set. The predictions are graded for correctness and used by lexicase selection to identify parent pipelines. Compared to 10-fold CV, lexicase reduces the training time. We test the effectiveness of three lexidate configurations within the Tree-based Pipeline Optimization Tool 2 (TPOT2) package on six OpenML classification tasks. In one configuration, we detected no difference in the accuracy of the final model returned from TPOT2 on most tasks compared to 10-fold CV. All configurations studied here returned similar or less complex final pipelines compared to 10-fold CV.
Submitted 17 June, 2024;
originally announced June 2024.
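The Lexidate abstract above describes grading each pipeline's predictions on a selection set and handing the per-sample grades to lexicase selection. The following is a minimal Python sketch of that selection step, assuming binary correct/incorrect grades. The function names and toy predictions are hypothetical illustrations and are not taken from TPOT2.

```python
import random


def grade(predictions, labels):
    """Grade predictions per sample: 1 if correct, 0 otherwise."""
    return [int(p == y) for p, y in zip(predictions, labels)]


def lexicase_select(case_scores, rng):
    """Pick one parent index with lexicase selection.

    case_scores[i][j] is the grade of candidate i on selection-set sample j.
    Samples are visited in random order; at each sample only the candidates
    with the best grade on that sample survive.
    """
    candidates = list(range(len(case_scores)))
    cases = list(range(len(case_scores[0])))
    rng.shuffle(cases)
    for case in cases:
        best = max(case_scores[i][case] for i in candidates)
        candidates = [i for i in candidates if case_scores[i][case] == best]
        if len(candidates) == 1:
            break
    # Any remaining ties are broken at random.
    return rng.choice(candidates)


# Toy selection set with three hypothetical pipelines.
selection_labels = [0, 1, 1, 0]
pipeline_predictions = [
    [0, 1, 1, 0],  # pipeline 0: correct on every sample
    [0, 1, 0, 0],  # pipeline 1: wrong on sample 2
    [1, 0, 1, 1],  # pipeline 2: correct only on sample 2
]
case_scores = [grade(p, selection_labels) for p in pipeline_predictions]
parent = lexicase_select(case_scores, random.Random(0))
```

In this toy case pipeline 0 is the only candidate that survives every sample, so it is selected regardless of the shuffle order. With real pipelines, different shuffles can favor different specialists, which is the property lexicase selection relies on.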
-
Unleashing the Power of Multi-Task Learning: A Comprehensive Survey Spanning Traditional, Deep, and Pretrained Foundation Model Eras
Authors:
Jun Yu,
Yutong Dai,
Xiaokang Liu,
** Huang,
Yishan Shen,
Ke Zhang,
Rong Zhou,
Eashan Adhikarla,
Wenxuan Ye,
Yixin Liu,
Zhaoming Kong,
Kai Zhang,
Yilong Yin,
Vinod Namboodiri,
Brian D. Davison,
Jason H. Moore,
Yong Chen
Abstract:
Multi-task learning (MTL) is a learning paradigm that effectively leverages both task-specific and shared information to address multiple related tasks simultaneously. In contrast to single-task learning (STL), MTL offers a suite of benefits that enhance both the training process and the inference efficiency. MTL's key advantages encompass streamlined model architecture, performance enhancement, and cross-domain generalizability. Over the past twenty years, MTL has become widely recognized as a flexible and effective approach in various fields, including computer vision (CV), natural language processing (NLP), recommendation systems, disease prognosis and diagnosis, and robotics. This survey provides a comprehensive overview of the evolution of MTL, encompassing the technical aspects of cutting-edge methods from traditional approaches to deep learning and the latest trend of pretrained foundation models. Our survey methodically categorizes MTL techniques into five key areas: regularization, relationship learning, feature propagation, optimization, and pre-training. This categorization not only chronologically outlines the development of MTL but also dives into various specialized strategies within each category. Furthermore, the survey reveals how MTL evolves from handling a fixed set of tasks to embracing a more flexible approach free from task or modality constraints. It explores the concepts of task-promptable and task-agnostic training, along with the capacity for zero-shot learning (ZSL), which unleashes the untapped potential of this historically coveted learning paradigm. Overall, we hope this survey provides the research community with a comprehensive overview of the advancements in MTL from its inception in 1997 to the present in 2023. We address present challenges and look ahead to future possibilities, shedding light on the opportunities and potential avenues for MTL research in a broad manner. This project is publicly available at https://github.com/junfish/Awesome-Multitask-Learning.
Submitted 29 April, 2024;
originally announced April 2024.
-
Genetic Programming Theory and Practice: A Fifteen-Year Trajectory
Authors:
Moshe Sipper,
Jason H. Moore
Abstract:
The GPTP workshop series, which began in 2003, has served over the years as a focal meeting for genetic programming (GP) researchers. As such, we think it provides an excellent source for studying the development of GP over the past fifteen years. We thus present herein a trajectory of the thematic developments in the field of GP.
Submitted 1 February, 2024;
originally announced February 2024.
-
Coevolving Artistic Images Using OMNIREP
Authors:
Moshe Sipper,
Jason H. Moore,
Ryan J. Urbanowicz
Abstract:
We have recently developed OMNIREP, a coevolutionary algorithm to discover both a representation and an interpreter that solve a particular problem of interest. Herein, we demonstrate that the OMNIREP framework can be successfully applied within the field of evolutionary art. Specifically, we coevolve representations that encode image position, alongside interpreters that transform these positions into one of three pre-defined shapes (chunks, polygons, or circles) of varying size, shape, and color. We showcase a sampling of the unique image variations produced by this approach.
Submitted 20 January, 2024;
originally announced January 2024.
-
New Pathways in Coevolutionary Computation
Authors:
Moshe Sipper,
Jason H. Moore,
Ryan J. Urbanowicz
Abstract:
The simultaneous evolution of two or more species with coupled fitness -- coevolution -- has been put to good use in the field of evolutionary computation. Herein, we present two new forms of coevolutionary algorithms, which we have recently designed and applied with success. OMNIREP is a cooperative coevolutionary algorithm that discovers both a representation and an encoding for solving a particular problem of interest. SAFE is a commensalistic coevolutionary algorithm that maintains two coevolving populations: a population of candidate solutions and a population of candidate objective functions needed to measure solution quality during evolution.
Submitted 19 January, 2024;
originally announced January 2024.
-
Perceptual and technical barriers in sharing and formatting metadata accompanying omics studies
Authors:
Yu-Ning Huang,
Michael I. Love,
Cynthia Flaire Ronkowski,
Dhrithi Deshpande,
Lynn M. Schriml,
Annie Wong-Beringer,
Barend Mons,
Russell Corbett-Detig,
Christopher I Hunter,
Jason H. Moore,
Lana X. Garmire,
T. B. K. Reddy,
Winston A. Hide,
Atul J. Butte,
Mark D. Robinson,
Serghei Mangul
Abstract:
Metadata, often termed "data about data," is crucial for organizing, understanding, and managing vast omics datasets. It aids in efficient data discovery, integration, and interpretation, enabling users to access, comprehend, and utilize data effectively. Its significance spans the domains of scientific research, facilitating data reproducibility, reusability, and secondary analysis. However, numerous perceptual and technical barriers hinder the sharing of metadata among researchers. These barriers compromise the reliability of research results and hinder integrative meta-analyses of omics studies. This study highlights the key barriers to metadata sharing, including the lack of uniform standards, privacy and legal concerns, limitations in study design, limited incentives, inadequate infrastructure, and the dearth of well-trained personnel for metadata management and reuse. Proposed solutions include promoting standardization, educational efforts, the role of journals and funding agencies, incentives and rewards, and the improvement of infrastructure. More accurate, reliable, and impactful research outcomes are achievable if the scientific community addresses these barriers.
Submitted 22 November, 2023;
originally announced January 2024.
-
Faster Convergence with Lexicase Selection in Tree-based Automated Machine Learning
Authors:
Nicholas Matsumoto,
Anil Kumar Saini,
Pedro Ribeiro,
Hyunjun Choi,
Alena Orlenko,
Leo-Pekka Lyytikäinen,
Jari O Laurikka,
Terho Lehtimäki,
Sandra Batista,
Jason H. Moore
Abstract:
In many evolutionary computation systems, parent selection methods can affect, among other things, convergence to a solution. In this paper, we present a study comparing the role of two commonly used parent selection methods in evolving machine learning pipelines in an automated machine learning system called Tree-based Pipeline Optimization Tool (TPOT). Specifically, we demonstrate, using experiments on multiple datasets, that lexicase selection leads to significantly faster convergence as compared to NSGA-II in TPOT. We also compare the exploration of parts of the search space by these selection methods using a trie data structure that contains information about the pipelines explored in a particular run.
Submitted 1 February, 2023;
originally announced February 2023.
-
Benchmarking AutoML algorithms on a collection of synthetic classification problems
Authors:
Pedro Henrique Ribeiro,
Patryk Orzechowski,
Joost Wagenaar,
Jason H. Moore
Abstract:
Automated machine learning (AutoML) algorithms have grown in popularity due to their high performance and flexibility to adapt to different problems and data sets. With the increasing number of AutoML algorithms, deciding which would best suit a given problem becomes increasingly more work. Therefore, it is essential to use complex and challenging benchmarks which would be able to differentiate the AutoML algorithms from each other. This paper compares the performance of four different AutoML algorithms: Tree-based Pipeline Optimization Tool (TPOT), Auto-Sklearn, Auto-Sklearn 2, and H2O AutoML. We use the Diverse and Generative ML benchmark (DIGEN), a diverse set of synthetic datasets derived from generative functions designed to highlight the strengths and weaknesses of the performance of common machine learning algorithms. We confirm that AutoML can identify pipelines that perform well on all included datasets. Most AutoML algorithms performed similarly; however, there were some differences depending on the specific dataset and metric used.
Submitted 8 March, 2023; v1 submitted 5 December, 2022;
originally announced December 2022.
-
Applying Autonomous Hybrid Agent-based Computing to Difficult Optimization Problems
Authors:
Mateusz Godzik,
Jacek Dajda,
Marek Kisiel-Dorohinicki,
Aleksander Byrski,
Leszek Rutkowski,
Patryk Orzechowski,
Joost Wagenaar,
Jason H. Moore
Abstract:
Evolutionary multi-agent systems (EMASs) are very good at dealing with difficult, multi-dimensional problems; their efficacy was proven theoretically based on analysis of the relevant Markov-chain-based model. Research now continues on introducing autonomous hybridization into EMAS. This paper focuses on a proposed hybrid version of EMAS and covers the selection and introduction of a number of hybrid operators and the definition of rules for starting the hybrid steps of the main algorithm. Those hybrid steps leverage existing, well-known metaheuristics that have proven to be efficient, and integrate their results into the main algorithm. The discussed modifications are evaluated based on a number of difficult continuous-optimization benchmarks.
Submitted 24 October, 2022;
originally announced October 2022.
-
Automatically Balancing Model Accuracy and Complexity using Solution and Fitness Evolution (SAFE)
Authors:
Moshe Sipper,
Jason H. Moore,
Ryan J. Urbanowicz
Abstract:
When seeking a predictive model in biomedical data, one often has more than a single objective in mind, e.g., attaining both high accuracy and low complexity (to promote interpretability). We investigate herein whether multiple objectives can be dynamically tuned by our recently proposed coevolutionary algorithm, SAFE (Solution And Fitness Evolution). We find that SAFE is able to automatically tune accuracy and complexity with no performance loss, as compared with a standard evolutionary algorithm, over complex simulated genetics datasets produced by the GAMETES tool.
Submitted 30 June, 2022;
originally announced June 2022.
-
Solution and Fitness Evolution (SAFE): A Study of Multiobjective Problems
Authors:
Moshe Sipper,
Jason H. Moore,
Ryan J. Urbanowicz
Abstract:
We have recently presented SAFE -- Solution And Fitness Evolution -- a commensalistic coevolutionary algorithm that maintains two coevolving populations: a population of candidate solutions and a population of candidate objective functions. We showed that SAFE was successful at evolving solutions within a robotic maze domain. Herein we present an investigation of SAFE's adaptation and application to multiobjective problems, wherein candidate objective functions explore different weightings of each objective. Though preliminary, the results suggest that SAFE, and the concept of coevolving solutions and objective functions, can identify a similar set of optimal multiobjective solutions without explicitly employing a Pareto front for fitness calculation and parent selection. These findings support our hypothesis that the SAFE algorithm concept can not only solve complex problems, but can adapt to the challenge of problems with multiple objectives.
Submitted 25 June, 2022;
originally announced June 2022.
-
Solution and Fitness Evolution (SAFE): Coevolving Solutions and Their Objective Functions
Authors:
Moshe Sipper,
Jason H. Moore,
Ryan J. Urbanowicz
Abstract:
We recently highlighted a fundamental problem recognized to confound algorithmic optimization, namely, conflating the objective with the objective function. Even when the former is well defined, the latter may not be obvious, e.g., in learning a strategy to navigate a maze to find a goal (objective), an effective objective function to evaluate strategies may not be a simple function of the distance to the objective. We proposed to automate the means by which a good objective function may be discovered -- a proposal reified herein. We present Solution And Fitness Evolution (SAFE), a commensalistic coevolutionary algorithm that maintains two coevolving populations: a population of candidate solutions and a population of candidate objective functions. As proof of principle of this concept, we show that SAFE successfully evolves not only solutions within a robotic maze domain, but also the objective functions needed to measure solution quality during evolution.
Submitted 25 June, 2022;
originally announced June 2022.
-
Symbolic-Regression Boosting
Authors:
Moshe Sipper,
Jason H Moore
Abstract:
Modifying standard gradient boosting by replacing the embedded weak learner with a strong(er) one, we present SyRBo: Symbolic-Regression Boosting. Experiments over 98 regression datasets show that by adding a small number of boosting stages -- between 2 and 5 -- to a symbolic regressor, statistically significant improvements can often be attained. We note that coding SyRBo on top of any symbolic regressor is straightforward, and the added cost is simply a few more evolutionary rounds. SyRBo is essentially a simple add-on that can be readily added to an extant symbolic regressor, often with beneficial results.
Submitted 24 June, 2022;
originally announced June 2022.
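The Symbolic-Regression Boosting abstract above describes running a few boosting stages on top of a base regressor. The following is a minimal Python sketch of stage-wise boosting on residuals (squared-error loss, learning rate 1). The base learner here is a hypothetical one-split piecewise-constant model used purely as a stand-in so the example runs with the standard library; SyRBo itself uses a symbolic regressor, and none of these function names come from the paper.

```python
def fit_stump(xs, residuals):
    """Stand-in base learner: best single-threshold piecewise-constant fit.

    In SyRBo this role would be played by a symbolic regressor.
    """
    best = None
    for t in sorted(set(xs))[:-1]:  # skip the largest x so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, left_mean, right_mean)
    _, t, left_mean, right_mean = best
    return lambda x: left_mean if x <= t else right_mean


def boost(xs, ys, fit_base, stages):
    """Fit each stage to the residuals left by the previous stages."""
    models = []
    residuals = list(ys)
    for _ in range(stages):
        model = fit_base(xs, residuals)
        models.append(model)
        residuals = [r - model(x) for x, r in zip(xs, residuals)]
    # The ensemble prediction is the sum of all stage predictions.
    return lambda x: sum(m(x) for m in models)


def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Toy data: y = x^2 on a small grid.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [x * x for x in xs]
mse_1 = mse(boost(xs, ys, fit_stump, 1), xs, ys)
mse_4 = mse(boost(xs, ys, fit_stump, 4), xs, ys)
```

Passing the base learner as `fit_base` keeps the boosting loop independent of the regressor, which matches the abstract's point that the scheme can be layered on an existing regressor. On this toy data the training error after four stages is lower than after one.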
-
Contemporary Symbolic Regression Methods and their Relative Performance
Authors:
William La Cava,
Patryk Orzechowski,
Bogdan Burlacu,
Fabrício Olivetti de França,
Marco Virgolin,
Ying **,
Michael Kommenda,
Jason H. Moore
Abstract:
Many promising approaches to symbolic regression have been presented in recent years, yet progress in the field continues to suffer from a lack of uniform, robust, and transparent benchmarking standards. In this paper, we address this shortcoming by introducing an open-source, reproducible benchmarking platform for symbolic regression. We assess 14 symbolic regression methods and 7 machine learning methods on a set of 252 diverse regression problems. Our assessment includes both real-world datasets with no known model form as well as ground-truth benchmark problems, including physics equations and systems of ordinary differential equations. For the real-world datasets, we benchmark the ability of each method to learn models with low error and low complexity relative to state-of-the-art machine learning methods. For the synthetic problems, we assess each method's ability to find exact solutions in the presence of varying levels of noise. Under these controlled experiments, we conclude that the best performing methods for real-world regression combine genetic algorithms with parameter estimation and/or semantic search drivers. When tasked with recovering exact equations in the presence of noise, we find that deep learning and genetic algorithm-based approaches perform similarly. We provide a detailed guide to reproducing this experiment and contributing new methods, and encourage other researchers to collaborate with us on a common and living symbolic regression benchmark.
Submitted 29 July, 2021;
originally announced July 2021.
-
Benchmarking AutoML Frameworks for Disease Prediction Using Medical Claims
Authors:
Roland Albert A. Romero,
Mariefel Nicole Y. Deypalan,
Suchit Mehrotra,
John Titus Jungao,
Natalie E. Sheils,
Elisabetta Manduchi,
Jason H. Moore
Abstract:
We ascertain and compare the performances of AutoML tools on large, highly imbalanced healthcare datasets.
We generated a large dataset using historical administrative claims including demographic information and flags for disease codes in four different time windows prior to 2019. We then trained three AutoML tools on this dataset to predict six different disease outcomes in 2019 and evaluated model performances on several metrics.
The AutoML tools showed improvement from the baseline random forest model but did not differ significantly from each other. All models recorded low area under the precision-recall curve and failed to predict true positives while keeping the true negative rate high. Model performance was not directly related to prevalence. We provide a specific use-case to illustrate how to select a threshold that gives the best balance between true and false positive rates, as this is an important consideration in medical applications.
Healthcare datasets present several challenges for AutoML tools, including large sample size, high imbalance, and limitations in the available feature types. Improvements in scalability, combinations of imbalance-learning resampling and ensemble approaches, and curated feature selection are possible next steps to achieve better performance.
Among the three explored, no AutoML tool consistently outperforms the rest in terms of predictive performance. The performances of the models in this study suggest that there may be room for improvement in handling medical claims data. Finally, selection of the optimal prediction threshold should be guided by the specific practical application.
Submitted 22 July, 2021;
originally announced July 2021.
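The medical-claims abstract above mentions selecting a prediction threshold that balances true and false positive rates. The following is a minimal Python sketch of one way to do that, assuming the balance is measured by Youden's J statistic (true positive rate minus false positive rate). That criterion is an assumption made for illustration; the abstract does not state which criterion the authors used, and the scores and labels are made-up toy values.

```python
def rates(scores, labels, threshold):
    """Return (true positive rate, false positive rate) at a threshold.

    A sample is predicted positive when its score is >= threshold.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    return tp / positives, fp / negatives


def best_threshold(scores, labels):
    """Pick the observed score that maximizes Youden's J = TPR - FPR."""
    def youden_j(threshold):
        tpr, fpr = rates(scores, labels, threshold)
        return tpr - fpr
    return max(sorted(set(scores)), key=youden_j)


# Toy predicted probabilities and true labels (1 = disease present).
scores = [0.05, 0.1, 0.2, 0.3, 0.35, 0.5, 0.7, 0.9]
labels = [0, 0, 0, 1, 0, 0, 1, 1]
best = best_threshold(scores, labels)
tpr, fpr = rates(scores, labels, best)
```

In practice the criterion should reflect the application, as the abstract notes: a screening setting might weight missed cases more heavily than false alarms, which would shift the chosen threshold.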
-
Generative and reproducible benchmarks for comprehensive evaluation of machine learning classifiers
Authors:
Patryk Orzechowski,
Jason H. Moore
Abstract:
Understanding the strengths and weaknesses of machine learning (ML) algorithms is crucial for determining their scope of application. Here, we introduce the DIverse and GENerative ML Benchmark (DIGEN) - a collection of synthetic datasets for comprehensive, reproducible, and interpretable benchmarking of machine learning algorithms for classification of binary outcomes. The DIGEN resource consists of 40 mathematical functions which map continuous features to discrete endpoints for creating synthetic datasets. These 40 functions were discovered using a heuristic algorithm designed to maximize the diversity of performance among multiple popular machine learning algorithms, thus providing a useful test suite for evaluating and comparing new methods. Access to the generative functions facilitates understanding of why a method performs poorly compared to other algorithms, thus providing ideas for improvement. The resource with extensive documentation and analyses is open-source and available on GitHub.
Submitted 13 July, 2021;
originally announced July 2021.
-
EBIC.JL -- an Efficient Implementation of Evolutionary Biclustering Algorithm in Julia
Authors:
Paweł Renc,
Patryk Orzechowski,
Aleksander Byrski,
Jarosław Wąs,
Jason H. Moore
Abstract:
Biclustering is a data mining technique which searches for local patterns in numeric tabular data with main application in bioinformatics. This technique has shown promise in multiple areas, including development of biomarkers for cancer, disease subtype identification, or gene-drug interactions among others. In this paper we introduce EBIC.JL - an implementation of one of the most accurate biclustering algorithms in Julia, a modern highly parallelizable programming language for data science. We show that the new version maintains comparable accuracy to its predecessor EBIC while converging faster for the majority of the problems. We hope that this open source software in a high-level programming language will foster research in this promising field of bioinformatics and expedite development of new biclustering methods for big data.
Submitted 3 May, 2021;
originally announced May 2021.
-
PMLB v1.0: An open source dataset collection for benchmarking machine learning methods
Authors:
Joseph D. Romano,
Trang T. Le,
William La Cava,
John T. Gregg,
Daniel J. Goldberg,
Natasha L. Ray,
Praneel Chakraborty,
Daniel Himmelstein,
Weixuan Fu,
Jason H. Moore
Abstract:
Motivation: Novel machine learning and statistical modeling studies rely on standardized comparisons to existing methods using well-studied benchmark datasets. Few tools exist that provide rapid access to many of these datasets through a standardized, user-friendly interface that integrates well with popular data science workflows.
Results: This release of PMLB provides the largest collection of diverse, public benchmark datasets for evaluating new machine learning and data science methods aggregated in one location. v1.0 introduces a number of critical improvements developed following discussions with the open-source community.
Availability: PMLB is available at https://github.com/EpistasisLab/pmlb. Python and R interfaces for PMLB can be installed through the Python Package Index and Comprehensive R Archive Network, respectively.
Submitted 6 April, 2021; v1 submitted 30 November, 2020;
originally announced December 2020.
-
A Rigorous Machine Learning Analysis Pipeline for Biomedical Binary Classification: Application in Pancreatic Cancer Nested Case-control Studies with Implications for Bias Assessments
Authors:
Ryan J. Urbanowicz,
Pranshu Suri,
Yuhan Cui,
Jason H. Moore,
Karen Ruth,
Rachael Stolzenberg-Solomon,
Shannon M. Lynch
Abstract:
Machine learning (ML) offers a collection of powerful approaches for detecting and modeling associations, often applied to data having a large number of features and/or complex associations. Currently, there are many tools to facilitate implementing custom ML analyses (e.g. scikit-learn). Interest is also increasing in automated ML packages, which can make it easier for non-experts to apply ML and have the potential to improve model performance. ML permeates most subfields of biomedical research with varying levels of rigor and correct usage. Tremendous opportunities offered by ML are frequently offset by the challenge of assembling comprehensive analysis pipelines, and the ease of ML misuse. In this work we have laid out and assembled a complete, rigorous ML analysis pipeline focused on binary classification (i.e. case/control prediction), and applied this pipeline to both simulated and real world data. At a high level, this 'automated' but customizable pipeline includes a) exploratory analysis, b) data cleaning and transformation, c) feature selection, d) model training with 9 established ML algorithms, each with hyperparameter optimization, and e) thorough evaluation, including appropriate metrics, statistical analyses, and novel visualizations. This pipeline organizes the many subtle complexities of ML pipeline assembly to illustrate best practices to avoid bias and ensure reproducibility. Additionally, this pipeline is the first to compare established ML algorithms to 'ExSTraCS', a rule-based ML algorithm with the unique capability of interpretably modeling heterogeneous patterns of association. While designed to be widely applicable we apply this pipeline to an epidemiological investigation of established and newly identified risk factors for pancreatic cancer to evaluate how different sources of bias might be handled by ML algorithms.
Submitted 8 September, 2020; v1 submitted 28 August, 2020;
originally announced August 2020.
-
Benchmarking in Optimization: Best Practice and Open Issues
Authors:
Thomas Bartz-Beielstein,
Carola Doerr,
Daan van den Berg,
Jakob Bossek,
Sowmya Chandrasekaran,
Tome Eftimov,
Andreas Fischbach,
Pascal Kerschke,
William La Cava,
Manuel Lopez-Ibanez,
Katherine M. Malan,
Jason H. Moore,
Boris Naujoks,
Patryk Orzechowski,
Vanessa Volz,
Markus Wagner,
Thomas Weise
Abstract:
This survey compiles ideas and recommendations from more than a dozen researchers with different backgrounds and from different institutes around the world. Promoting best practice in benchmarking is its main goal. The article discusses eight essential topics in benchmarking: clearly stated goals, well-specified problems, suitable algorithms, adequate performance measures, thoughtful analysis, effective and efficient designs, comprehensible presentations, and guaranteed reproducibility. The final goal is to provide well-accepted guidelines (rules) that might be useful for authors and reviewers. As benchmarking in optimization is an active and evolving field of research this manuscript is meant to co-evolve over time by means of periodic updates.
Submitted 16 December, 2020; v1 submitted 7 July, 2020;
originally announced July 2020.
-
Is deep learning necessary for simple classification tasks?
Authors:
Joseph D. Romano,
Trang T. Le,
Weixuan Fu,
Jason H. Moore
Abstract:
Automated machine learning (AutoML) and deep learning (DL) are two cutting-edge paradigms used to solve a myriad of inductive learning tasks. In spite of their successes, little guidance exists for when to choose one approach over the other in the context of specific real-world problems. Furthermore, relatively few tools exist that allow the integration of both AutoML and DL in the same analysis to yield results combining both of their strengths. Here, we seek to address both of these issues, by (1.) providing a head-to-head comparison of AutoML and DL in the context of binary classification on 6 well-characterized public datasets, and (2.) evaluating a new tool for genetic programming-based AutoML that incorporates deep estimators. Our observations suggest that AutoML outperforms simple DL classifiers when trained on similar datasets for binary classification but integrating DL into AutoML improves classification performance even further. However, the substantial time needed to train AutoML+DL pipelines will likely outweigh performance advantages in many applications.
Submitted 11 June, 2020;
originally announced June 2020.
-
Genetic programming approaches to learning fair classifiers
Authors:
William La Cava,
Jason H. Moore
Abstract:
Society has come to rely on algorithms like classifiers for important decision making, giving rise to the need for ethical guarantees such as fairness. Fairness is typically defined by asking that some statistic of a classifier be approximately equal over protected groups within a population. In this paper, current approaches to fairness are discussed and used to motivate algorithmic proposals that incorporate fairness into genetic programming for classification. We propose two ideas. The first is to incorporate a fairness objective into multi-objective optimization. The second is to adapt lexicase selection to define cases dynamically over intersections of protected groups. We describe why lexicase selection is well suited to pressure models to perform well across the potentially infinitely many subgroups over which fairness is desired. We use a recent genetic programming approach to construct models on four datasets for which fairness constraints are necessary, and empirically compare performance to prior methods utilizing game-theoretic solutions. Methods are assessed based on their ability to generate trade-offs of subgroup fairness and accuracy that are Pareto optimal. The results show that genetic programming methods in general, and random search in particular, are well suited to this task.
Submitted 28 April, 2020;
originally announced April 2020.
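The abstract above assesses methods by whether their fairness/accuracy trade-offs are Pareto optimal. As a rough illustration of that criterion only (not the paper's evaluation code), the sketch below filters a set of candidate models down to the non-dominated ones. The function names and the (error, unfairness) scores are hypothetical, and both objectives are assumed to be minimized.

```python
def dominates(p, q):
    """True if p is no worse than q in every objective and strictly better in one.

    Both objectives are assumed to be minimized (e.g. error and unfairness).
    """
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def pareto_front(points):
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (error, unfairness) scores for five candidate classifiers.
models = [(0.10, 0.30), (0.15, 0.10), (0.20, 0.05), (0.18, 0.12), (0.30, 0.30)]
front = pareto_front(models)
print(front)
```

Here the fourth model is dropped because the second is better on both objectives, and the fifth is dropped because the first is at least as fair and more accurate.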
-
SGP-DT: Semantic Genetic Programming Based on Dynamic Targets
Authors:
Stefano Ruberto,
Valerio Terragni,
Jason H. Moore
Abstract:
Semantic GP is a promising approach that introduces semantic awareness during genetic evolution. This paper presents a new Semantic GP approach based on Dynamic Target (SGP-DT) that divides the search problem into multiple GP runs. The evolution in each run is guided by a new (dynamic) target based on the residual errors. To obtain the final solution, SGP-DT combines the solutions of each run using linear scaling. SGP-DT presents a new methodology to produce the offspring that does not rely on the classic crossover. The synergy between such a methodology and linear scaling yields final solutions with low approximation error and computational cost. We evaluate SGP-DT on eight well-known data sets and compare with ε-lexicase, a state-of-the-art evolutionary technique. SGP-DT achieves small RMSE values, on average 23.19% smaller than those of ε-lexicase.
Submitted 30 January, 2020;
originally announced January 2020.
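The abstract above relies on linear scaling to combine solutions. A minimal sketch of the standard least-squares form of linear scaling is given below, assuming it means choosing an intercept a and slope b so that a + b * output best fits the target. The function names and toy data are illustrative and are not taken from SGP-DT itself.

```python
from statistics import fmean


def linear_scale(outputs, targets):
    """Return (a, b) minimizing the squared error of a + b * output against targets."""
    mean_out, mean_tgt = fmean(outputs), fmean(targets)
    variance = sum((o - mean_out) ** 2 for o in outputs)
    if variance == 0.0:
        # A constant program can only be shifted to the target mean.
        return mean_tgt, 0.0
    covariance = sum((o - mean_out) * (t - mean_tgt) for o, t in zip(outputs, targets))
    b = covariance / variance
    a = mean_tgt - b * mean_out
    return a, b


def rmse(predictions, targets):
    """Root mean squared error between two equal-length sequences."""
    return (sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)) ** 0.5


# Toy example: the raw program output has the right shape but the wrong scale.
raw = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]
a, b = linear_scale(raw, y)
scaled = [a + b * o for o in raw]
print(a, b, rmse(raw, y), rmse(scaled, y))
```

The design point is that the search only has to find the shape of the function; the scale and offset are recovered in closed form.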
-
Evaluating recommender systems for AI-driven biomedical informatics
Authors:
William La Cava,
Heather Williams,
Weixuan Fu,
Steve Vitale,
Durga Srivatsan,
Jason H. Moore
Abstract:
Motivation: Many researchers with domain expertise are unable to easily apply machine learning to their bioinformatics data due to a lack of machine learning and/or coding expertise. Methods that have been proposed thus far to automate machine learning mostly require programming experience as well as expert knowledge to tune and apply the algorithms correctly. Here, we study a method of automating biomedical data science using a web-based platform that uses AI to recommend model choices and conduct experiments. We have two goals in mind: first, to make it easy to construct sophisticated models of biomedical processes; and second, to provide a fully automated AI agent that can choose and conduct promising experiments for the user, based on the user's experiments as well as prior knowledge. To validate this framework, we experiment with hundreds of classification problems, comparing to state-of-the-art, automated approaches. Finally, we use this tool to develop predictive models of septic shock in critical care patients.
Results: We find that matrix factorization-based recommendation systems outperform meta-learning methods for automating machine learning. This result mirrors the results of earlier recommender systems research in other domains. The proposed AI is competitive with state-of-the-art automated machine learning methods in terms of choosing optimal algorithm configurations for datasets. In our application to prediction of septic shock, the AI-driven analysis produces a competent machine learning model (AUROC 0.85 +/- 0.02) that performs on par with state-of-the-art deep learning results for this task, with much less computational effort.
Submitted 28 April, 2020; v1 submitted 22 May, 2019;
originally announced May 2019.
-
Semantic variation operators for multidimensional genetic programming
Authors:
William La Cava,
Jason H. Moore
Abstract:
Multidimensional genetic programming represents candidate solutions as sets of programs, and thereby provides an interesting framework for exploiting building block identification. Towards this goal, we investigate the use of machine learning as a way to bias which components of programs are promoted, and propose two semantic operators to choose where useful building blocks are placed during crossover. A forward stagewise crossover operator we propose leads to significant improvements on a set of regression problems, and produces state-of-the-art results in a large benchmark study. We discuss this architecture and others in terms of their propensity for allowing heuristic search to utilize information during the evolutionary process. Finally, we look at the collinearity and complexity of the data representations that result from these architectures, with a view towards disentangling factors of variation in application.
Submitted 17 April, 2019;
originally announced April 2019.
-
Interpretation of machine learning predictions for patient outcomes in electronic health records
Authors:
William La Cava,
Christopher Bauer,
Jason H. Moore,
Sarah A Pendergrass
Abstract:
Electronic health records are an increasingly important resource for understanding the interactions between patient health, environment, and clinical decisions. In this paper we report an empirical study of predictive modeling of several patient outcomes using three state-of-the-art machine learning methods. Our primary goal is to validate the models by interpreting the importance of predictors in the final models. Central to interpretation is the use of feature importance scores, which vary depending on the underlying methodology. In order to assess feature importance, we compared univariate statistical tests, information-theoretic measures, permutation testing, and normalized coefficients from multivariate logistic regression models. In general we found poor correlation between methods in their assessment of feature importance, even when their performance is comparable and relatively good. However, permutation tests applied to random forest and gradient boosting models showed the most agreement, and the importance scores matched the clinical interpretation most frequently.
Submitted 14 March, 2019;
originally announced March 2019.
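The abstract above compares several feature importance measures, including permutation testing. One common form of permutation-based importance can be sketched as follows: shuffle a single feature column, re-score the fitted model, and record the drop in accuracy. This is an assumption about the general technique, not the study's exact procedure; the toy model, data, and function names are hypothetical.

```python
import random


def accuracy(model, X, y):
    """Fraction of rows for which the model's prediction matches the label."""
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(model, X, y, n_repeats=20, seed=0):
    """Mean drop in accuracy when each feature column is shuffled independently."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the data with only column j permuted.
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(model, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances


# Toy setup: the label equals feature 0, and the "model" reads only feature 0.
X = [[i % 2, (i // 2) % 2] for i in range(40)]
y = [row[0] for row in X]
model = lambda row: row[0]
imp = permutation_importance(model, X, y)
print(imp)
```

Feature 1 receives an importance of exactly zero because the model never reads it, while shuffling feature 0 destroys roughly half of the correct predictions.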
-
EBIC: an open source software for high-dimensional and big data biclustering analyses
Authors:
Patryk Orzechowski,
Jason H. Moore
Abstract:
Motivation: In this paper we present the latest release of EBIC, a next-generation biclustering algorithm for mining genetic data. The major contribution of this paper is adding support for big data, making it possible to efficiently run large genomic data mining analyses. Additional enhancements include integration with R and Bioconductor and an option to remove the influence of missing values on the final result.
Results: EBIC was applied to datasets of different sizes, including a large DNA methylation dataset with 436,444 rows. For the largest dataset we observed over 6.6 fold speedup in computation time on a cluster of 8 GPUs compared to running the method on a single GPU. This proves high scalability of the algorithm.
Availability: The latest version of EBIC can be downloaded from http://github.com/EpistasisLab/ebic . Installation and usage instructions are also available online.
Submitted 25 July, 2018;
originally announced July 2018.
-
Learning concise representations for regression by evolving networks of trees
Authors:
William La Cava,
Tilak Raj Singh,
James Taggart,
Srinivas Suri,
Jason H. Moore
Abstract:
We propose and study a method for learning interpretable representations for the task of regression. Features are represented as networks of multi-type expression trees comprised of activation functions common in neural networks in addition to other elementary functions. Differentiable features are trained via gradient descent, and the performance of features in a linear model is used to weight the rate of change among subcomponents of each representation. The search process maintains an archive of representations with accuracy-complexity trade-offs to assist in generalization and interpretation. We compare several stochastic optimization approaches within this framework. We benchmark these variants on 100 open-source regression problems in comparison to state-of-the-art machine learning approaches. Our main finding is that this approach produces the highest average test scores across problems while producing representations that are orders of magnitude smaller than the next best performing method (gradient boosting). We also report a negative result in which attempts to directly optimize the disentanglement of the representation result in more highly correlated features.
Submitted 25 March, 2019; v1 submitted 3 July, 2018;
originally announced July 2018.
-
Gamorithm
Authors:
Moshe Sipper,
Jason H. Moore
Abstract:
Examining games from a fresh perspective we present the idea of game-inspired and game-based algorithms, dubbed "gamorithms".
Submitted 27 August, 2018; v1 submitted 7 June, 2018;
originally announced June 2018.
-
Where are we now? A large benchmark study of recent symbolic regression methods
Authors:
Patryk Orzechowski,
William La Cava,
Jason H. Moore
Abstract:
In this paper we provide a broad benchmarking of recent genetic programming approaches to symbolic regression in the context of state of the art machine learning approaches. We use a set of nearly 100 regression benchmark problems culled from open source repositories across the web. We conduct a rigorous benchmarking of four recent symbolic regression approaches as well as nine machine learning approaches from scikit-learn. The results suggest that symbolic regression performs strongly compared to state-of-the-art gradient boosting algorithms, although in terms of running times is among the slowest of the available methodologies. We discuss the results in detail and point to future research directions that may allow symbolic regression to gain wider adoption in the machine learning community.
Submitted 7 June, 2018; v1 submitted 24 April, 2018;
originally announced April 2018.
-
EBIC: an evolutionary-based parallel biclustering algorithm for pattern discovery
Authors:
Patryk Orzechowski,
Moshe Sipper,
Xiuzhen Huang,
Jason H. Moore
Abstract:
In this paper a novel biclustering algorithm based on artificial intelligence (AI) is introduced. The method, called EBIC, aims to detect biologically meaningful, order-preserving patterns in complex data. The proposed algorithm is probably the first one capable of discovering multiple complex patterns in real gene expression datasets with accuracy exceeding 50%. It is also one of the very few biclustering methods designed for parallel environments with multiple graphics processing units (GPUs). We demonstrate that EBIC outperforms state-of-the-art biclustering methods, in terms of recovery and relevance, on both synthetic and genetic datasets. EBIC also yields results over 12 times faster than the most accurate reference algorithms. The proposed algorithm is anticipated to be added to the repertoire of unsupervised machine learning algorithms for the analysis of datasets, including those from large-scale genomic studies.
Submitted 26 July, 2018; v1 submitted 9 January, 2018;
originally announced January 2018.
-
Benchmarking Relief-Based Feature Selection Methods for Bioinformatics Data Mining
Authors:
Ryan J. Urbanowicz,
Randal S. Olson,
Peter Schmitt,
Melissa Meeker,
Jason H. Moore
Abstract:
Modern biomedical data mining requires feature selection methods that can (1) be applied to large scale feature spaces (e.g. 'omics' data), (2) function in noisy problems, (3) detect complex patterns of association (e.g. gene-gene interactions), (4) be flexibly adapted to various problem domains and data types (e.g. genetic variants, gene expression, and clinical data) and (5) be computationally tractable. To that end, this work examines a set of filter-style feature selection algorithms inspired by the 'Relief' algorithm, i.e. Relief-Based algorithms (RBAs). We implement and expand these RBAs in an open source framework called ReBATE (Relief-Based Algorithm Training Environment). We apply a comprehensive genetic simulation study comparing existing RBAs, a proposed RBA called MultiSURF, and other established feature selection methods, over a variety of problems. The results of this study (1) support the assertion that RBAs are particularly flexible, efficient, and powerful feature selection methods that differentiate relevant features having univariate, multivariate, epistatic, or heterogeneous associations, (2) confirm the efficacy of expansions for classification vs. regression, discrete vs. continuous features, missing data, multiple classes, or class imbalance, (3) identify previously unknown limitations of specific RBAs, and (4) suggest that while MultiSURF* performs best for explicitly identifying pure 2-way interactions, MultiSURF yields the most reliable feature selection performance across a wide range of problem types.
Submitted 2 April, 2018; v1 submitted 22 November, 2017;
originally announced November 2017.
-
Relief-Based Feature Selection: Introduction and Review
Authors:
Ryan J. Urbanowicz,
Melissa Meeker,
William LaCava,
Randal S. Olson,
Jason H. Moore
Abstract:
Feature selection plays a critical role in biomedical data mining, driven by increasing feature dimensionality in target problems and growing interest in advanced but computationally expensive methodologies able to model complex associations. Specifically, there is a need for feature selection methods that are computationally efficient, yet sensitive to complex patterns of association, e.g. interactions, so that informative features are not mistakenly eliminated prior to downstream modeling. This paper focuses on Relief-based algorithms (RBAs), a unique family of filter-style feature selection algorithms that have gained appeal by striking an effective balance between these objectives while flexibly adapting to various data characteristics, e.g. classification vs. regression. First, this work broadly examines types of feature selection and defines RBAs within that context. Next, we introduce the original Relief algorithm and associated concepts, emphasizing the intuition behind how it works, how feature weights generated by the algorithm can be interpreted, and why it is sensitive to feature interactions without evaluating combinations of features. Lastly, we include an expansive review of RBA methodological research beyond Relief and its popular descendant, ReliefF. In particular, we characterize branches of RBA research, and provide comparative summaries of RBA algorithms including contributions, strategies, functionality, time complexity, adaptation to key data characteristics, and software availability.
Submitted 2 April, 2018; v1 submitted 22 November, 2017;
originally announced November 2017.
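The abstract above notes that Relief is sensitive to feature interactions without evaluating feature combinations. A minimal sketch of the original two-class Relief weight update (nearest hit and nearest miss) illustrates why. Two simplifications are assumptions made here for reproducibility: every instance is visited once rather than sampling a subset at random, and numeric differences are normalized by each feature's range. The XOR toy data set is hypothetical.

```python
def relief(X, y):
    """Relief feature weights for a two-class problem with numeric features.

    For each instance, find its nearest neighbour of the same class (hit) and of
    the other class (miss); reward features that differ on the miss and penalize
    features that differ on the hit.
    """
    n, p = len(X), len(X[0])
    # Range of each feature, used to normalize differences into [0, 1].
    spans = [(max(r[f] for r in X) - min(r[f] for r in X)) or 1.0 for f in range(p)]

    def diff(f, a, b):
        return abs(a[f] - b[f]) / spans[f]

    def dist(a, b):
        return sum(diff(f, a, b) for f in range(p))

    weights = [0.0] * p
    for i in range(n):
        hits = [j for j in range(n) if j != i and y[j] == y[i]]
        misses = [j for j in range(n) if y[j] != y[i]]
        hit = min(hits, key=lambda j: dist(X[i], X[j]))
        miss = min(misses, key=lambda j: dist(X[i], X[j]))
        for f in range(p):
            weights[f] += (diff(f, X[i], X[miss]) - diff(f, X[i], X[hit])) / n
    return weights


# Toy data: the class is the XOR of features 0 and 1; feature 2 is irrelevant.
X = [[a, b, c] for a in (0, 1) for b in (0, 1) for c in range(4)]
y = [row[0] ^ row[1] for row in X]
w = relief(X, y)
print(w)
```

Neither XOR feature is predictive on its own, yet both receive positive weights, because the nearest miss always differs in exactly one of them. The irrelevant feature is pushed below zero.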
-
A probabilistic and multi-objective analysis of lexicase selection and epsilon-lexicase selection
Authors:
William La Cava,
Thomas Helmuth,
Lee Spector,
Jason H. Moore
Abstract:
Lexicase selection is a parent selection method that considers training cases individually, rather than in aggregate, when performing parent selection. Whereas previous work has demonstrated the ability of lexicase selection to solve difficult problems in program synthesis and symbolic regression, the central goal of this paper is to develop the theoretical underpinnings that explain its performance. To this end, we derive an analytical formula that gives the expected probabilities of selection under lexicase selection, given a population and its behavior. In addition, we expand upon the relation of lexicase selection to many-objective optimization methods to describe the behavior of lexicase selection, which is to select individuals on the boundaries of Pareto fronts in high-dimensional space. We show analytically why lexicase selection performs more poorly for certain sizes of population and training cases, and show why it has been shown to perform more poorly in continuous error spaces. To address this last concern, we propose new variants of epsilon-lexicase selection, a method that modifies the pass condition in lexicase selection to allow near-elite individuals to pass cases, thereby improving selection performance with continuous errors. We show that epsilon-lexicase outperforms several diversity-maintenance strategies on a number of real-world and synthetic regression problems.
Submitted 29 April, 2018; v1 submitted 15 September, 2017;
originally announced September 2017.
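The selection procedure described in the abstract above can be sketched as follows: training cases are considered one at a time in random order, and only individuals that pass the current case stay in the pool. The sketch assumes lower error is better. For the epsilon variant it uses the median absolute deviation of the whole population's errors on each case as the tolerance, which is one commonly used choice and is an assumption here rather than the specific variants the paper proposes. The error matrix is a hypothetical toy example.

```python
import random
from statistics import median


def mad(values):
    """Median absolute deviation, used here as the per-case epsilon."""
    m = median(values)
    return median([abs(v - m) for v in values])


def lexicase_select(errors, rng, epsilon=None):
    """Select one parent index; errors[i][c] is individual i's error on case c.

    With epsilon=None this is plain lexicase (only elite individuals pass a case).
    With a callable epsilon, individuals within that tolerance of the best also pass.
    """
    cases = list(range(len(errors[0])))
    rng.shuffle(cases)
    pool = list(range(len(errors)))
    for c in cases:
        best = min(errors[i][c] for i in pool)
        tol = 0.0 if epsilon is None else epsilon([row[c] for row in errors])
        pool = [i for i in pool if errors[i][c] <= best + tol]
        if len(pool) == 1:
            break
    return rng.choice(pool)


# Individual 1 is near-elite on both cases but strictly best on neither.
errors = [[1.0, 5.0], [1.5, 2.0], [9.0, 1.0]]
rng = random.Random(0)
plain = {lexicase_select(errors, rng) for _ in range(200)}
relaxed = {lexicase_select(errors, rng, epsilon=mad) for _ in range(200)}
print(plain, relaxed)
```

On this toy population, plain lexicase only ever selects the two specialists, while the relaxed pass condition lets the near-elite generalist survive both cases, which is the behaviour the abstract attributes to continuous error spaces.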
-
Data-driven Advice for Applying Machine Learning to Bioinformatics Problems
Authors:
Randal S. Olson,
William La Cava,
Zairah Mustahsan,
Akshay Varik,
Jason H. Moore
Abstract:
As the bioinformatics field grows, it must keep pace not only with new data but with new algorithms. Here we contribute a thorough analysis of 13 state-of-the-art, commonly used machine learning algorithms on a set of 165 publicly available classification problems in order to provide data-driven algorithm recommendations to current researchers. We present a number of statistical and visual comparisons of algorithm performance and quantify the effect of model selection and algorithm tuning for each algorithm and dataset. The analysis culminates in the recommendation of five algorithms with hyperparameters that maximize classifier performance across the tested problems, as well as general guidelines for applying machine learning to supervised classification problems.
Submitted 7 January, 2018; v1 submitted 8 August, 2017;
originally announced August 2017.
-
Investigating the Parameter Space of Evolutionary Algorithms
Authors:
Moshe Sipper,
Weixuan Fu,
Karuna Ahuja,
Jason H. Moore
Abstract:
The practice of evolutionary algorithms involves the tuning of many parameters. How big should the population be? How many generations should the algorithm run? What is the (tournament selection) tournament size? What probabilities should one assign to crossover and mutation? Through an extensive series of experiments over multiple evolutionary algorithm implementations and problems we show that parameter space tends to be rife with viable parameters, at least for the 25 problems studied herein. We discuss the implications of this finding in practice.
Submitted 10 October, 2017; v1 submitted 13 June, 2017;
originally announced June 2017.
-
A System for Accessible Artificial Intelligence
Authors:
Randal S. Olson,
Moshe Sipper,
William La Cava,
Sharon Tartarone,
Steven Vitale,
Weixuan Fu,
Patryk Orzechowski,
Ryan J. Urbanowicz,
John H. Holmes,
Jason H. Moore
Abstract:
While artificial intelligence (AI) has become widespread, many commercial AI systems are not yet accessible to individual researchers or the general public due to the deep knowledge of the systems required to use them. We believe that AI has matured to the point where it should be an accessible technology for everyone. We present an ongoing project whose ultimate goal is to deliver an open source, user-friendly AI system that is specialized for machine learning analysis of complex data in the biomedical and health care domains. We discuss how genetic programming can aid in this endeavor, and highlight specific examples where genetic programming has automated machine learning analyses in previous projects.
Submitted 10 August, 2017; v1 submitted 1 May, 2017;
originally announced May 2017.
-
Ensemble representation learning: an analysis of fitness and survival for wrapper-based genetic programming methods
Authors:
William La Cava,
Jason H. Moore
Abstract:
Recently we proposed a general, ensemble-based feature engineering wrapper (FEW) that was paired with a number of machine learning methods to solve regression problems. Here, we adapt FEW for supervised classification and perform a thorough analysis of fitness and survival methods within this framework. Our tests demonstrate that two fitness metrics, one introduced as an adaptation of the silhouette score, outperform the more commonly used Fisher criterion. We analyze survival methods and demonstrate that $ε$-lexicase survival works best across our test problems, followed by random survival which outperforms both tournament and deterministic crowding. We conduct a benchmark comparison to several classification methods using a large set of problems and show that FEW can improve the best classifier performance in several cases. We show that FEW generates consistent, meaningful features for a biomedical problem with different ML pairings.
Submitted 3 August, 2017; v1 submitted 20 March, 2017;
originally announced March 2017.
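The $ε$-lexicase survival step mentioned in the abstract above can be sketched as follows. This is a hedged illustration of one common variant, not the FEW implementation: it assumes $ε$ is the median absolute deviation of the population's errors on each case, that a candidate passes a case if its error is within $ε$ of the best error in the current pool, and that survivors are drawn without replacement. The function names are hypothetical.

```python
import random
from statistics import median


def mad(values):
    """Median absolute deviation of a list of numbers."""
    m = median(values)
    return median([abs(v - m) for v in values])


def epsilon_lexicase_select(errors, rng):
    """Return the index of one selected individual.

    errors[i][t] is the error of individual i on training case t.
    """
    n_cases = len(errors[0])
    # Assumption: epsilon per case is the MAD of all errors on that case.
    eps = [mad([e[t] for e in errors]) for t in range(n_cases)]
    pool = list(range(len(errors)))
    cases = list(range(n_cases))
    rng.shuffle(cases)  # cases are considered in a random order
    for t in cases:
        best = min(errors[i][t] for i in pool)
        # Keep only candidates within epsilon of the best error on this case.
        pool = [i for i in pool if errors[i][t] <= best + eps[t]]
        if len(pool) == 1:
            break
    return rng.choice(pool)


def epsilon_lexicase_survival(errors, n_survivors, rng=None):
    """Pick n_survivors distinct individuals by repeated selection."""
    rng = rng or random.Random(0)
    remaining = list(range(len(errors)))
    survivors = []
    while remaining and len(survivors) < n_survivors:
        sub = [errors[i] for i in remaining]
        pick = remaining[epsilon_lexicase_select(sub, rng)]
        survivors.append(pick)
        remaining.remove(pick)
    return survivors
```

Because each selection event filters on individual cases rather than an aggregate score, individuals that do well on only a subset of cases can still survive, which is the usual motivation given for lexicase-style methods.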
-
PMLB: A Large Benchmark Suite for Machine Learning Evaluation and Comparison
Authors:
Randal S. Olson,
William La Cava,
Patryk Orzechowski,
Ryan J. Urbanowicz,
Jason H. Moore
Abstract:
The selection, development, or comparison of machine learning methods in data mining can be a difficult task based on the target problem and goals of a particular study. Numerous publicly available real-world and simulated benchmark datasets have emerged from different sources, but their organization and adoption as standards have been inconsistent. As such, selecting and curating specific benchmarks remains an unnecessary burden on machine learning practitioners and data scientists. The present study introduces an accessible, curated, and developing public benchmark resource to facilitate identification of the strengths and weaknesses of different machine learning methodologies. We compare meta-features among the current set of benchmark datasets in this resource to characterize the diversity of available data. Finally, we apply a number of established machine learning methods to the entire benchmark suite and analyze how datasets and algorithms cluster in terms of performance. This work is an important first step towards understanding the limitations of popular benchmarking suites and developing a resource that connects existing benchmarking standards to more diverse and efficient standards in the future.
Submitted 1 March, 2017;
originally announced March 2017.
-
Toward the automated analysis of complex diseases in genome-wide association studies using genetic programming
Authors:
Andrew Sohn,
Randal S. Olson,
Jason H. Moore
Abstract:
Machine learning has been gaining traction in recent years to meet the demand for tools that can efficiently analyze and make sense of the ever-growing databases of biomedical data in health care systems around the world. However, effectively using machine learning methods requires considerable domain expertise, which can be a barrier to entry for bioinformaticians new to computational data science methods. Therefore, off-the-shelf tools that make machine learning more accessible can prove invaluable for bioinformaticians. To this end, we have developed an open source pipeline optimization tool (TPOT-MDR) that uses genetic programming to automatically design machine learning pipelines for bioinformatics studies. In TPOT-MDR, we implement Multifactor Dimensionality Reduction (MDR) as a feature construction method for modeling higher-order feature interactions, and combine it with a new expert knowledge-guided feature selector for large biomedical data sets. We demonstrate TPOT-MDR's capabilities using a combination of simulated and real world data sets from human genetics and find that TPOT-MDR significantly outperforms modern machine learning methods such as logistic regression and eXtreme Gradient Boosting (XGBoost). We further analyze the best pipeline discovered by TPOT-MDR for a real world problem and highlight TPOT-MDR's ability to produce a high-accuracy solution that is also easily interpretable.
Submitted 6 February, 2017;
originally announced February 2017.
-
Identifying and Harnessing the Building Blocks of Machine Learning Pipelines for Sensible Initialization of a Data Science Automation Tool
Authors:
Randal S. Olson,
Jason H. Moore
Abstract:
As data science continues to grow in popularity, there will be an increasing need to make data science tools more scalable, flexible, and accessible. In particular, automated machine learning (AutoML) systems seek to automate the process of designing and optimizing machine learning pipelines. In this chapter, we present a genetic programming-based AutoML system called TPOT that optimizes a series of feature preprocessors and machine learning models with the goal of maximizing classification accuracy on a supervised classification problem. Further, we analyze a large database of pipelines that were previously used to solve various supervised classification problems and identify 100 short series of machine learning operations that appear the most frequently, which we call the building blocks of machine learning pipelines. We harness these building blocks to initialize TPOT with promising solutions, and find that this sensible initialization method significantly improves TPOT's performance on one benchmark without significantly degrading performance on the others. Thus, sensible initialization with machine learning pipeline building blocks shows promise for GP-based AutoML systems, and should be further refined in future work.
Submitted 29 July, 2016;
originally announced July 2016.
-
Evolution of active categorical image classification via saccadic eye movement
Authors:
Randal S. Olson,
Jason H. Moore,
Christoph Adami
Abstract:
Pattern recognition and classification is a central concern for modern information processing systems. In particular, one key challenge to image and video classification has been that the computational cost of image processing scales linearly with the number of pixels in the image or video. Here we present an intelligent machine (the "active categorical classifier," or ACC) that is inspired by the saccadic movements of the eye, and is capable of classifying images by selectively scanning only a portion of the image. We harness evolutionary computation to optimize the ACC on the MNIST hand-written digit classification task, and provide a proof-of-concept that the ACC works on noisy multi-class data. We further analyze the ACC and demonstrate its ability to classify images after viewing only a fraction of the pixels, and provide insight on future research paths to further improve upon the ACC presented here.
Submitted 16 June, 2016; v1 submitted 27 March, 2016;
originally announced March 2016.
-
Evaluation of a Tree-based Pipeline Optimization Tool for Automating Data Science
Authors:
Randal S. Olson,
Nathan Bartley,
Ryan J. Urbanowicz,
Jason H. Moore
Abstract:
As the field of data science continues to grow, there will be an ever-increasing demand for tools that make machine learning accessible to non-experts. In this paper, we introduce the concept of tree-based pipeline optimization for automating one of the most tedious parts of machine learning---pipeline design. We implement an open source Tree-based Pipeline Optimization Tool (TPOT) in Python and demonstrate its effectiveness on a series of simulated and real-world benchmark data sets. In particular, we show that TPOT can design machine learning pipelines that provide a significant improvement over a basic machine learning analysis while requiring little to no input or prior knowledge from the user. We also address the tendency for TPOT to design overly complex pipelines by integrating Pareto optimization, which produces compact pipelines without sacrificing classification accuracy. As such, this work represents an important step toward fully automating machine learning pipeline design.
Submitted 20 March, 2016;
originally announced March 2016.
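The idea of Pareto optimization over accuracy and pipeline complexity, as described in the abstract above, can be illustrated with a small sketch. This is not TPOT's actual code: the tuple layout, the use of operator count as the complexity measure, and the candidate pipelines are all assumptions made for the example.

```python
def pareto_front(pipelines):
    """Return the non-dominated pipelines.

    Each pipeline is a (name, accuracy, n_operators) tuple. Accuracy is
    maximized and the number of operators (a stand-in for complexity) is
    minimized.
    """
    def dominates(a, b):
        # a dominates b if it is no worse on both objectives
        # and strictly better on at least one.
        no_worse = a[1] >= b[1] and a[2] <= b[2]
        strictly_better = a[1] > b[1] or a[2] < b[2]
        return no_worse and strictly_better

    return [p for p in pipelines
            if not any(dominates(q, p) for q in pipelines)]


if __name__ == "__main__":
    # Hypothetical candidate pipelines: (name, accuracy, number of operators).
    candidates = [
        ("A", 0.90, 1),
        ("B", 0.95, 3),
        ("C", 0.95, 7),   # same accuracy as B but more complex
        ("D", 0.85, 2),   # worse and more complex than A
        ("E", 0.97, 10),
    ]
    for name, acc, size in pareto_front(candidates):
        print(name, acc, size)
```

Keeping only non-dominated pipelines means a larger pipeline survives only if it buys additional accuracy, which is how a Pareto criterion discourages needless complexity.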
-
Exploring the coevolution of predator and prey morphology and behavior
Authors:
Randal S. Olson,
Arend Hintze,
Fred C. Dyer,
Jason H. Moore,
Christoph Adami
Abstract:
A common idiom in biology education states, "Eyes in the front, the animal hunts. Eyes on the side, the animal hides." In this paper, we explore one possible explanation for why predators tend to have forward-facing, high-acuity visual systems. We do so using an agent-based computational model of evolution, where predators and prey interact and adapt their behavior and morphology to one another over successive generations of evolution. In this model, we observe a coevolutionary cycle between prey swarming behavior and the predator's visual system, where the predator and prey continually adapt their visual system and behavior, respectively, over evolutionary time in reaction to one another due to the well-known "predator confusion effect." Furthermore, we provide evidence that the predator visual system is what drives this coevolutionary cycle, and suggest that the cycle could be closed if the predator evolves a hybrid visual system capable of narrow, high-acuity vision for tracking prey as well as broad, coarse vision for prey discovery. Thus, the conflicting demands imposed on a predator's visual system by the predator confusion effect could have led to the evolution of complex eyes in many predators.
Submitted 28 February, 2016;
originally announced February 2016.
-
Automating biomedical data science through tree-based pipeline optimization
Authors:
Randal S. Olson,
Ryan J. Urbanowicz,
Peter C. Andrews,
Nicole A. Lavender,
La Creis Kidd,
Jason H. Moore
Abstract:
Over the past decade, data science and machine learning have grown from a mysterious art form to a staple tool across a variety of fields in academia, business, and government. In this paper, we introduce the concept of tree-based pipeline optimization for automating one of the most tedious parts of machine learning---pipeline design. We implement a Tree-based Pipeline Optimization Tool (TPOT) and demonstrate its effectiveness on a series of simulated and real-world genetic data sets. In particular, we show that TPOT can build machine learning pipelines that achieve competitive classification accuracy and discover novel pipeline operators---such as synthetic feature constructors---that significantly improve classification accuracy on these data sets. We also highlight the current challenges to pipeline optimization, such as the tendency to produce pipelines that overfit the data, and suggest future research paths to overcome these challenges. As such, this work represents an early step toward fully automating machine learning pipeline design.
Submitted 28 January, 2016;
originally announced January 2016.