A Comparison of Recent Algorithms for Symbolic Regression to Genetic Programming

Authors: Yousef A. Radwan, Gabriel Kronberger, Stephan Winkler

Abstract: Symbolic regression is a machine learning method with the goal to produce interpretable results. Unlike other machine learning methods such as random forests or neural networks, which are opaque, symbolic regression aims to model and map data in a way that can be understood by scientists. Recent advancements have attempted to bridge the gap between these two fields; new methodologies attempt to fuse the mapping power of neural networks and deep learning techniques with the explanatory power of symbolic regression. In this paper, we examine these new emerging systems and test the performance of an end-to-end transformer model for symbolic regression versus the reigning traditional methods based on genetic programming that have spearheaded symbolic regression throughout the years. We compare these systems on novel datasets to avoid bias towards older methods that were improved on well-known benchmark datasets. Our results show that traditional GP methods, as implemented e.g. by Operon, still remain superior to two recently published symbolic regression methods.

Submitted 5 June, 2024; originally announced June 2024.

arXiv:2404.17292 [pdf, other]

The Inefficiency of Genetic Programming for Symbolic Regression -- Extended Version

Authors: Gabriel Kronberger, Fabricio Olivetti de Franca, Harry Desmond, Deaglan J. Bartlett, Lukas Kammerer

Abstract: We analyse the search behaviour of genetic programming for symbolic regression in practically relevant but limited settings, allowing exhaustive enumeration of all solutions. This enables us to quantify the success probability of finding the best possible expressions, and to compare the search efficiency of genetic programming to random search in the space of semantically unique expressions. This analysis is made possible by improved algorithms for equality saturation, which we use to improve the Exhaustive Symbolic Regression algorithm; this produces the set of semantically unique expression structures, orders of magnitude smaller than the full symbolic regression search space. We compare the efficiency of random search in the set of unique expressions and genetic programming. For our experiments we use two real-world datasets where symbolic regression has been used to produce well-fitting univariate expressions: the Nikuradse dataset of flow in rough pipes and the Radial Acceleration Relation of galaxy dynamics. The results show that genetic programming in such limited settings explores only a small fraction of all unique expressions, and evaluates expressions repeatedly that are congruent to already visited expressions.

Submitted 26 April, 2024; originally announced April 2024.

Comments: This is an extended version of the article submitted to Parallel Problem Solving from Nature (PPSN) Conference 2024

arXiv:2311.15865 [pdf, other]

doi 10.1051/0004-6361/202348811

A precise symbolic emulator of the linear matter power spectrum

Authors: Deaglan J. Bartlett, Lukas Kammerer, Gabriel Kronberger, Harry Desmond, Pedro G. Ferreira, Benjamin D. Wandelt, Bogdan Burlacu, David Alonso, Matteo Zennaro

Abstract: Computing the matter power spectrum, $P(k)$, as a function of cosmological parameters can be prohibitively slow in cosmological analyses, hence emulating this calculation is desirable. Previous analytic approximations are insufficiently accurate for modern applications, so black-box, uninterpretable emulators are often used. We utilise an efficient genetic programming based symbolic regression framework to explore the space of potential mathematical expressions which can approximate the power spectrum and $σ_8$. We learn the ratio between an existing low-accuracy fitting function for $P(k)$ and that obtained by solving the Boltzmann equations and thus still incorporate the physics which motivated this earlier approximation. We obtain an analytic approximation to the linear power spectrum with a root mean squared fractional error of 0.2% between $k = 9\times10^{-3} - 9 \, h{\rm \, Mpc^{-1}}$ and across a wide range of cosmological parameters, and we provide physical interpretations for various terms in the expression. Our analytic approximation is 950 times faster to evaluate than camb and 36 times faster than the neural network based matter power spectrum emulator BACCO. We also provide a simple analytic approximation for $σ_8$ with a similar accuracy, with a root mean squared fractional error of just 0.1% when evaluated across the same range of cosmologies. This function is easily invertible to obtain $A_{\rm s}$ as a function of $σ_8$ and the other cosmological parameters, if preferred.
It is possible to obtain symbolic approximations to a seemingly complex function at a precision required for current and future cosmological analyses without resorting to deep-learning techniques, thus avoiding their black-box nature and large number of parameters. Our emulator will be usable long after the codes on which numerical approximations are built become outdated.

Submitted 15 April, 2024; v1 submitted 27 November, 2023; originally announced November 2023.

Comments: 9 pages, 5 figures. Accepted for publication in A&A

Journal ref: A&A 686, A209 (2024)

arXiv:2307.01238 [pdf, other]

Learning Difference Equations with Structured Grammatical Evolution for Postprandial Glycaemia Prediction

Authors: Daniel Parra, David Joedicke, J. Manuel Velasco, Gabriel Kronberger, J. Ignacio Hidalgo

Abstract: People with diabetes must carefully monitor their blood glucose levels, especially after eating. Blood glucose regulation requires a proper combination of food intake and insulin boluses. Glucose prediction is vital to avoid dangerous post-meal complications in treating individuals with diabetes. Although traditional methods, such as artificial neural networks, have shown high accuracy rates, sometimes they are not suitable for developing personalised treatments by physicians due to their lack of interpretability. In this study, we propose a novel glucose prediction method emphasising interpretability: Interpretable Sparse Identification by Grammatical Evolution. Combined with a previous clustering stage, our approach provides finite difference equations to predict postprandial glucose levels up to two hours after meals. We divide the dataset into four-hour segments and perform clustering based on blood glucose values for the two-hour window before the meal. Prediction models are trained for each cluster for the two-hour windows after meals, allowing predictions in 15-minute steps, yielding up to eight predictions at different time horizons. Prediction safety was evaluated based on Parkes Error Grid regions. Our technique produces safe predictions through explainable expressions, avoiding zones D (0.2% average) and E (0%) and reducing predictions on zone C (6.2%). In addition, our proposal has slightly better accuracy than other techniques, including sparse identification of non-linear dynamics and artificial neural networks.
The results demonstrate that our proposal provides interpretable solutions without sacrificing prediction accuracy, offering a promising approach to glucose prediction in diabetes management that balances accuracy, interpretability, and computational efficiency.

Submitted 3 July, 2023; originally announced July 2023.

arXiv:2212.10284 [pdf, other]

Steel Phase Kinetics Modeling using Symbolic Regression

Authors: David Piringer, Bernhard Bloder, Gabriel Kronberger

Abstract: We describe an approach for empirical modeling of steel phase kinetics based on symbolic regression and genetic programming. The algorithm takes processed data gathered from dilatometer measurements and produces a system of differential equations that models the phase kinetics. Our initial results demonstrate that the proposed approach allows identifying compact differential equations that fit the data. The model predicts ferrite, pearlite and bainite formation for a single steel type. Martensite is not yet included in the model. Future work shall incorporate martensite and generalize to multiple steel types with different chemical compositions.

Submitted 19 December, 2022; originally announced December 2022.

Comments: 24th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing (SYNASC 2022)

arXiv:2209.13852 [pdf, other]

doi 10.1007/978-3-031-25312-6_21

Identifying Differential Equations to predict Blood Glucose using Sparse Identification of Nonlinear Systems

Authors: David Jödicke, Daniel Parra, Gabriel Kronberger, Stephan Winkler

Abstract: Describing dynamic medical systems using machine learning is a challenging topic with a wide range of applications. In this work, the possibility of modeling the blood glucose level of diabetic patients purely on the basis of measured data is described. A combination of the influencing variables insulin and calories is used to find an interpretable model. The absorption speed of external substances in the human body depends strongly on external influences, which is why time-shifts are added for the influencing variables. The focus is put on identifying the best time-shifts that provide robust models with good prediction accuracy that are independent of other unknown external influences. The modeling is based purely on the measured data using Sparse Identification of Nonlinear Dynamics. A differential equation is determined which, starting from an initial value, simulates blood glucose dynamics. By applying the best model to test data, we can show that it is possible to simulate the long-term blood glucose dynamics using differential equations and few influencing variables.

Submitted 28 September, 2022; originally announced September 2022.

Comments: Submitted manuscript to be published in Computer Aided Systems Theory - EUROCAST 2022: 18th International Conference, Las Palmas de Gran Canaria, Feb. 2022

Journal ref: In: Moreno-Diaz, R., Pichler, F., Quesada-Arencibia, A. (eds) Computer Aided Systems Theory EUROCAST 2022. Lecture Notes in Computer Science, vol 13789

arXiv:2209.09675 [pdf, ps, other]

doi 10.1007/978-3-031-25312-6_16

Symbolic Regression with Fast Function Extraction and Nonlinear Least Squares Optimization

Authors: Lukas Kammerer, Gabriel Kronberger, Michael Kommenda

Abstract: Fast Function Extraction (FFX) is a deterministic algorithm for solving symbolic regression problems. We improve the accuracy of FFX by adding parameters to the arguments of nonlinear functions. Instead of only optimizing linear parameters, we optimize these additional nonlinear parameters with separable nonlinear least squares optimization using a variable projection algorithm. Both FFX and our new algorithm are applied to the PennML benchmark suite. We show that the proposed extensions of FFX lead to higher accuracy while providing models of similar length and with only a small increase in runtime on the given data. Our results are compared to a large set of regression methods that were already published for the given benchmark suite.

Submitted 20 September, 2022; originally announced September 2022.

Comments: Submitted manuscript to be published in Computer Aided Systems Theory - EUROCAST 2022: 18th International Conference, Las Palmas de Gran Canaria, Feb. 2022

Journal ref: In: Moreno-Diaz, R., Pichler, F., Quesada-Arencibia, A. (eds) Computer Aided Systems Theory EUROCAST 2022. Lecture Notes in Computer Science, vol 13789. Springer, Cham

arXiv:2209.09602 [pdf, other]

doi 10.1007/978-3-031-25312-6_17

Comparing Shape-Constrained Regression Algorithms for Data Validation

Authors: Florian Bachinger, Gabriel Kronberger

Abstract: Industrial and scientific applications handle large volumes of data that render manual validation by humans infeasible. Therefore, we require automated data validation approaches that are able to consider the prior knowledge of domain experts to produce dependable, trustworthy assessments of data quality. Prior knowledge is often available as rules that describe interactions of inputs with regard to the target, e.g. the target must be monotonically decreasing and convex over increasing input values. Domain experts are able to validate multiple such interactions at a glance. However, existing rule-based data validation approaches are unable to consider these constraints. In this work, we compare different shape-constrained regression algorithms for the purpose of data validation based on their classification accuracy and runtime performance.

Submitted 20 September, 2022; originally announced September 2022.

Comments: 9 pages, Submitted manuscript to be published in Computer Aided System Theory - EUROCAST 2022, Las Palmas de Gran Canaria, February 2022

Journal ref: In: Moreno-Diaz, R., Pichler, F., Quesada-Arencibia, A. (eds) Computer Aided Systems Theory EUROCAST 2022. Lecture Notes in Computer Science, vol 13789

arXiv:2209.06454 [pdf, other]

Prediction Intervals and Confidence Regions for Symbolic Regression Models based on Likelihood Profiles

Authors: Fabricio Olivetti de Franca, Gabriel Kronberger

Abstract: Symbolic regression is a nonlinear regression method which is commonly performed by an evolutionary computation method such as genetic programming. Quantification of uncertainty of regression models is important for the interpretation of models and for decision making. The linear approximation and so-called likelihood profiles are well-known possibilities for the calculation of confidence and prediction intervals for nonlinear regression models. These simple and effective techniques have been completely ignored so far in the genetic programming literature. In this work we describe the calculation of likelihood profiles in detail and also provide some illustrative examples with models created with three different symbolic regression algorithms on two different datasets. The examples highlight the importance of the likelihood profiles to understand the limitations of symbolic regression models and to help the user take an informed post-prediction decision.

Submitted 14 September, 2022; originally announced September 2022.

Comments: 10 pages, 8 figures, 1 table, 3 algorithms. Submitted to IEEE Transactions on Evolutionary Computation

arXiv:2209.00942 [pdf, other]

doi 10.1109/SYNASC57785.2022.00055

Local Optimization Often is Ill-conditioned in Genetic Programming for Symbolic Regression

Authors: Gabriel Kronberger

Abstract: Gradient-based local optimization has been shown to improve results of genetic programming (GP) for symbolic regression. Several state-of-the-art GP implementations use iterative nonlinear least squares (NLS) algorithms such as the Levenberg-Marquardt algorithm for local optimization. The effectiveness of NLS algorithms depends on appropriate scaling and conditioning of the optimization problem. This has so far been ignored in symbolic regression and GP literature. In this study we use a singular value decomposition of NLS Jacobian matrices to determine the numeric rank and the condition number. We perform experiments with a GP implementation and six different benchmark datasets. Our results show that rank-deficient and ill-conditioned Jacobian matrices occur frequently and for all datasets. The issue is less extreme when restricting GP tree size and when using many non-linear functions in the function set.

Submitted 2 September, 2022; originally announced September 2022.

Comments: Submitted to International Symposium on Symbolic and Numeric Algorithms for Scientific Computing 2022 https://synasc.ro/

arXiv:2206.06422 [pdf, other]

doi 10.1007/978-981-19-8460-0_1

Symbolic Regression in Materials Science: Discovering Interatomic Potentials from Data

Authors: Bogdan Burlacu, Michael Kommenda, Gabriel Kronberger, Stephan Winkler, Michael Affenzeller

Abstract: Particle-based modeling of materials at atomic scale plays an important role in the development of new materials and understanding of their properties. The accuracy of particle simulations is determined by interatomic potentials, which allow calculating the potential energy of an atomic system as a function of atomic coordinates and potentially other properties. First-principles-based ab initio potentials can reach arbitrary levels of accuracy; however, their applicability is limited by their high computational cost. Machine learning (ML) has recently emerged as an effective way to offset the high computational costs of ab initio atomic potentials by replacing expensive models with highly efficient surrogates trained on electronic structure data. Among a plethora of current methods, symbolic regression (SR) is gaining traction as a powerful "white-box" approach for discovering functional forms of interatomic potentials. This contribution discusses the role of symbolic regression in Materials Science (MS) and offers a comprehensive overview of current methodological challenges and state-of-the-art results. A genetic programming-based approach for modeling atomic potentials from raw data (consisting of snapshots of atomic positions and associated potential energy) is presented and empirically validated on ab initio electronic structure data.

Submitted 21 July, 2022; v1 submitted 13 June, 2022; originally announced June 2022.

Comments: Submitted to the GPTP XIX Workshop, June 2-4 2022, University of Michigan, Ann Arbor, Michigan

arXiv:2110.00415 [pdf, other]

doi 10.1007/978-3-319-74718-7_47

Optimization Networks for Integrated Machine Learning

Authors: Michael Kommenda, Johannes Karder, Andreas Beham, Bogdan Burlacu, Gabriel Kronberger, Stefan Wagner, Michael Affenzeller

Abstract: Optimization networks are a new methodology for holistically solving interrelated problems that have been developed with combinatorial optimization problems in mind. In this contribution we revisit the core principles of optimization networks and demonstrate their suitability for solving machine learning problems. We use feature selection in combination with linear model creation as a benchmark application and compare the results of optimization networks to ordinary least squares with optional elastic net regularization. Based on this example we justify the advantages of optimization networks by adapting the network to solve other machine learning problems. Finally, optimization analysis is presented, where optimal input values of a system have to be found to achieve desired output values. Optimization analysis can be divided into three subproblems: model creation to describe the system, model selection to choose the most appropriate one and parameter optimization to obtain the input values. Therefore, optimization networks are an obvious choice for handling optimization analysis tasks.

Submitted 1 September, 2021; originally announced October 2021.

Comments: International Conference on Computer Aided Systems Theory, Eurocast 2017, pp 392-399

Journal ref: In: Moreno-Díaz R. et al (eds) Computer Aided Systems Theory, Eurocast 2017. Lecture Notes in Computer Science, Vol. 10671. Springer (2018)

arXiv:2109.13898 [pdf, ps, other]

doi 10.1007/978-3-030-04735-1_5

Cluster Analysis of a Symbolic Regression Search Space

Authors: Gabriel Kronberger, Lukas Kammerer, Bogdan Burlacu, Stephan M. Winkler, Michael Kommenda, Michael Affenzeller

Abstract: In this chapter we take a closer look at the distribution of symbolic regression models generated by genetic programming in the search space. The motivation for this work is to improve the search for well-fitting symbolic regression models by using information about the similarity of models that can be precomputed independently from the target function. For our analysis, we use a restricted grammar for uni-variate symbolic regression models and generate all possible models up to a fixed length limit. We identify unique models and cluster them based on phenotypic as well as genotypic similarity. We find that phenotypic similarity leads to well-defined clusters while genotypic similarity does not produce a clear clustering. By mapping solution candidates visited by GP to the enumerated search space we find that GP initially explores the whole search space and later converges to the subspace of highest quality expressions in a run for a simple benchmark problem.

Submitted 28 September, 2021; originally announced September 2021.

Comments: Genetic Programming Theory and Practice XVI. Genetic and Evolutionary Computation. Springer

Journal ref: In: Banzhaf W. et al (eds) Genetic Programming Theory and Practice XVI. Genetic and Evolutionary Computation. Springer, Cham. pp 85-102 (2019)

arXiv:2109.13895 [pdf, ps, other]

doi 10.1007/978-3-030-39958-0_5

Symbolic Regression by Exhaustive Search: Reducing the Search Space Using Syntactical Constraints and Efficient Semantic Structure Deduplication

Authors: Lukas Kammerer, Gabriel Kronberger, Bogdan Burlacu, Stephan M. Winkler, Michael Kommenda, Michael Affenzeller

Abstract: Symbolic regression is a powerful system identification technique in industrial scenarios where no prior knowledge on model structure is available. Such scenarios often require specific model properties such as interpretability, robustness, trustworthiness and plausibility, that are not easily achievable using standard approaches like genetic programming for symbolic regression. In this chapter we introduce a deterministic symbolic regression algorithm specifically designed to address these issues. The algorithm uses a context-free grammar to produce models that are parameterized by a non-linear least squares local optimization procedure. A finite enumeration of all possible models is guaranteed by structural restrictions as well as a caching mechanism for detecting semantically equivalent solutions. Enumeration order is established via heuristics designed to improve search efficiency. Empirical tests on a comprehensive benchmark suite show that our approach is competitive with genetic programming in many noiseless problems while maintaining desirable properties such as simple, reliable models and reproducibility.

Submitted 28 September, 2021; originally announced September 2021.

Comments: Genetic and Evolutionary Computation

Journal ref: In: Banzhaf W. et al (eds) Genetic Programming Theory and Practice XVII, pp 79-99 (2020)

arXiv:2109.00238 [pdf, ps, other]

doi 10.1007/978-3-319-27340-2_51

Complexity Measures for Multi-objective Symbolic Regression

Authors: Michael Kommenda, Andreas Beham, Michael Affenzeller, Gabriel Kronberger

Abstract: Multi-objective symbolic regression has the advantage that while the accuracy of the learned models is maximized, the complexity is automatically adapted and need not be specified a priori. The result of the optimization is not a single solution anymore, but a whole Pareto-front describing the trade-off between accuracy and complexity. In this contribution we study which complexity measures are most appropriately used in symbolic regression when performing multi-objective optimization with NSGA-II. Furthermore, we present a novel complexity measure that includes semantic information based on the function symbols occurring in the models and test its effects on several benchmark datasets. Results comparing multiple complexity measures are presented in terms of the achieved accuracy and model length to illustrate how the search direction of the algorithm is affected.

Submitted 1 September, 2021; originally announced September 2021.

Comments: International Conference on Computer Aided Systems Theory, Eurocast 2015, pp 409-416

Journal ref: In: Moreno-Díaz R. et al (eds) Computer Aided Systems Theory, EUROCAST 2015. Lecture Notes in Computer Science, vol 9520 (2015)

arXiv:2108.10660 [pdf, other]

doi 10.1007/978-3-030-45093-9_46

Data Aggregation for Reducing Training Data in Symbolic Regression

Authors: Lukas Kammerer, Gabriel Kronberger, Michael Kommenda

Abstract: The growing volume of data makes the use of computationally intense machine learning techniques such as symbolic regression with genetic programming more and more impractical. This work discusses methods to reduce the training data and thereby also the runtime of genetic programming. The data is aggregated in a preprocessing step before running the actual machine learning algorithm. K-means clustering and data binning are used for data aggregation and compared with random sampling as the simplest data reduction method. We analyze the achieved speed-up in training and the effects on the trained models' test accuracy for every method on four real-world data sets. The performance of genetic programming is compared with random forests and linear regression. It is shown that k-means and random sampling lead to very small loss in test accuracy when the data is reduced down to only 30% of the original data, while the speed-up is proportional to the size of the data set. Binning, on the contrary, leads to models with very high test error.

Submitted 24 August, 2021; originally announced August 2021.

Comments: International Conference on Computer Aided Systems Theory 2015, pp 378-386

Journal ref: In: Moreno-Díaz R. et al (eds) Computer Aided Systems Theory, Eurocast 2019. Lecture Notes in Computer Science, Vol. 12013. Springer (2020)

arXiv:2108.03274 [pdf, other]

doi 10.1007/978-3-319-27340-2_47

Smooth Symbolic Regression: Transformation of Symbolic Regression into a Real-valued Optimization Problem

Authors: Erik Pitzer, Gabriel Kronberger

Abstract: The typical methods for symbolic regression produce rather abrupt changes in solution candidates. In this work, we have tried to transform symbolic regression from an optimization problem, with a landscape that is so rugged that typical analysis methods do not produce meaningful results, to one that can be compared to typical and very smooth real-valued problems. While the ruggedness might not interfere with the performance of optimization, it restricts the possibilities of analysis. Here, we have explored different aspects of a transformation and propose a simple procedure to create real-valued optimization problems from symbolic regression problems.

Submitted 6 August, 2021; originally announced August 2021.

Journal ref: Pitzer E. et al., Smooth Symbolic Regression: Transformation of Symbolic Regression into a Real-Valued Optimization Problem. In: Moreno-Díaz R. et al. (eds) Computer Aided Systems Theory. Lecture Notes in Computer Science, Vol. 9520, 2015

arXiv:2108.03273 [pdf, other]

doi 10.1007/978-3-030-45093-9_36

Concept Drift Detection with Variable Interaction Networks

Authors: Jan Zenisek, Gabriel Kronberger, Josef Wolfartsberger, Norbert Wild, Michael Affenzeller

Abstract: The current development of today's production industry towards seamless sensor-based monitoring is paving the way for concepts such as Predictive Maintenance. By this means, the condition of plants and products in future production lines will be continuously analyzed with the objective to predict any kind of breakdown and trigger preventing actions proactively. Such ambitious predictions are commonly performed with support of machine learning algorithms. In this work, we utilize these algorithms to model complex systems, such as production plants, by focusing on their variable interactions. The core of this contribution is a sliding window based algorithm, designed to detect changes of the identified interactions, which might indicate beginning malfunctions in the context of a monitored production plant. Besides a detailed description of the algorithm, we present results from experiments with a synthetic dynamical system, simulating stable and drifting system behavior.

Submitted 6 August, 2021; originally announced August 2021.

Comments: International Conference on Computer Aided Systems Theory, Eurocast 2019, pp. 296-303

Journal ref: In: Moreno-Díaz R. et al (eds) Computer Aided Systems Theory, Eurocast 2019. Lecture Notes in Computer Science, Vol. 12013. Springer, Cham

arXiv:2108.01595 [pdf, other]

Extending a Physics-Based Constitutive Model using Genetic Programming

Authors: Gabriel Kronberger, Evgeniya Kabliman, Johannes Kronsteiner, Michael Kommenda

Abstract: In material science, models are derived to predict emergent material properties (e.g. elasticity, strength, conductivity) and their relations to processing conditions. A major drawback is the calibration of model parameters that depend on processing conditions. Currently, these parameters must be optimized to fit measured data since their relations to processing conditions (e.g. deformation temperature, strain rate) are not fully understood. We present a new approach that identifies the functional dependency of calibration parameters from processing conditions based on genetic programming. We propose two (explicit and implicit) methods to identify these dependencies and generate short interpretable expressions. The approach is used to extend a physics-based constitutive model for deformation processes. This constitutive model operates with internal material variables such as a dislocation density and contains a number of parameters, among them three calibration parameters. The derived expressions extend the constitutive model and replace the calibration parameters. Thus, interpolation between various processing parameters is enabled. Our results show that the implicit method is computationally more expensive than the explicit approach but also produces significantly better results.

Submitted 19 November, 2021; v1 submitted 3 August, 2021; originally announced August 2021.

Comments: Preprint submitted to Applications in Engineering Sciences

MSC Class: 68T05; 74-04; 74-05; 74-10; 74C99

arXiv:2107.13821 [pdf, other]

doi 10.1007/978-3-030-45093-9_32

Concept for a Technical Infrastructure for Management of Predictive Models in Industrial Applications

Authors: Florian Bachinger, Gabriel Kronberger

Abstract: With the increasing number of created and deployed prediction models and the complexity of machine learning workflows, we require so-called model management systems to support data scientists in their tasks. In this work we describe our technological concept for such a model management system. This concept includes versioned storage of data, support for different machine learning algorithms, fine tuning of models, subsequent deployment of models and monitoring of model performance after deployment. We describe this concept with a close focus on model lifecycle requirements stemming from our industry application cases, but generalize key features that are relevant for all applications of machine learning.

Submitted 29 July, 2021; originally announced July 2021.

Comments: International Conference on Computer Aided Systems Theory, Eurocast 2019, pp 263-270

Journal ref: In: Moreno-Díaz R. et al (eds) Computer Aided Systems Theory. Lecture Notes in Computer Science, Vol. 12013 (2020)

arXiv:2107.10640 [pdf, other]

doi 10.1007/978-3-030-45093-9_44

Hash-Based Tree Similarity and Simplification in Genetic Programming for Symbolic Regression

Authors: Bogdan Burlacu, Lukas Kammerer, Michael Affenzeller, Gabriel Kronberger

Abstract: We introduce in this paper a runtime-efficient tree hashing algorithm for the identification of isomorphic subtrees, with two important applications in genetic programming for symbolic regression: fast, online calculation of population diversity and algebraic simplification of symbolic expression trees. Based on this hashing approach, we propose a simple diversity-preservation mechanism with promising results on a collection of symbolic regression benchmark problems.

Submitted 22 July, 2021; originally announced July 2021.

Comments: International Conference on Computer Aided Systems Theory, EUROCAST 2019

Journal ref: In: Moreno-Díaz R. et al. Computer Aided Systems Theory. Lecture Notes in Computer Science, Vol. 12013. Springer, 2020, pp 361-369

arXiv:2107.09484 [pdf, ps, other]

doi 10.1145/3205455.3205522

Predicting Friction System Performance with Symbolic Regression and Genetic Programming with Factor Variables

Authors: Gabriel Kronberger, Michael Kommenda, Andreas Promberger, Falk Nickel

Abstract: Friction systems are mechanical systems wherein friction is used for force transmission (e.g. mechanical braking systems or automatic gearboxes). For finding optimal and safe design parameters, engineers have to predict friction system performance. This is especially difficult in real-world applications, because it is affected by many parameters. We have used symbolic regression and genetic programming for finding accurate and trustworthy prediction models for this task. However, it is not straightforward how nominal variables can be included. In particular, a one-hot encoding is unsatisfactory because genetic programming tends to remove such indicator variables. We have therefore used so-called factor variables for representing nominal variables in symbolic regression models. Our results show that GP is able to produce symbolic regression models for predicting friction performance with predictive accuracy that is comparable to artificial neural networks. The symbolic regression models with factor variables are less complex than models using a one-hot encoding.

Submitted 19 July, 2021; originally announced July 2021.

Comments: Genetic and Evolutionary Computation Conference (GECCO), July 15th-19th, 2018

Journal ref: In Proceedings of the Genetic and Evolutionary Computation Conference, pp. 1278-1285. ACM. (July 2018)

arXiv:2107.09458 [pdf, other]

Using Shape Constraints for Improving Symbolic Regression Models

Authors: Christian Haider, Fabricio Olivetti de França, Bogdan Burlacu, Gabriel Kronberger

Abstract: We describe and analyze algorithms for shape-constrained symbolic regression, which allows the inclusion of prior knowledge about the shape of the regression function. This is relevant in many areas of engineering -- in particular whenever a data-driven model obtained from measurements must have certain properties (e.g. positivity, monotonicity or convexity/concavity). We implement shape constraints using a soft-penalty approach which uses multi-objective algorithms to minimize constraint violations and training error. We use the non-dominated sorting genetic algorithm (NSGA-II) as well as the multi-objective evolutionary algorithm based on decomposition (MOEA/D). We use a set of models from physics textbooks to test the algorithms and compare against earlier results with single-objective algorithms. The results show that all algorithms are able to find models which conform to all shape constraints. Using shape constraints helps to improve extrapolation behavior of the models.

Submitted 20 July, 2021; originally announced July 2021.

Comments: 33 pages, 6 figures

arXiv:2107.06131 [pdf, ps, other]

doi 10.1007/978-3-030-45093-9

Identification of Dynamical Systems using Symbolic Regression

Authors: Gabriel Kronberger, Lukas Kammerer, Michael Kommenda

Abstract: We describe a method for the identification of models for dynamical systems from observational data. The method is based on the concept of symbolic regression and uses genetic programming to evolve a system of ordinary differential equations (ODE). The novelty is that we add a step of gradient-based optimization of the ODE parameters. For this we calculate the sensitivities of the solution to the initial value problem (IVP) using automatic differentiation. The proposed approach is tested on a set of 19 problem instances taken from the literature which includes datasets from simulated systems as well as datasets captured from mechanical systems. We find that gradient-based optimization of parameters improves predictive accuracy of the models. The best results are obtained when we first fit the individual equations to the numeric differences and then subsequently fine-tune the identified parameter values by fitting the IVP solution to the observed variable values.

Submitted 6 July, 2021; originally announced July 2021.

Comments: The final authenticated publication is available online at https://doi.org/10.1007/978-3-030-45093-9

Journal ref: In Computer Aided Systems Theory - EUROCAST 2019, Series Volume 12013, pp. 370-377. Springer. (2020)

arXiv:2103.15624 [pdf, other]

doi 10.1162/evco_a_00294

Shape-constrained Symbolic Regression -- Improving Extrapolation with Prior Knowledge

Authors: Gabriel Kronberger, Fabricio Olivetti de França, Bogdan Burlacu, Christian Haider, Michael Kommenda

Abstract: We investigate the addition of constraints on the function image and its derivatives for the incorporation of prior knowledge in symbolic regression. The approach is called shape-constrained symbolic regression and allows us to enforce e.g. monotonicity of the function over selected inputs. The aim is to find models which conform to expected behaviour and which have improved extrapolation capabilities. We demonstrate the feasibility of the idea and propose and compare two evolutionary algorithms for shape-constrained symbolic regression: i) an extension of tree-based genetic programming which discards infeasible solutions in the selection step, and ii) a two-population evolutionary algorithm that separates the feasible from the infeasible solutions. In both algorithms we use interval arithmetic to approximate bounds for models and their partial derivatives. The algorithms are tested on a set of 19 synthetic and four real-world regression problems. Both algorithms are able to identify models which conform to shape constraints, which is not the case for the unmodified symbolic regression algorithms. However, the predictive accuracy of models with constraints is worse on the training set and the test set. Shape-constrained polynomial regression produces the best results for the test set but also significantly larger models.

Submitted 31 May, 2021; v1 submitted 29 March, 2021; originally announced March 2021.

arXiv:1902.00882 [pdf, other]

doi 10.1109/CEC.2019.8790162

Online Diversity Control in Symbolic Regression via a Fast Hash-based Tree Similarity Measure

Authors: Bogdan Burlacu, Michael Affenzeller, Gabriel Kronberger, Michael Kommenda

Abstract: Diversity represents an important aspect of genetic programming, being directly correlated with search performance. When considered at the genotype level, diversity often requires expensive tree distance measures which have a negative impact on the algorithm's runtime performance. In this work we introduce a fast, hash-based tree distance measure to massively speed up the calculation of population diversity during the algorithmic run. We combine this measure with the standard GA and the NSGA-II genetic algorithms to steer the search towards higher diversity. We validate the approach on a collection of benchmark problems for symbolic regression where our method consistently outperforms the standard GA as well as NSGA-II configurations with different secondary objectives.

Submitted 3 February, 2019; originally announced February 2019.

Comments: 8 pages, conference, submitted to congress on evolutionary computation

arXiv:1309.5931 [pdf, ps, other]

doi 10.1007/978-3-642-27549-4_51

Data Mining using Unguided Symbolic Regression on a Blast Furnace Dataset

Authors: Michael Kommenda, Gabriel Kronberger, Christoph Feilmayr, Michael Affenzeller

Abstract: In this paper a data mining approach for variable selection and knowledge extraction from datasets is presented. The approach is based on unguided symbolic regression (every variable present in the dataset is treated as the target variable in multiple regression runs) and a novel variable relevance metric for genetic programming. The relevance of each input variable is calculated and a model approximating the target variable is created. The genetic programming configurations with different target variables are executed multiple times to reduce stochastic effects and the aggregated results are displayed as a variable interaction network. This interaction network highlights important system components and implicit relations between the variables. The whole approach is tested on a blast furnace dataset, because of the complexity of the blast furnace and the many interrelations between the variables. Finally the achieved results are discussed with respect to existing knowledge about the blast furnace process.

Submitted 23 September, 2013; originally announced September 2013.

Comments: Presented at Workshop for Heuristic Problem Solving, Computer Aided Systems Theory - EUROCAST 2011. The final publication is available at http://link.springer.com/chapter/10.1007/978-3-642-27549-4_51

Journal ref: Computer Aided Systems Theory - EUROCAST 2011, Lecture Notes in Computer Science Volume 6927, 2012, pp 400-407

arXiv:1309.5896 [pdf, ps, other]

doi 10.1007/978-3-642-04772-5_102

On the Success Rate of Crossover Operators for Genetic Programming with Offspring Selection

Authors: Gabriel Kronberger, Stephan Winkler, Michael Affenzeller, Andreas Beham, Stefan Wagner

Abstract: Genetic programming is a powerful heuristic search technique that is used for a number of real world applications to solve, among others, regression, classification, and time-series forecasting problems. A lot of progress towards a theoretic description of genetic programming in the form of schema theorems has been made, but the internal dynamics and success factors of genetic programming are still not fully understood. In particular, the effects of different crossover operators in combination with offspring selection are largely unknown. This contribution sheds light on the ability of well-known GP crossover operators to create better offspring when applied to benchmark problems. We conclude that standard (sub-tree swapping) crossover is a good default choice in combination with offspring selection, and that GP with offspring selection and random selection of crossover operators can improve the performance of the algorithm in terms of best solution quality when no solution size constraints are applied.

Submitted 23 September, 2013; originally announced September 2013.

Comments: The final publication is available at http://link.springer.com/chapter/10.1007/978-3-642-04772-5_102

Journal ref: Computer Aided Systems Theory - EUROCAST 2009, Lecture Notes in Computer Science Volume 5717, 2009, pp 793-800, Springer

arXiv:1306.0202 [pdf, ps, other]

Declarative Modeling and Bayesian Inference of Dark Matter Halos

Authors: Gabriel Kronberger

Abstract: Probabilistic programming allows specification of probabilistic models in a declarative manner. Recently, several new software systems and languages for probabilistic programming have been developed on the basis of newly developed and improved methods for approximate inference in probabilistic models. In this contribution a probabilistic model for an idealized dark matter localization problem is described. We first derive the probabilistic model for the inference of dark matter locations and masses, and then show how this model can be implemented using BUGS and Infer.NET, two software systems for probabilistic programming. Finally, the different capabilities of both systems are discussed. The presented dark matter model includes mainly non-conjugate factors; thus, it is difficult to implement this model with Infer.NET.

Submitted 2 June, 2013; originally announced June 2013.

Comments: Presented at the Workshop "Intelligent Information Processing", EUROCAST2013. To appear in selected papers of Computer Aided Systems Theory - EUROCAST 2013; Volumes Editors: Roberto Moreno-Díaz, Franz R. Pichler, Alexis Quesada-Arencibia; LNCS Springer

arXiv:1305.3794 [pdf, ps, other]

Evolution of Covariance Functions for Gaussian Process Regression using Genetic Programming

Authors: Gabriel Kronberger, Michael Kommenda

Abstract: In this contribution we describe an approach to evolve composite covariance functions for Gaussian processes using genetic programming. A critical aspect of Gaussian processes and similar kernel-based models such as SVM is that the covariance function should be adapted to the modeled data. Frequently, the squared exponential covariance function is used as a default. However, this can lead to a misspecified model, which does not fit the data well. In the proposed approach we use a grammar for the composition of covariance functions and genetic programming to search over the space of sentences that can be derived from the grammar. We tested the proposed approach on synthetic data from two-dimensional test functions, and on the Mauna Loa CO2 time series. The results show that our approach is feasible, finding covariance functions that perform much better than a default covariance function. For the CO2 data set a composite covariance function is found that matches the performance of a hand-tuned covariance function.

Submitted 22 May, 2013; v1 submitted 16 May, 2013; originally announced May 2013.

Comments: Presented at the Workshop "Theory and Applications of Metaheuristic Algorithms", EUROCAST2013. To appear in selected papers of Computer Aided Systems Theory - EUROCAST 2013; Volumes Editors: Roberto Moreno-Díaz, Franz R. Pichler, Alexis Quesada-Arencibia; LNCS Springer

arXiv:1212.2044 [pdf, ps, other]

doi 10.1007/978-3-642-20520-0_11

Macro-Economic Time Series Modeling and Interaction Networks

Authors: Gabriel Kronberger, Stefan Fink, Michael Kommenda, Michael Affenzeller

Abstract: Macro-economic models describe the dynamics of economic quantities. The estimations and forecasts produced by such models play a substantial role for financial and political decisions. In this contribution we describe an approach based on genetic programming and symbolic regression to identify variable interactions in large datasets. In the proposed approach multiple symbolic regression runs are executed for each variable of the dataset to find potentially interesting models. The result is a variable interaction network that describes which variables are most relevant for the approximation of each variable of the dataset. This approach is applied to a macro-economic dataset with monthly observations of important economic indicators in order to identify potentially interesting dependencies of these indicators. The resulting interaction network of macro-economic indicators is briefly discussed and two of the identified models are presented in detail. The two models approximate the help wanted index and the CPI inflation in the US.

Submitted 23 September, 2013; v1 submitted 10 December, 2012; originally announced December 2012.

Comments: The original publication is available at http://link.springer.com/chapter/10.1007/978-3-642-20520-0_11

Journal ref: Applications of Evolutionary Computation, LNCS 6625 (Springer Berlin Heidelberg), pp. 101-110 (2011)
