-
A Comparison of Recent Algorithms for Symbolic Regression to Genetic Programming
Authors:
Yousef A. Radwan,
Gabriel Kronberger,
Stephan Winkler
Abstract:
Symbolic regression is a machine learning method with the goal to produce interpretable results. Unlike other machine learning methods such as, e.g. random forests or neural networks, which are opaque, symbolic regression aims to model and map data in a way that can be understood by scientists. Recent advancements, have attempted to bridge the gap between these two fields; new methodologies attemp…
▽ More
Symbolic regression is a machine learning method with the goal to produce interpretable results. Unlike other machine learning methods such as, e.g. random forests or neural networks, which are opaque, symbolic regression aims to model and map data in a way that can be understood by scientists. Recent advancements, have attempted to bridge the gap between these two fields; new methodologies attempt to fuse the map** power of neural networks and deep learning techniques with the explanatory power of symbolic regression. In this paper, we examine these new emerging systems and test the performance of an end-to-end transformer model for symbolic regression versus the reigning traditional methods based on genetic programming that have spearheaded symbolic regression throughout the years. We compare these systems on novel datasets to avoid bias to older methods who were improved on well-known benchmark datasets. Our results show that traditional GP methods as implemented e.g., by Operon still remain superior to two recently published symbolic regression methods.
△ Less
Submitted 5 June, 2024;
originally announced June 2024.
-
The Inefficiency of Genetic Programming for Symbolic Regression -- Extended Version
Authors:
Gabriel Kronberger,
Fabricio Olivetti de Franca,
Harry Desmond,
Deaglan J. Bartlett,
Lukas Kammerer
Abstract:
We analyse the search behaviour of genetic programming for symbolic regression in practically relevant but limited settings, allowing exhaustive enumeration of all solutions. This enables us to quantify the success probability of finding the best possible expressions, and to compare the search efficiency of genetic programming to random search in the space of semantically unique expressions. This…
▽ More
We analyse the search behaviour of genetic programming for symbolic regression in practically relevant but limited settings, allowing exhaustive enumeration of all solutions. This enables us to quantify the success probability of finding the best possible expressions, and to compare the search efficiency of genetic programming to random search in the space of semantically unique expressions. This analysis is made possible by improved algorithms for equality saturation, which we use to improve the Exhaustive Symbolic Regression algorithm; this produces the set of semantically unique expression structures, orders of magnitude smaller than the full symbolic regression search space. We compare the efficiency of random search in the set of unique expressions and genetic programming. For our experiments we use two real-world datasets where symbolic regression has been used to produce well-fitting univariate expressions: the Nikuradse dataset of flow in rough pipes and the Radial Acceleration Relation of galaxy dynamics. The results show that genetic programming in such limited settings explores only a small fraction of all unique expressions, and evaluates expressions repeatedly that are congruent to already visited expressions.
△ Less
Submitted 26 April, 2024;
originally announced April 2024.
-
A precise symbolic emulator of the linear matter power spectrum
Authors:
Deaglan J. Bartlett,
Lukas Kammerer,
Gabriel Kronberger,
Harry Desmond,
Pedro G. Ferreira,
Benjamin D. Wandelt,
Bogdan Burlacu,
David Alonso,
Matteo Zennaro
Abstract:
Computing the matter power spectrum, $P(k)$, as a function of cosmological parameters can be prohibitively slow in cosmological analyses, hence emulating this calculation is desirable. Previous analytic approximations are insufficiently accurate for modern applications, so black-box, uninterpretable emulators are often used. We utilise an efficient genetic programming based symbolic regression fra…
▽ More
Computing the matter power spectrum, $P(k)$, as a function of cosmological parameters can be prohibitively slow in cosmological analyses, hence emulating this calculation is desirable. Previous analytic approximations are insufficiently accurate for modern applications, so black-box, uninterpretable emulators are often used. We utilise an efficient genetic programming based symbolic regression framework to explore the space of potential mathematical expressions which can approximate the power spectrum and $σ_8$. We learn the ratio between an existing low-accuracy fitting function for $P(k)$ and that obtained by solving the Boltzmann equations and thus still incorporate the physics which motivated this earlier approximation. We obtain an analytic approximation to the linear power spectrum with a root mean squared fractional error of 0.2% between $k = 9\times10^{-3} - 9 \, h{\rm \, Mpc^{-1}}$ and across a wide range of cosmological parameters, and we provide physical interpretations for various terms in the expression. Our analytic approximation is 950 times faster to evaluate than camb and 36 times faster than the neural network based matter power spectrum emulator BACCO. We also provide a simple analytic approximation for $σ_8$ with a similar accuracy, with a root mean squared fractional error of just 0.1% when evaluated across the same range of cosmologies. This function is easily invertible to obtain $A_{\rm s}$ as a function of $σ_8$ and the other cosmological parameters, if preferred. It is possible to obtain symbolic approximations to a seemingly complex function at a precision required for current and future cosmological analyses without resorting to deep-learning techniques, thus avoiding their black-box nature and large number of parameters. Our emulator will be usable long after the codes on which numerical approximations are built become outdated.
△ Less
Submitted 15 April, 2024; v1 submitted 27 November, 2023;
originally announced November 2023.
-
Learning Difference Equations with Structured Grammatical Evolution for Postprandial Glycaemia Prediction
Authors:
Daniel Parra,
David Joedicke,
J. Manuel Velasco,
Gabriel Kronberger,
J. Ignacio Hidalgo
Abstract:
People with diabetes must carefully monitor their blood glucose levels, especially after eating. Blood glucose regulation requires a proper combination of food intake and insulin boluses. Glucose prediction is vital to avoid dangerous post-meal complications in treating individuals with diabetes. Although traditional methods, such as artificial neural networks, have shown high accuracy rates, some…
▽ More
People with diabetes must carefully monitor their blood glucose levels, especially after eating. Blood glucose regulation requires a proper combination of food intake and insulin boluses. Glucose prediction is vital to avoid dangerous post-meal complications in treating individuals with diabetes. Although traditional methods, such as artificial neural networks, have shown high accuracy rates, sometimes they are not suitable for develo** personalised treatments by physicians due to their lack of interpretability. In this study, we propose a novel glucose prediction method emphasising interpretability: Interpretable Sparse Identification by Grammatical Evolution. Combined with a previous clustering stage, our approach provides finite difference equations to predict postprandial glucose levels up to two hours after meals. We divide the dataset into four-hour segments and perform clustering based on blood glucose values for the twohour window before the meal. Prediction models are trained for each cluster for the two-hour windows after meals, allowing predictions in 15-minute steps, yielding up to eight predictions at different time horizons. Prediction safety was evaluated based on Parkes Error Grid regions. Our technique produces safe predictions through explainable expressions, avoiding zones D (0.2% average) and E (0%) and reducing predictions on zone C (6.2%). In addition, our proposal has slightly better accuracy than other techniques, including sparse identification of non-linear dynamics and artificial neural networks. The results demonstrate that our proposal provides interpretable solutions without sacrificing prediction accuracy, offering a promising approach to glucose prediction in diabetes management that balances accuracy, interpretability, and computational efficiency.
△ Less
Submitted 3 July, 2023;
originally announced July 2023.
-
Steel Phase Kinetics Modeling using Symbolic Regression
Authors:
David Piringer,
Bernhard Bloder,
Gabriel Kronberger
Abstract:
We describe an approach for empirical modeling of steel phase kinetics based on symbolic regression and genetic programming. The algorithm takes processed data gathered from dilatometer measurements and produces a system of differential equations that models the phase kinetics. Our initial results demonstrate that the proposed approach allows to identify compact differential equations that fit the…
▽ More
We describe an approach for empirical modeling of steel phase kinetics based on symbolic regression and genetic programming. The algorithm takes processed data gathered from dilatometer measurements and produces a system of differential equations that models the phase kinetics. Our initial results demonstrate that the proposed approach allows to identify compact differential equations that fit the data. The model predicts ferrite, pearlite and bainite formation for a single steel type. Martensite is not yet included in the model. Future work shall incorporate martensite and generalize to multiple steel types with different chemical compositions.
△ Less
Submitted 19 December, 2022;
originally announced December 2022.
-
Identifying Differential Equations to predict Blood Glucose using Sparse Identification of Nonlinear Systems
Authors:
David Jödicke,
Daniel Parra,
Gabriel Kronberger,
Stephan Winkler
Abstract:
Describing dynamic medical systems using machine learning is a challenging topic with a wide range of applications. In this work, the possibility of modeling the blood glucose level of diabetic patients purely on the basis of measured data is described. A combination of the influencing variables insulin and calories are used to find an interpretable model. The absorption speed of external substanc…
▽ More
Describing dynamic medical systems using machine learning is a challenging topic with a wide range of applications. In this work, the possibility of modeling the blood glucose level of diabetic patients purely on the basis of measured data is described. A combination of the influencing variables insulin and calories are used to find an interpretable model. The absorption speed of external substances in the human body depends strongly on external influences, which is why time-shifts are added for the influencing variables. The focus is put on identifying the best timeshifts that provide robust models with good prediction accuracy that are independent of other unknown external influences. The modeling is based purely on the measured data using Sparse Identification of Nonlinear Dynamics. A differential equation is determined which, starting from an initial value, simulates blood glucose dynamics. By applying the best model to test data, we can show that it is possible to simulate the long-term blood glucose dynamics using differential equations and few, influencing variables.
△ Less
Submitted 28 September, 2022;
originally announced September 2022.
-
Symbolic Regression with Fast Function Extraction and Nonlinear Least Squares Optimization
Authors:
Lukas Kammerer,
Gabriel Kronberger,
Michael Kommenda
Abstract:
Fast Function Extraction (FFX) is a deterministic algorithm for solving symbolic regression problems. We improve the accuracy of FFX by adding parameters to the arguments of nonlinear functions. Instead of only optimizing linear parameters, we optimize these additional nonlinear parameters with separable nonlinear least squared optimization using a variable projection algorithm. Both FFX and our n…
▽ More
Fast Function Extraction (FFX) is a deterministic algorithm for solving symbolic regression problems. We improve the accuracy of FFX by adding parameters to the arguments of nonlinear functions. Instead of only optimizing linear parameters, we optimize these additional nonlinear parameters with separable nonlinear least squared optimization using a variable projection algorithm. Both FFX and our new algorithm is applied on the PennML benchmark suite. We show that the proposed extensions of FFX leads to higher accuracy while providing models of similar length and with only a small increase in runtime on the given data. Our results are compared to a large set of regression methods that were already published for the given benchmark suite.
△ Less
Submitted 20 September, 2022;
originally announced September 2022.
-
Comparing Shape-Constrained Regression Algorithms for Data Validation
Authors:
Florian Bachinger,
Gabriel Kronberger
Abstract:
Industrial and scientific applications handle large volumes of data that render manual validation by humans infeasible. Therefore, we require automated data validation approaches that are able to consider the prior knowledge of domain experts to produce dependable, trustworthy assessments of data quality. Prior knowledge is often available as rules that describe interactions of inputs with regard…
▽ More
Industrial and scientific applications handle large volumes of data that render manual validation by humans infeasible. Therefore, we require automated data validation approaches that are able to consider the prior knowledge of domain experts to produce dependable, trustworthy assessments of data quality. Prior knowledge is often available as rules that describe interactions of inputs with regard to the target e.g. the target must be monotonically decreasing and convex over increasing input values. Domain experts are able to validate multiple such interactions at a glance. However, existing rule-based data validation approaches are unable to consider these constraints. In this work, we compare different shape-constrained regression algorithms for the purpose of data validation based on their classification accuracy and runtime performance.
△ Less
Submitted 20 September, 2022;
originally announced September 2022.
-
Prediction Intervals and Confidence Regions for Symbolic Regression Models based on Likelihood Profiles
Authors:
Fabricio Olivetti de Franca,
Gabriel Kronberger
Abstract:
Symbolic regression is a nonlinear regression method which is commonly performed by an evolutionary computation method such as genetic programming. Quantification of uncertainty of regression models is important for the interpretation of models and for decision making. The linear approximation and so-called likelihood profiles are well-known possibilities for the calculation of confidence and pred…
▽ More
Symbolic regression is a nonlinear regression method which is commonly performed by an evolutionary computation method such as genetic programming. Quantification of uncertainty of regression models is important for the interpretation of models and for decision making. The linear approximation and so-called likelihood profiles are well-known possibilities for the calculation of confidence and prediction intervals for nonlinear regression models. These simple and effective techniques have been completely ignored so far in the genetic programming literature. In this work we describe the calculation of likelihood profiles in details and also provide some illustrative examples with models created with three different symbolic regression algorithms on two different datasets. The examples highlight the importance of the likelihood profiles to understand the limitations of symbolic regression models and to help the user taking an informed post-prediction decision.
△ Less
Submitted 14 September, 2022;
originally announced September 2022.
-
Local Optimization Often is Ill-conditioned in Genetic Programming for Symbolic Regression
Authors:
Gabriel Kronberger
Abstract:
Gradient-based local optimization has been shown to improve results of genetic programming (GP) for symbolic regression. Several state-of-the-art GP implementations use iterative nonlinear least squares (NLS) algorithms such as the Levenberg-Marquardt algorithm for local optimization. The effectiveness of NLS algorithms depends on appropriate scaling and conditioning of the optimization problem. T…
▽ More
Gradient-based local optimization has been shown to improve results of genetic programming (GP) for symbolic regression. Several state-of-the-art GP implementations use iterative nonlinear least squares (NLS) algorithms such as the Levenberg-Marquardt algorithm for local optimization. The effectiveness of NLS algorithms depends on appropriate scaling and conditioning of the optimization problem. This has so far been ignored in symbolic regression and GP literature. In this study we use a singular value decomposition of NLS Jacobian matrices to determine the numeric rank and the condition number. We perform experiments with a GP implementation and six different benchmark datasets. Our results show that rank-deficient and ill-conditioned Jacobian matrices occur frequently and for all datasets. The issue is less extreme when restricting GP tree size and when using many non-linear functions in the function set.
△ Less
Submitted 2 September, 2022;
originally announced September 2022.
-
Symbolic Regression in Materials Science: Discovering Interatomic Potentials from Data
Authors:
Bogdan Burlacu,
Michael Kommenda,
Gabriel Kronberger,
Stephan Winkler,
Michael Affenzeller
Abstract:
Particle-based modeling of materials at atomic scale plays an important role in the development of new materials and understanding of their properties. The accuracy of particle simulations is determined by interatomic potentials, which allow to calculate the potential energy of an atomic system as a function of atomic coordinates and potentially other properties. First-principles-based ab initio p…
▽ More
Particle-based modeling of materials at atomic scale plays an important role in the development of new materials and understanding of their properties. The accuracy of particle simulations is determined by interatomic potentials, which allow to calculate the potential energy of an atomic system as a function of atomic coordinates and potentially other properties. First-principles-based ab initio potentials can reach arbitrary levels of accuracy, however their aplicability is limited by their high computational cost.
Machine learning (ML) has recently emerged as an effective way to offset the high computational costs of ab initio atomic potentials by replacing expensive models with highly efficient surrogates trained on electronic structure data. Among a plethora of current methods, symbolic regression (SR) is gaining traction as a powerful "white-box" approach for discovering functional forms of interatomic potentials.
This contribution discusses the role of symbolic regression in Materials Science (MS) and offers a comprehensive overview of current methodological challenges and state-of-the-art results. A genetic programming-based approach for modeling atomic potentials from raw data (consisting of snapshots of atomic positions and associated potential energy) is presented and empirically validated on ab initio electronic structure data.
△ Less
Submitted 21 July, 2022; v1 submitted 13 June, 2022;
originally announced June 2022.
-
Optimization Networks for Integrated Machine Learning
Authors:
Michael Kommenda,
Johannes Karder,
Andreas Beham,
Bogdan Burlacu,
Gabriel Kronberger,
Stefan Wagner,
Michael Affenzeller
Abstract:
Optimization networks are a new methodology for holistically solving interrelated problems that have been developed with combinatorial optimization problems in mind. In this contribution we revisit the core principles of optimization networks and demonstrate their suitability for solving machine learning problems. We use feature selection in combination with linear model creation as a benchmark ap…
▽ More
Optimization networks are a new methodology for holistically solving interrelated problems that have been developed with combinatorial optimization problems in mind. In this contribution we revisit the core principles of optimization networks and demonstrate their suitability for solving machine learning problems. We use feature selection in combination with linear model creation as a benchmark application and compare the results of optimization networks to ordinary least squares with optional elastic net regularization. Based on this example we justify the advantages of optimization networks by adapting the network to solve other machine learning problems. Finally, optimization analysis is presented, where optimal input values of a system have to be found to achieve desired output values. Optimization analysis can be divided into three subproblems: model creation to describe the system, model selection to choose the most appropriate one and parameter optimization to obtain the input values. Therefore, optimization networks are an obvious choice for handling optimization analysis tasks.
△ Less
Submitted 1 September, 2021;
originally announced October 2021.
-
Cluster Analysis of a Symbolic Regression Search Space
Authors:
Gabriel Kronberger,
Lukas Kammerer,
Bogdan Burlacu,
Stephan M. Winkler,
Michael Kommenda,
Michael Affenzeller
Abstract:
In this chapter we take a closer look at the distribution of symbolic regression models generated by genetic programming in the search space. The motivation for this work is to improve the search for well-fitting symbolic regression models by using information about the similarity of models that can be precomputed independently from the target function. For our analysis, we use a restricted gramma…
▽ More
In this chapter we take a closer look at the distribution of symbolic regression models generated by genetic programming in the search space. The motivation for this work is to improve the search for well-fitting symbolic regression models by using information about the similarity of models that can be precomputed independently from the target function. For our analysis, we use a restricted grammar for uni-variate symbolic regression models and generate all possible models up to a fixed length limit. We identify unique models and cluster them based on phenotypic as well as genotypic similarity. We find that phenotypic similarity leads to well-defined clusters while genotypic similarity does not produce a clear clustering. By map** solution candidates visited by GP to the enumerated search space we find that GP initially explores the whole search space and later converges to the subspace of highest quality expressions in a run for a simple benchmark problem.
△ Less
Submitted 28 September, 2021;
originally announced September 2021.
-
Symbolic Regression by Exhaustive Search: Reducing the Search Space Using Syntactical Constraints and Efficient Semantic Structure Deduplication
Authors:
Lukas Kammerer,
Gabriel Kronberger,
Bogdan Burlacu,
Stephan M. Winkler,
Michael Kommenda,
Michael Affenzeller
Abstract:
Symbolic regression is a powerful system identification technique in industrial scenarios where no prior knowledge on model structure is available. Such scenarios often require specific model properties such as interpretability, robustness, trustworthiness and plausibility, that are not easily achievable using standard approaches like genetic programming for symbolic regression. In this chapter we…
▽ More
Symbolic regression is a powerful system identification technique in industrial scenarios where no prior knowledge on model structure is available. Such scenarios often require specific model properties such as interpretability, robustness, trustworthiness and plausibility, that are not easily achievable using standard approaches like genetic programming for symbolic regression. In this chapter we introduce a deterministic symbolic regression algorithm specifically designed to address these issues. The algorithm uses a context-free grammar to produce models that are parameterized by a non-linear least squares local optimization procedure. A finite enumeration of all possible models is guaranteed by structural restrictions as well as a caching mechanism for detecting semantically equivalent solutions. Enumeration order is established via heuristics designed to improve search efficiency. Empirical tests on a comprehensive benchmark suite show that our approach is competitive with genetic programming in many noiseless problems while maintaining desirable properties such as simple, reliable models and reproducibility.
△ Less
Submitted 28 September, 2021;
originally announced September 2021.
-
Complexity Measures for Multi-objective Symbolic Regression
Authors:
Michael Kommenda,
Andreas Beham,
Michael Affenzeller,
Gabriel Kronberger
Abstract:
Multi-objective symbolic regression has the advantage that while the accuracy of the learned models is maximized, the complexity is automatically adapted and need not be specified a-priori. The result of the optimization is not a single solution anymore, but a whole Pareto-front describing the trade-off between accuracy and complexity. In this contribution we study which complexity measures are mo…
▽ More
Multi-objective symbolic regression has the advantage that while the accuracy of the learned models is maximized, the complexity is automatically adapted and need not be specified a-priori. The result of the optimization is not a single solution anymore, but a whole Pareto-front describing the trade-off between accuracy and complexity. In this contribution we study which complexity measures are most appropriately used in symbolic regression when performing multi- objective optimization with NSGA-II. Furthermore, we present a novel complexity measure that includes semantic information based on the function symbols occurring in the models and test its effects on several benchmark datasets. Results comparing multiple complexity measures are presented in terms of the achieved accuracy and model length to illustrate how the search direction of the algorithm is affected.
△ Less
Submitted 1 September, 2021;
originally announced September 2021.
-
Data Aggregation for Reducing Training Data in Symbolic Regression
Authors:
Lukas Kammerer,
Gabriel Kronberger,
Michael Kommenda
Abstract:
The growing volume of data makes the use of computationally intense machine learning techniques such as symbolic regression with genetic programming more and more impractical. This work discusses methods to reduce the training data and thereby also the runtime of genetic programming. The data is aggregated in a preprocessing step before running the actual machine learning algorithm. K-means cluste…
▽ More
The growing volume of data makes the use of computationally intense machine learning techniques such as symbolic regression with genetic programming more and more impractical. This work discusses methods to reduce the training data and thereby also the runtime of genetic programming. The data is aggregated in a preprocessing step before running the actual machine learning algorithm. K-means clustering and data binning is used for data aggregation and compared with random sampling as the simplest data reduction method. We analyze the achieved speed-up in training and the effects on the trained models test accuracy for every method on four real-world data sets. The performance of genetic programming is compared with random forests and linear regression. It is shown, that k-means and random sampling lead to very small loss in test accuracy when the data is reduced down to only 30% of the original data, while the speed-up is proportional to the size of the data set. Binning on the contrary, leads to models with very high test error.
△ Less
Submitted 24 August, 2021;
originally announced August 2021.
-
Smooth Symbolic Regression: Transformation of Symbolic Regression into a Real-valued Optimization Problem
Authors:
Erik Pitzer,
Gabriel Kronberger
Abstract:
The typical methods for symbolic regression produce rather abrupt changes in solution candidates. In this work, we have tried to transform symbolic regression from an optimization problem, with a landscape that is so rugged that typical analysis methods do not produce meaningful results, to one that can be compared to typical and very smooth real-valued problems. While the ruggedness might not int…
▽ More
The typical methods for symbolic regression produce rather abrupt changes in solution candidates. In this work, we have tried to transform symbolic regression from an optimization problem, with a landscape that is so rugged that typical analysis methods do not produce meaningful results, to one that can be compared to typical and very smooth real-valued problems. While the ruggedness might not interfere with the performance of optimization, it restricts the possibilities of analysis. Here, we have explored different aspects of a transformation and propose a simple procedure to create real-valued optimization problems from symbolic regression problems.
△ Less
Submitted 6 August, 2021;
originally announced August 2021.
-
Concept Drift Detection with Variable Interaction Networks
Authors:
Jan Zenisek,
Gabriel Kronberger,
Josef Wolfartsberger,
Norbert Wild,
Michael Affenzeller
Abstract:
The current development of today's production industry towards seamless sensor-based monitoring is paving the way for concepts such as Predictive Maintenance. By this means, the condition of plants and products in future production lines will be continuously analyzed with the objective to predict any kind of breakdown and trigger preventing actions proactively. Such ambitious predictions are commo…
▽ More
The current development of today's production industry towards seamless sensor-based monitoring is paving the way for concepts such as Predictive Maintenance. By this means, the condition of plants and products in future production lines will be continuously analyzed with the objective to predict any kind of breakdown and trigger preventing actions proactively. Such ambitious predictions are commonly performed with support of machine learning algorithms. In this work, we utilize these algorithms to model complex systems, such as production plants, by focusing on their variable interactions. The core of this contribution is a sliding window based algorithm, designed to detect changes of the identified interactions, which might indicate beginning malfunctions in the context of a monitored production plant. Besides a detailed description of the algorithm, we present results from experiments with a synthetic dynamical system, simulating stable and drifting system behavior.
△ Less
Submitted 6 August, 2021;
originally announced August 2021.
-
Extending a Physics-Based Constitutive Model using Genetic Programming
Authors:
Gabriel Kronberger,
Evgeniya Kabliman,
Johannes Kronsteiner,
Michael Kommenda
Abstract:
In material science, models are derived to predict emergent material properties (e.g. elasticity, strength, conductivity) and their relations to processing conditions. A major drawback is the calibration of model parameters that depend on processing conditions. Currently, these parameters must be optimized to fit measured data since their relations to processing conditions (e.g. deformation temper…
▽ More
In material science, models are derived to predict emergent material properties (e.g. elasticity, strength, conductivity) and their relations to processing conditions. A major drawback is the calibration of model parameters that depend on processing conditions. Currently, these parameters must be optimized to fit measured data since their relations to processing conditions (e.g. deformation temperature, strain rate) are not fully understood. We present a new approach that identifies the functional dependency of calibration parameters from processing conditions based on genetic programming. We propose two (explicit and implicit) methods to identify these dependencies and generate short interpretable expressions. The approach is used to extend a physics-based constitutive model for deformation processes. This constitutive model operates with internal material variables such as a dislocation density and contains a number of parameters, among them three calibration parameters. The derived expressions extend the constitutive model and replace the calibration parameters. Thus, interpolation between various processing parameters is enabled.
Our results show that the implicit method is computationally more expensive than the explicit approach but also produces significantly better results.
△ Less
Submitted 19 November, 2021; v1 submitted 3 August, 2021;
originally announced August 2021.
-
Concept for a Technical Infrastructure for Management of Predictive Models in Industrial Applications
Authors:
Florian Bachinger,
Gabriel Kronberger
Abstract:
With the increasing number of created and deployed prediction models and the complexity of machine learning workflows we require so called model management systems to support data scientists in their tasks. In this work we describe our technological concept for such a model management system. This concept includes versioned storage of data, support for different machine learning algorithms, fine t…
▽ More
With the increasing number of created and deployed prediction models and the complexity of machine learning workflows we require so called model management systems to support data scientists in their tasks. In this work we describe our technological concept for such a model management system. This concept includes versioned storage of data, support for different machine learning algorithms, fine tuning of models, subsequent deployment of models and monitoring of model performance after deployment. We describe this concept with a close focus on model lifecycle requirements stemming from our industry application cases, but generalize key features that are relevant for all applications of machine learning.
△ Less
Submitted 29 July, 2021;
originally announced July 2021.
-
Hash-Based Tree Similarity and Simplification in Genetic Programming for Symbolic Regression
Authors:
Bogdan Burlacu,
Lukas Kammerer,
Michael Affenzeller,
Gabriel Kronberger
Abstract:
We introduce in this paper a runtime-efficient tree hashing algorithm for the identification of isomorphic subtrees, with two important applications in genetic programming for symbolic regression: fast, online calculation of population diversity and algebraic simplification of symbolic expression trees. Based on this hashing approach, we propose a simple diversity-preservation mechanism with promi…
▽ More
We introduce in this paper a runtime-efficient tree hashing algorithm for the identification of isomorphic subtrees, with two important applications in genetic programming for symbolic regression: fast, online calculation of population diversity and algebraic simplification of symbolic expression trees. Based on this hashing approach, we propose a simple diversity-preservation mechanism with promising results on a collection of symbolic regression benchmark problems.
△ Less
Submitted 22 July, 2021;
originally announced July 2021.
-
Predicting Friction System Performance with Symbolic Regression and Genetic Programming with Factor Variables
Authors:
Gabriel Kronberger,
Michael Kommenda,
Andreas Promberger,
Falk Nickel
Abstract:
Friction systems are mechanical systems wherein friction is used for force transmission (e.g. mechanical braking systems or automatic gearboxes). For finding optimal and safe design parameters, engineers have to predict friction system performance. This is especially difficult in real-world applications, because it is affected by many parameters. We have used symbolic regression and genetic progra…
▽ More
Friction systems are mechanical systems wherein friction is used for force transmission (e.g. mechanical braking systems or automatic gearboxes). For finding optimal and safe design parameters, engineers have to predict friction system performance. This is especially difficult in real-world applications, because it is affected by many parameters. We have used symbolic regression and genetic programming for finding accurate and trustworthy prediction models for this task. However, it is not straight-forward how nominal variables can be included. In particular, a one-hot-encoding is unsatisfactory because genetic programming tends to remove such indicator variables. We have therefore used so-called factor variables for representing nominal variables in symbolic regression models. Our results show that GP is able to produce symbolic regression models for predicting friction performance with predictive accuracy that is comparable to artificial neural networks. The symbolic regression models with factor variables are less complex than models using a one-hot encoding.
△ Less
Submitted 19 July, 2021;
originally announced July 2021.
-
Using Shape Constraints for Improving Symbolic Regression Models
Authors:
Christian Haider,
Fabricio Olivetti de França,
Bogdan Burlacu,
Gabriel Kronberger
Abstract:
We describe and analyze algorithms for shape-constrained symbolic regression, which allows the inclusion of prior knowledge about the shape of the regression function. This is relevant in many areas of engineering -- in particular whenever a data-driven model obtained from measurements must have certain properties (e.g. positivity, monotonicity or convexity/concavity). We implement shape constrain…
▽ More
We describe and analyze algorithms for shape-constrained symbolic regression, which allows the inclusion of prior knowledge about the shape of the regression function. This is relevant in many areas of engineering -- in particular whenever a data-driven model obtained from measurements must have certain properties (e.g. positivity, monotonicity or convexity/concavity). We implement shape constraints using a soft-penalty approach which uses multi-objective algorithms to minimize constraint violations and training error. We use the non-dominated sorting genetic algorithm (NSGA-II) as well as the multi-objective evolutionary algorithm based on decomposition (MOEA/D). We use a set of models from physics textbooks to test the algorithms and compare against earlier results with single-objective algorithms. The results show that all algorithms are able to find models which conform to all shape constraints. Using shape constraints helps to improve extrapolation behavior of the models.
△ Less
Submitted 20 July, 2021;
originally announced July 2021.
-
Identification of Dynamical Systems using Symbolic Regression
Authors:
Gabriel Kronberger,
Lukas Kammerer,
Michael Kommenda
Abstract:
We describe a method for the identification of models for dynamical systems from observational data. The method is based on the concept of symbolic regression and uses genetic programming to evolve a system of ordinary differential equations (ODE). The novelty is that we add a step of gradient-based optimization of the ODE parameters. For this we calculate the sensitivities of the solution to the…
▽ More
We describe a method for the identification of models for dynamical systems from observational data. The method is based on the concept of symbolic regression and uses genetic programming to evolve a system of ordinary differential equations (ODE). The novelty is that we add a step of gradient-based optimization of the ODE parameters. For this we calculate the sensitivities of the solution to the initial value problem (IVP) using automatic differentiation. The proposed approach is tested on a set of 19 problem instances taken from the literature which includes datasets from simulated systems as well as datasets captured from mechanical systems. We find that gradient-based optimization of parameters improves predictive accuracy of the models. The best results are obtained when we first fit the individual equations to the numeric differences and then subsequently fine-tune the identified parameter values by fitting the IVP solution to the observed variable values.
△ Less
Submitted 6 July, 2021;
originally announced July 2021.
-
Shape-constrained Symbolic Regression -- Improving Extrapolation with Prior Knowledge
Authors:
Gabriel Kronberger,
Fabricio Olivetti de França,
Bogdan Burlacu,
Christian Haider,
Michael Kommenda
Abstract:
We investigate the addition of constraints on the function image and its derivatives for the incorporation of prior knowledge in symbolic regression. The approach is called shape-constrained symbolic regression and allows us to enforce e.g. monotonicity of the function over selected inputs. The aim is to find models which conform to expected behaviour and which have improved extrapolation capabili…
▽ More
We investigate the addition of constraints on the function image and its derivatives for the incorporation of prior knowledge in symbolic regression. The approach is called shape-constrained symbolic regression and allows us to enforce e.g. monotonicity of the function over selected inputs. The aim is to find models which conform to expected behaviour and which have improved extrapolation capabilities. We demonstrate the feasibility of the idea and propose and compare two evolutionary algorithms for shape-constrained symbolic regression: i) an extension of tree-based genetic programming which discards infeasible solutions in the selection step, and ii) a two population evolutionary algorithm that separates the feasible from the infeasible solutions. In both algorithms we use interval arithmetic to approximate bounds for models and their partial derivatives. The algorithms are tested on a set of 19 synthetic and four real-world regression problems. Both algorithms are able to identify models which conform to shape constraints which is not the case for the unmodified symbolic regression algorithms. However, the predictive accuracy of models with constraints is worse on the training set and the test set. Shape-constrained polynomial regression produces the best results for the test set but also significantly larger models.
△ Less
Submitted 31 May, 2021; v1 submitted 29 March, 2021;
originally announced March 2021.
-
Online Diversity Control in Symbolic Regression via a Fast Hash-based Tree Similarity Measure
Authors:
Bogdan Burlacu,
Michael Affenzeller,
Gabriel Kronberger,
Michael Kommenda
Abstract:
Diversity represents an important aspect of genetic programming, being directly correlated with search performance. When considered at the genotype level, diversity often requires expensive tree distance measures which have a negative impact on the algorithm's runtime performance. In this work we introduce a fast, hash-based tree distance measure to massively speed-up the calculation of population…
▽ More
Diversity represents an important aspect of genetic programming, being directly correlated with search performance. When considered at the genotype level, diversity often requires expensive tree distance measures which have a negative impact on the algorithm's runtime performance. In this work we introduce a fast, hash-based tree distance measure to massively speed-up the calculation of population diversity during the algorithmic run. We combine this measure with the standard GA and the NSGA-II genetic algorithms to steer the search towards higher diversity. We validate the approach on a collection of benchmark problems for symbolic regression where our method consistently outperforms the standard GA as well as NSGA-II configurations with different secondary objectives.
△ Less
Submitted 3 February, 2019;
originally announced February 2019.
-
Data Mining using Unguided Symbolic Regression on a Blast Furnace Dataset
Authors:
Michael Kommenda,
Gabriel Kronberger,
Christoph Feilmayr,
Michael Affenzeller
Abstract:
In this paper a data mining approach for variable selection and knowledge extraction from datasets is presented. The approach is based on unguided symbolic regression (every variable present in the dataset is treated as the target variable in multiple regression runs) and a novel variable relevance metric for genetic programming. The relevance of each input variable is calculated and a model appro…
▽ More
In this paper a data mining approach for variable selection and knowledge extraction from datasets is presented. The approach is based on unguided symbolic regression (every variable present in the dataset is treated as the target variable in multiple regression runs) and a novel variable relevance metric for genetic programming. The relevance of each input variable is calculated and a model approximating the target variable is created. The genetic programming configurations with different target variables are executed multiple times to reduce stochastic effects and the aggregated results are displayed as a variable interaction network. This interaction network highlights important system components and implicit relations between the variables. The whole approach is tested on a blast furnace dataset, because of the complexity of the blast furnace and the many interrelations between the variables. Finally the achieved results are discussed with respect to existing knowledge about the blast furnace process.
△ Less
Submitted 23 September, 2013;
originally announced September 2013.
-
On the Success Rate of Crossover Operators for Genetic Programming with Offspring Selection
Authors:
Gabriel Kronberger,
Stephan Winkler,
Michael Affenzeller,
Andreas Beham,
Stefan Wagner
Abstract:
Genetic programming is a powerful heuristic search technique that is used for a number of real world applications to solve among others regression, classification, and time-series forecasting problems. A lot of progress towards a theoretic description of genetic programming in form of schema theorems has been made, but the internal dynamics and success factors of genetic programming are still not…
▽ More
Genetic programming is a powerful heuristic search technique that is used for a number of real world applications to solve among others regression, classification, and time-series forecasting problems. A lot of progress towards a theoretic description of genetic programming in form of schema theorems has been made, but the internal dynamics and success factors of genetic programming are still not fully understood. In particular, the effects of different crossover operators in combination with offspring selection are largely unknown.
This contribution sheds light on the ability of well-known GP crossover operators to create better offspring when applied to benchmark problems. We conclude that standard (sub-tree swap**) crossover is a good default choice in combination with offspring selection, and that GP with offspring selection and random selection of crossover operators can improve the performance of the algorithm in terms of best solution quality when no solution size constraints are applied.
△ Less
Submitted 23 September, 2013;
originally announced September 2013.
-
Declarative Modeling and Bayesian Inference of Dark Matter Halos
Authors:
Gabriel Kronberger
Abstract:
Probabilistic programming allows specification of probabilistic models in a declarative manner. Recently, several new software systems and languages for probabilistic programming have been developed on the basis of newly developed and improved methods for approximate inference in probabilistic models. In this contribution a probabilistic model for an idealized dark matter localization problem is d…
▽ More
Probabilistic programming allows specification of probabilistic models in a declarative manner. Recently, several new software systems and languages for probabilistic programming have been developed on the basis of newly developed and improved methods for approximate inference in probabilistic models. In this contribution a probabilistic model for an idealized dark matter localization problem is described. We first derive the probabilistic model for the inference of dark matter locations and masses, and then show how this model can be implemented using BUGS and Infer.NET, two software systems for probabilistic programming. Finally, the different capabilities of both systems are discussed. The presented dark matter model includes mainly non-conjugate factors, thus, it is difficult to implement this model with Infer.NET.
△ Less
Submitted 2 June, 2013;
originally announced June 2013.
-
Evolution of Covariance Functions for Gaussian Process Regression using Genetic Programming
Authors:
Gabriel Kronberger,
Michael Kommenda
Abstract:
In this contribution we describe an approach to evolve composite covariance functions for Gaussian processes using genetic programming. A critical aspect of Gaussian processes and similar kernel-based models such as SVM is, that the covariance function should be adapted to the modeled data. Frequently, the squared exponential covariance function is used as a default. However, this can lead to a mi…
▽ More
In this contribution we describe an approach to evolve composite covariance functions for Gaussian processes using genetic programming. A critical aspect of Gaussian processes and similar kernel-based models such as SVM is, that the covariance function should be adapted to the modeled data. Frequently, the squared exponential covariance function is used as a default. However, this can lead to a misspecified model, which does not fit the data well. In the proposed approach we use a grammar for the composition of covariance functions and genetic programming to search over the space of sentences that can be derived from the grammar. We tested the proposed approach on synthetic data from two-dimensional test functions, and on the Mauna Loa CO2 time series. The results show, that our approach is feasible, finding covariance functions that perform much better than a default covariance function. For the CO2 data set a composite covariance function is found, that matches the performance of a hand-tuned covariance function.
△ Less
Submitted 22 May, 2013; v1 submitted 16 May, 2013;
originally announced May 2013.
-
Macro-Economic Time Series Modeling and Interaction Networks
Authors:
Gabriel Kronberger,
Stefan Fink,
Michael Kommenda,
Michael Affenzeller
Abstract:
Macro-economic models describe the dynamics of economic quantities. The estimations and forecasts produced by such models play a substantial role for financial and political decisions. In this contribution we describe an approach based on genetic programming and symbolic regression to identify variable interactions in large datasets. In the proposed approach multiple symbolic regression runs are e…
▽ More
Macro-economic models describe the dynamics of economic quantities. The estimations and forecasts produced by such models play a substantial role for financial and political decisions. In this contribution we describe an approach based on genetic programming and symbolic regression to identify variable interactions in large datasets. In the proposed approach multiple symbolic regression runs are executed for each variable of the dataset to find potentially interesting models. The result is a variable interaction network that describes which variables are most relevant for the approximation of each variable of the dataset. This approach is applied to a macro-economic dataset with monthly observations of important economic indicators in order to identify potentially interesting dependencies of these indicators. The resulting interaction network of macro-economic indicators is briefly discussed and two of the identified models are presented in detail. The two models approximate the help wanted index and the CPI inflation in the US.
△ Less
Submitted 23 September, 2013; v1 submitted 10 December, 2012;
originally announced December 2012.