Symbolic regression outperforms other models for small data sets

Wilstrup, Casper; Kasak, Jaan


arXiv:2103.15147 (cs)

[Submitted on 28 Mar 2021 (v1), last revised 16 May 2021 (this version, v3)]

Abstract: Machine learning is often applied in health science to obtain predictions and new understandings of complex phenomena and relationships, but a lack of sufficient data for model training is a widespread problem. Traditional machine learning techniques, such as random forests and gradient boosting, tend to overfit when working with data sets of only a few hundred observations. This study demonstrates that for small training sets of 250 observations, symbolic regression generalises better to out-of-sample data than traditional machine learning frameworks, as measured by the coefficient of determination R^2 on the validation set. In 132 out of 240 cases, symbolic regression achieves a higher R^2 than any of the other models on the out-of-sample data. Furthermore, symbolic regression preserves the interpretability of linear models and decision trees, an added benefit to its superior generalisation. The second-best algorithm was found to be a random forest, which performs best in 37 of the 240 cases. When restricting the comparison to interpretable models, symbolic regression performs best in 184 out of 240 cases.
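
The comparison in the abstract rests on two simple computations: the coefficient of determination R^2 = 1 - SS_res / SS_tot evaluated on held-out data, and a tally of which model scores highest in each case. The following is a minimal sketch of both, not the authors' code. The function names `r_squared` and `count_wins`, and the idea of storing each case as a dictionary from model name to validation R^2, are assumptions made for illustration only.

```python
from typing import Dict, List, Sequence


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_y = sum(y_true) / len(y_true)
    # Residual sum of squares: error of the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when y_true is constant")
    return 1.0 - ss_res / ss_tot


def count_wins(scores_per_case: List[Dict[str, float]]) -> Dict[str, int]:
    """Credit, for each case, the model with the highest validation R^2.

    Hypothetical helper: each case maps a model name to its R^2.
    Ties go to the model listed first in that case's dictionary.
    """
    wins: Dict[str, int] = {}
    for scores in scores_per_case:
        best = max(scores, key=scores.get)
        wins[best] = wins.get(best, 0) + 1
    return wins
```

A model that predicts the validation targets exactly scores 1.0 and one that always predicts their mean scores 0.0. Negative values are possible on out-of-sample data, which is how overfitting shows up in this metric.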

Comments:	10 pages, 2 figures, 2 tables. Pending review at BMC Bioinformatics
Subjects:	Machine Learning (cs.LG)
Cite as:	arXiv:2103.15147 [cs.LG]
	(or arXiv:2103.15147v3 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2103.15147

Submission history

From: Jaan Kasak
[v1] Sun, 28 Mar 2021 15:00:59 UTC (94 KB)
[v2] Fri, 16 Apr 2021 15:45:30 UTC (411 KB)
[v3] Sun, 16 May 2021 10:37:52 UTC (551 KB)
