Search | arXiv e-print repository

Co-data Learning for Bayesian Additive Regression Trees

Authors: Jeroen M. Goedhart, Thomas Klausch, Jurriaan Janssen, Mark A. van de Wiel

Abstract: Medical prediction applications often need to deal with small sample sizes compared to the number of covariates. Such data pose problems for prediction and variable selection, especially when the covariate-response relationship is complicated. To address these challenges, we propose to incorporate co-data, i.e. external information on the covariates, into Bayesian additive regression trees (BART), a sum-of-trees prediction model that utilizes priors on the tree parameters to prevent overfitting. To incorporate co-data, an empirical Bayes (EB) framework is developed that estimates, assisted by a co-data model, prior covariate weights in the BART model. The proposed method can handle multiple types of co-data simultaneously. Furthermore, the proposed EB framework enables the estimation of the other hyperparameters of BART as well, rendering an appealing alternative to cross-validation. We show that the method finds relevant covariates and that it improves prediction compared to default BART in simulations. If the covariate-response relationship is nonlinear, the method benefits from the flexibility of BART to outperform regression-based co-data learners. Finally, the use of co-data enhances prediction in an application to diffuse large B-cell lymphoma prognosis based on clinical covariates, gene mutations, DNA translocations, and DNA copy number data. Keywords: Bayesian additive regression trees; Empirical Bayes; Co-data; High-dimensional data; Omics; Prediction

Submitted 16 November, 2023; originally announced November 2023.

Comments: 30 pages, 3 Figures, 2 Tables

arXiv:2301.03964

Trade-offs between cost and information in cellular prediction

Authors: Age J. Tjalma, Vahe Galstyan, Jeroen Goedhart, Lotte Slim, Nils B. Becker, Pieter Rein ten Wolde

Abstract: Living cells can leverage correlations in environmental fluctuations to predict the future environment and mount a response ahead of time. To this end, cells need to encode the past signal into the output of the intracellular network from which the future input is predicted. Yet, storing information is costly while not all features of the past signal are equally informative on the future input signal. Here, we show, for two classes of input signals, that cellular networks can reach the fundamental bound on the predictive information as set by the information extracted from the past signal: push-pull networks can reach this information bound for Markovian signals, while networks that take a temporal derivative can reach the bound for predicting the future derivative of non-Markovian signals. However, the bits of past information that are most informative about the future signal are also prohibitively costly. As a result, the optimal system that maximizes the predictive information for a given resource cost is, in general, not at the information bound. Applying our theory to the chemotaxis network of Escherichia coli reveals that its adaptive kernel is optimal for predicting future concentration changes over a broad range of background concentrations, and that the system has been tailored to predicting these changes in shallow gradients.

Submitted 10 January, 2023; originally announced January 2023.

arXiv:2206.03825

Estimation of Predictive Performance in High-Dimensional Data Settings using Learning Curves

Authors: Jeroen M. Goedhart, Thomas Klausch, Mark A. van de Wiel

Abstract: In high-dimensional prediction settings, it remains challenging to reliably estimate the test performance. To address this challenge, a novel performance estimation framework is presented. This framework, called Learn2Evaluate, is based on learning curves by fitting a smooth monotone curve depicting test performance as a function of the sample size. Learn2Evaluate has several advantages compared to commonly applied performance estimation methodologies. Firstly, a learning curve offers a graphical overview of a learner. This overview assists in assessing the potential benefit of adding training samples and it provides a more complete comparison between learners than performance estimates at a fixed subsample size. Secondly, a learning curve facilitates in estimating the performance at the total sample size rather than a subsample size. Thirdly, Learn2Evaluate allows the computation of a theoretically justified and useful lower confidence bound. Furthermore, this bound may be tightened by performing a bias correction. The benefits of Learn2Evaluate are illustrated by a simulation study and applications to omics data.

Submitted 8 June, 2022; originally announced June 2022.

Comments: 19 pages, 2 figures, 2 tables

Showing 1–3 of 3 results for author: Goedhart, J