-
A survey and benchmark of high-dimensional Bayesian optimization of discrete sequences
Authors:
Miguel González-Duque,
Richard Michael,
Simon Bartels,
Yevgen Zainchkovskyy,
Søren Hauberg,
Wouter Boomsma
Abstract:
Optimizing discrete black-box functions is key in several domains, e.g. protein engineering and drug design. Due to the lack of gradient information and the need for sample efficiency, Bayesian optimization is an ideal candidate for these tasks. Several methods for high-dimensional continuous and categorical Bayesian optimization have been proposed recently. However, our survey of the field reveals highly heterogeneous experimental set-ups across methods and technical barriers for the replicability and application of published algorithms to real-world tasks. To address these issues, we develop a unified framework to test a vast array of high-dimensional Bayesian optimization methods and a collection of standardized black-box functions representing real-world application domains in chemistry and biology. These two components of the benchmark are each supported by flexible, scalable, and easily extendable software libraries (poli and poli-baselines), allowing practitioners to readily incorporate new optimization objectives or discrete optimizers. Project website: https://machinelearninglifescience.github.io/hdbo_benchmark
Submitted 7 June, 2024;
originally announced June 2024.
-
A Continuous Relaxation for Discrete Bayesian Optimization
Authors:
Richard Michael,
Simon Bartels,
Miguel González-Duque,
Yevgen Zainchkovskyy,
Jes Frellsen,
Søren Hauberg,
Wouter Boomsma
Abstract:
Optimizing efficiently over discrete data with only a few available target observations is a challenge in Bayesian optimization. We propose a continuous relaxation of the objective function and show that inference and optimization can be computationally tractable. We consider in particular the optimization domain where very few observations and strict budgets exist, motivated by optimizing protein sequences for expensive-to-evaluate bio-chemical properties. The advantages of our approach are two-fold: the problem is treated in the continuous setting, and available prior knowledge over sequences can be incorporated directly. More specifically, we utilize available and learned distributions over the problem domain for a weighting of the Hellinger distance which yields a covariance function. We show that the resulting acquisition function can be optimized with either continuous or discrete optimization algorithms and empirically assess our method on two bio-chemical sequence optimization tasks.
Submitted 26 April, 2024;
originally announced April 2024.
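The abstract above describes a covariance function built from a weighted Hellinger distance. The following is only a minimal sketch of that general idea, not the authors' construction: it assumes each sequence is represented as per-site categorical distributions, uses hypothetical non-negative per-site weights, and turns the distance into a kernel by exponentiating the negative squared distance. The names `hellinger_kernel`, `one_hot`, `weights`, and `gamma` are all illustrative assumptions.

```python
import numpy as np


def one_hot(seq, alphabet):
    """Encode a string as an (L, A) array of per-site categorical distributions."""
    index = {symbol: i for i, symbol in enumerate(alphabet)}
    encoded = np.zeros((len(seq), len(alphabet)))
    for site, symbol in enumerate(seq):
        encoded[site, index[symbol]] = 1.0
    return encoded


def hellinger_kernel(P, Q, weights=None, gamma=1.0):
    """Kernel matrix exp(-gamma * sum_l w_l * H^2(P_l, Q_l)).

    P: array of shape (n, L, A); Q: array of shape (m, L, A).
    Each row along the last axis is a probability vector.
    weights: non-negative per-site weights of shape (L,) (an assumption here;
    defaults to uniform weights summing to 1).
    """
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    n_sites = P.shape[1]
    if weights is None:
        weights = np.full(n_sites, 1.0 / n_sites)
    weights = np.asarray(weights, dtype=float)

    # Squared Hellinger distance per site: 0.5 * ||sqrt(p) - sqrt(q)||^2.
    diff = np.sqrt(P)[:, None, :, :] - np.sqrt(Q)[None, :, :, :]
    h2 = 0.5 * np.sum(diff ** 2, axis=-1)  # shape (n, m, L)

    # Weighted sum over sites, then exponentiate to obtain a kernel value.
    return np.exp(-gamma * (h2 @ weights))


alphabet = "ACDE"
X = np.stack([one_hot(s, alphabet) for s in ["ACD", "ACE", "EEA"]])
K = hellinger_kernel(X, X)
```

Because the Hellinger distance is a Euclidean distance between square-rooted probability vectors, this exponentiated form is a Gaussian kernel in that transformed space, which is why the resulting matrix is positive semi-definite for non-negative weights.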
-
Modelling of flow through spatially varying porous media with application to topology optimization
Authors:
Rakotobe Michaël,
Ramalingom Delphine,
Cocquet Pierre-Henri,
Bastide Alain
Abstract:
The objective of this study is to highlight the effect of porosity variation in a topology optimization process in the field of fluid dynamics. Usually, a penalization term added to the momentum equation is used to obtain the material distribution. Every time material is added inside the computational domain, new fluid-solid interfaces are created and a porosity gradient appears. However, at present, porosity variation is not taken into account in topology optimization, and the penalization term used to locate the solid is analogous to a Darcy term used for flows in porous media. With that in mind, in this paper, we first develop an original one-domain macroscopic model for the modelling of flow through spatially varying porous media that goes beyond the scope of the Darcy regime. Next, we numerically solve a topology optimization problem and compare the results obtained with the standard model, which does not include the effect of porosity variation, with those obtained with our model. Among our results, we show for instance that the designs obtained are different but the percentage reductions of the objective functional remain quite close (below 4% of difference). In addition, we illustrate the effects of porosity and particle diameter values on the final optimized designs.
Submitted 22 April, 2020;
originally announced April 2020.
-
Using massive health insurance claims data to predict very high-cost claimants: a machine learning approach
Authors:
José M. Maisog,
Wenhong Li,
Yanchun Xu,
Brian Hurley,
Hetal Shah,
Ryan Lemberg,
Tina Borden,
Stephen Bandeian,
Melissa Schline,
Roxanna Cross,
Alan Spiro,
Russ Michael,
Alexander Gutfraind
Abstract:
Due to escalating healthcare costs, accurately predicting which patients will incur high costs is an important task for payers and providers of healthcare. High-cost claimants (HiCCs) are patients who have annual costs above $250,000 and who represent just 0.16% of the insured population but currently account for 9% of all healthcare costs. In this study, we aimed to develop a high-performance algorithm to predict HiCCs to inform a novel care management system. Using health insurance claims from 48 million people and augmented with census data, we applied machine learning to train binary classification models to calculate the personal risk of HiCC. To train the models, we developed a platform starting with 6,006 variables across all clinical and demographic dimensions and constructed over one hundred candidate models. The best model achieved an area under the receiver operating characteristic curve of 91.2%. The model exceeds the highest published performance (84%) and remains high for patients with no prior history of high-cost status (89%), who have less than a full year of enrollment (87%), or lack pharmacy claims data (88%). It attains an area under the precision-recall curve of 23.1%, and precision of 74% at a threshold of 0.99. A care management program enrolling 500 people with the highest HiCC risk is expected to treat 199 true HiCCs and generate a net savings of $7.3 million per year. Our results demonstrate that high-performing predictive models can be constructed using claims data and publicly available data alone, even for rare high-cost claimants exceeding $250,000. Our model demonstrates the transformational power of machine learning and artificial intelligence in care management, which would allow healthcare payers and providers to introduce the next generation of care management programs.
Submitted 30 December, 2019;
originally announced December 2019.
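The figures quoted in the abstract above can be related to each other with simple arithmetic. The sketch below uses only numbers stated in the abstract (500 enrollees, 199 true HiCCs, a 0.16% base rate, a 9% cost share); the derived quantities and the function name `program_summary` are illustrative and are not reported by the authors.

```python
def program_summary(enrolled, true_hiccs, base_rate, cost_share):
    """Derive simple ratios from the figures quoted in the abstract."""
    # Fraction of enrollees who are true HiCCs (precision among the top 500).
    precision = true_hiccs / enrolled
    # How much better than enrolling people at random from the population.
    lift = precision / base_rate
    # Expected number of HiCCs if the same number of people were chosen at random.
    expected_if_random = enrolled * base_rate
    # Average HiCC cost relative to the population-wide average cost per person.
    relative_cost = cost_share / base_rate
    return precision, lift, expected_if_random, relative_cost


precision, lift, expected_if_random, relative_cost = program_summary(
    enrolled=500, true_hiccs=199, base_rate=0.0016, cost_share=0.09
)
print(f"precision among enrollees: {precision:.1%}")
print(f"lift over random selection: {lift:.0f}x")
print(f"expected HiCCs under random selection: {expected_if_random:.1f}")
print(f"HiCC cost relative to average: {relative_cost:.1f}x")
```

Under these figures, roughly 40% of the 500 enrollees would be true HiCCs, compared with fewer than one expected under random selection.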