-
Inverse Protein Folding Using Deep Bayesian Optimization
Authors:
Natalie Maus,
Yimeng Zeng,
Daniel Allen Anderson,
Phillip Maffettone,
Aaron Solomon,
Peyton Greenside,
Osbert Bastani,
Jacob R. Gardner
Abstract:
Inverse protein folding -- the task of predicting a protein sequence from its backbone atom coordinates -- has surfaced as an important problem in the "top down", de novo design of proteins. Contemporary approaches have cast this problem as a conditional generative modelling problem, where a large generative model over protein sequences is conditioned on the backbone. While these generative models…
▽ More
Inverse protein folding -- the task of predicting a protein sequence from its backbone atom coordinates -- has surfaced as an important problem in the "top down", de novo design of proteins. Contemporary approaches have cast this problem as a conditional generative modelling problem, where a large generative model over protein sequences is conditioned on the backbone. While these generative models very rapidly produce promising sequences, independent draws from generative models may fail to produce sequences that reliably fold to the correct backbone. Furthermore, it is challenging to adapt pure generative approaches to other settings, e.g., when constraints exist. In this paper, we cast the problem of improving generated inverse folds as an optimization problem that we solve using recent advances in "deep" or "latent space" Bayesian optimization. Our approach consistently produces protein sequences with greatly reduced structural error to the target backbone structure as measured by TM score and RMSD while using fewer computational resources. Additionally, we demonstrate other advantages of an optimization-based approach to the problem, such as the ability to handle constraints.
△ Less
Submitted 24 May, 2023;
originally announced May 2023.
-
Self-driving Multimodal Studies at User Facilities
Authors:
Phillip M. Maffettone,
Daniel B. Allan,
Stuart I. Campbell,
Matthew R. Carbone,
Thomas A. Caswell,
Brian L. DeCost,
Dmitri Gavrilov,
Marcus D. Hanwell,
Howie Joress,
Joshua Lynch,
Bruce Ravel,
Stuart B. Wilkins,
Jakub Wlodek,
Daniel Olds
Abstract:
Multimodal characterization is commonly required for understanding materials. User facilities possess the infrastructure to perform these measurements, albeit in serial over days to months. In this paper, we describe a unified multimodal measurement of a single sample library at distant instruments, driven by a concert of distributed agents that use analysis from each modality to inform the direct…
▽ More
Multimodal characterization is commonly required for understanding materials. User facilities possess the infrastructure to perform these measurements, albeit in serial over days to months. In this paper, we describe a unified multimodal measurement of a single sample library at distant instruments, driven by a concert of distributed agents that use analysis from each modality to inform the direction of the other in real time. Powered by the Bluesky project at the National Synchrotron Light Source II, this experiment is a world's first for beamline science, and provides a blueprint for future approaches to multimodal and multifidelity experiments at user facilities.
△ Less
Submitted 22 January, 2023;
originally announced January 2023.
-
Accelerating Bayesian Optimization for Biological Sequence Design with Denoising Autoencoders
Authors:
Samuel Stanton,
Wesley Maddox,
Nate Gruver,
Phillip Maffettone,
Emily Delaney,
Peyton Greenside,
Andrew Gordon Wilson
Abstract:
Bayesian optimization (BayesOpt) is a gold standard for query-efficient continuous optimization. However, its adoption for drug design has been hindered by the discrete, high-dimensional nature of the decision variables. We develop a new approach (LaMBO) which jointly trains a denoising autoencoder with a discriminative multi-task Gaussian process head, allowing gradient-based optimization of mult…
▽ More
Bayesian optimization (BayesOpt) is a gold standard for query-efficient continuous optimization. However, its adoption for drug design has been hindered by the discrete, high-dimensional nature of the decision variables. We develop a new approach (LaMBO) which jointly trains a denoising autoencoder with a discriminative multi-task Gaussian process head, allowing gradient-based optimization of multi-objective acquisition functions in the latent space of the autoencoder. These acquisition functions allow LaMBO to balance the explore-exploit tradeoff over multiple design rounds, and to balance objective tradeoffs by optimizing sequences at many different points on the Pareto frontier. We evaluate LaMBO on two small-molecule design tasks, and introduce new tasks optimizing \emph{in silico} and \emph{in vitro} properties of large-molecule fluorescent proteins. In our experiments LaMBO outperforms genetic optimizers and does not require a large pretraining corpus, demonstrating that BayesOpt is practical and effective for biological sequence design.
△ Less
Submitted 12 July, 2022; v1 submitted 23 March, 2022;
originally announced March 2022.
-
Machine learning enabling high-throughput and remote operations at large-scale user facilities
Authors:
Tatiana Konstantinova,
Phillip M. Maffettone,
Bruce Ravel,
Stuart I. Campbell,
Andi M. Barbour,
Daniel Olds
Abstract:
Imaging, scattering, and spectroscopy are fundamental in understanding and discovering new functional materials. Contemporary innovations in automation and experimental techniques have led to these measurements being performed much faster and with higher resolution, thus producing vast amounts of data for analysis. These innovations are particularly pronounced at user facilities and synchrotron li…
▽ More
Imaging, scattering, and spectroscopy are fundamental in understanding and discovering new functional materials. Contemporary innovations in automation and experimental techniques have led to these measurements being performed much faster and with higher resolution, thus producing vast amounts of data for analysis. These innovations are particularly pronounced at user facilities and synchrotron light sources. Machine learning (ML) methods are regularly developed to process and interpret large datasets in real-time with measurements. However, there remain conceptual barriers to entry for the facility general user community, whom often lack expertise in ML, and technical barriers for deploying ML models. Herein, we demonstrate a variety of archetypal ML models for on-the-fly analysis at multiple beamlines at the National Synchrotron Light Source II (NSLS-II). We describe these examples instructively, with a focus on integrating the models into existing experimental workflows, such that the reader can easily include their own ML techniques into experiments at NSLS-II or facilities with a common infrastructure. The framework presented here shows how with little effort, diverse ML models operate in conjunction with feedback loops via integration into the existing Bluesky Suite for experimental orchestration and data management.
△ Less
Submitted 9 January, 2022;
originally announced January 2022.
-
Constrained non-negative matrix factorization enabling real-time insights of $\textit{in situ}$ and high-throughput experiments
Authors:
Phillip M. Maffettone,
Aidan C. Daly,
Daniel Olds
Abstract:
Non-negative Matrix Factorization (NMF) methods offer an appealing unsupervised learning method for real-time analysis of streaming spectral data in time-sensitive data collection, such as $\textit{in situ}$ characterization of materials. However, canonical NMF methods are optimized to reconstruct a full dataset as closely as possible, with no underlying requirement that the reconstruction produce…
▽ More
Non-negative Matrix Factorization (NMF) methods offer an appealing unsupervised learning method for real-time analysis of streaming spectral data in time-sensitive data collection, such as $\textit{in situ}$ characterization of materials. However, canonical NMF methods are optimized to reconstruct a full dataset as closely as possible, with no underlying requirement that the reconstruction produces components or weights representative of the true physical processes. In this work, we demonstrate how constraining NMF weights or components, provided as known or assumed priors, can provide significant improvement in revealing true underlying phenomena. We present a PyTorch based method for efficiently applying constrained NMF and demonstrate this on several synthetic examples. When applied to streaming experimentally measured spectral data, an expert researcher-in-the-loop can provide and dynamically adjust the constraints. This set of interactive priors to the NMF model can, for example, contain known or identified independent components, as well as functional expectations about the mixing of components. We demonstrate this application on measured X-ray diffraction and pair distribution function data from $\textit{in situ}$ beamline experiments. Details of the method are described, and general guidance provided to employ constrained NMF in extraction of critical information and insights during $\textit{in situ}$ and high-throughput experiments.
△ Less
Submitted 1 April, 2021;
originally announced April 2021.