Search | arXiv e-print repository

Inverse Protein Folding Using Deep Bayesian Optimization

Authors: Natalie Maus, Yimeng Zeng, Daniel Allen Anderson, Phillip Maffettone, Aaron Solomon, Peyton Greenside, Osbert Bastani, Jacob R. Gardner

Abstract: Inverse protein folding -- the task of predicting a protein sequence from its backbone atom coordinates -- has surfaced as an important problem in the "top down", de novo design of proteins. Contemporary approaches have cast this problem as a conditional generative modelling problem, where a large generative model over protein sequences is conditioned on the backbone. While these generative models… ▽ More Inverse protein folding -- the task of predicting a protein sequence from its backbone atom coordinates -- has surfaced as an important problem in the "top down", de novo design of proteins. Contemporary approaches have cast this problem as a conditional generative modelling problem, where a large generative model over protein sequences is conditioned on the backbone. While these generative models very rapidly produce promising sequences, independent draws from generative models may fail to produce sequences that reliably fold to the correct backbone. Furthermore, it is challenging to adapt pure generative approaches to other settings, e.g., when constraints exist. In this paper, we cast the problem of improving generated inverse folds as an optimization problem that we solve using recent advances in "deep" or "latent space" Bayesian optimization. Our approach consistently produces protein sequences with greatly reduced structural error to the target backbone structure as measured by TM score and RMSD while using fewer computational resources. Additionally, we demonstrate other advantages of an optimization-based approach to the problem, such as the ability to handle constraints. △ Less

Submitted 24 May, 2023; originally announced May 2023.

arXiv:2203.12742 [pdf, other]

Accelerating Bayesian Optimization for Biological Sequence Design with Denoising Autoencoders

Authors: Samuel Stanton, Wesley Maddox, Nate Gruver, Phillip Maffettone, Emily Delaney, Peyton Greenside, Andrew Gordon Wilson

Abstract: Bayesian optimization (BayesOpt) is a gold standard for query-efficient continuous optimization. However, its adoption for drug design has been hindered by the discrete, high-dimensional nature of the decision variables. We develop a new approach (LaMBO) which jointly trains a denoising autoencoder with a discriminative multi-task Gaussian process head, allowing gradient-based optimization of mult… ▽ More Bayesian optimization (BayesOpt) is a gold standard for query-efficient continuous optimization. However, its adoption for drug design has been hindered by the discrete, high-dimensional nature of the decision variables. We develop a new approach (LaMBO) which jointly trains a denoising autoencoder with a discriminative multi-task Gaussian process head, allowing gradient-based optimization of multi-objective acquisition functions in the latent space of the autoencoder. These acquisition functions allow LaMBO to balance the explore-exploit tradeoff over multiple design rounds, and to balance objective tradeoffs by optimizing sequences at many different points on the Pareto frontier. We evaluate LaMBO on two small-molecule design tasks, and introduce new tasks optimizing \emph{in silico} and \emph{in vitro} properties of large-molecule fluorescent proteins. In our experiments LaMBO outperforms genetic optimizers and does not require a large pretraining corpus, demonstrating that BayesOpt is practical and effective for biological sequence design. △ Less

Submitted 12 July, 2022; v1 submitted 23 March, 2022; originally announced March 2022.

Comments: ICML 2022. Code available at https://github.com/samuelstanton/lambo

arXiv:2007.10263 [pdf, other]

A Hierarchical Approach to Scaling Batch Active Search Over Structured Data

Authors: Vivek Myers, Peyton Greenside

Abstract: Active search is the process of identifying high-value data points in a large and often high-dimensional parameter space that can be expensive to evaluate. Traditional active search techniques like Bayesian optimization trade off exploration and exploitation over consecutive evaluations, and have historically focused on single or small (<5) numbers of examples evaluated per round. As modern data s… ▽ More Active search is the process of identifying high-value data points in a large and often high-dimensional parameter space that can be expensive to evaluate. Traditional active search techniques like Bayesian optimization trade off exploration and exploitation over consecutive evaluations, and have historically focused on single or small (<5) numbers of examples evaluated per round. As modern data sets grow, so does the need to scale active search to large data sets and batch sizes. In this paper, we present a general hierarchical framework based on bandit algorithms to scale active search to large batch sizes by maximizing information derived from the unique structure of each dataset. Our hierarchical framework, Hierarchical Batch Bandit Search (HBBS), strategically distributes batch selection across a learned embedding space by facilitating wide exploration of different structural elements within a dataset. We focus our application of HBBS on modern biology, where large batch experimentation is often fundamental to the research process, and demonstrate batch design of biological sequences (protein and DNA). We also present a new Gym environment to easily simulate diverse biological sequences and to enable more comprehensive evaluation of active search methods across heterogeneous data sets. The HBBS framework improves upon standard performance, wall-clock, and scalability benchmarks for batch search by using a broad exploration strategy across coarse partitions and fine-grained exploitation within each partition of structured data. △ Less

Submitted 20 July, 2020; originally announced July 2020.

Comments: Presented at the 2020 ICML Workshop on Real World Experiment Design and Active Learning

arXiv:1704.02685 [pdf, other]

Learning Important Features Through Propagating Activation Differences

Authors: Avanti Shrikumar, Peyton Greenside, Anshul Kundaje

Abstract: The purported "black box" nature of neural networks is a barrier to adoption in applications where interpretability is essential. Here we present DeepLIFT (Deep Learning Important FeaTures), a method for decomposing the output prediction of a neural network on a specific input by backpropagating the contributions of all neurons in the network to every feature of the input. DeepLIFT compares the ac… ▽ More The purported "black box" nature of neural networks is a barrier to adoption in applications where interpretability is essential. Here we present DeepLIFT (Deep Learning Important FeaTures), a method for decomposing the output prediction of a neural network on a specific input by backpropagating the contributions of all neurons in the network to every feature of the input. DeepLIFT compares the activation of each neuron to its 'reference activation' and assigns contribution scores according to the difference. By optionally giving separate consideration to positive and negative contributions, DeepLIFT can also reveal dependencies which are missed by other approaches. Scores can be computed efficiently in a single backward pass. We apply DeepLIFT to models trained on MNIST and simulated genomic data, and show significant advantages over gradient-based methods. Video tutorial: http://goo.gl/qKb7pL, ICML slides: bit.ly/deeplifticmlslides, ICML talk: https://vimeo.com/238275076, code: http://goo.gl/RM8jvH. △ Less

Submitted 12 October, 2019; v1 submitted 9 April, 2017; originally announced April 2017.

Comments: Updated to include changes present in the ICML camera-ready paper, and other small corrections

Journal ref: PMLR 70:3145-3153, 2017

arXiv:1605.01713 [pdf, other]

Not Just a Black Box: Learning Important Features Through Propagating Activation Differences

Authors: Avanti Shrikumar, Peyton Greenside, Anna Shcherbina, Anshul Kundaje

Abstract: Note: This paper describes an older version of DeepLIFT. See https://arxiv.longhoe.net/abs/1704.02685 for the newer version. Original abstract follows: The purported "black box" nature of neural networks is a barrier to adoption in applications where interpretability is essential. Here we present DeepLIFT (Learning Important FeaTures), an efficient and effective method for computing importance scores in a… ▽ More Note: This paper describes an older version of DeepLIFT. See https://arxiv.longhoe.net/abs/1704.02685 for the newer version. Original abstract follows: The purported "black box" nature of neural networks is a barrier to adoption in applications where interpretability is essential. Here we present DeepLIFT (Learning Important FeaTures), an efficient and effective method for computing importance scores in a neural network. DeepLIFT compares the activation of each neuron to its 'reference activation' and assigns contribution scores according to the difference. We apply DeepLIFT to models trained on natural images and genomic data, and show significant advantages over gradient-based methods. △ Less

Submitted 11 April, 2017; v1 submitted 5 May, 2016; originally announced May 2016.

Comments: 6 pages, 3 figures, this is an older version; see https://arxiv.longhoe.net/abs/1704.02685 for the newer version

Showing 1–5 of 5 results for author: Greenside, P