Showing 1–2 of 2 results for author: John, P C S

Search v0.5.6 released 2020-02-24

arXiv:2212.09925 [pdf, other]

cs.LG q-bio.BM

doi 10.1088/2632-2153/accacd

Plug & Play Directed Evolution of Proteins with Gradient-based Discrete MCMC

Authors: Patrick Emami, Aidan Perreault, Jeffrey Law, David Biagioni, Peter C. St. John

Abstract: A long-standing goal of machine-learning-based protein engineering is to accelerate the discovery of novel mutations that improve the function of a known protein. We introduce a sampling framework for evolving proteins in silico that supports mixing and matching a variety of unsupervised models, such as protein language models, and supervised models that predict protein function from sequence. By composing these models, we aim to improve our ability to evaluate unseen mutations and constrain search to regions of sequence space likely to contain functional proteins. Our framework achieves this without any model fine-tuning or re-training by constructing a product of experts distribution directly in discrete protein space. Instead of resorting to brute force search or random sampling, which is typical of classic directed evolution, we introduce a fast MCMC sampler that uses gradients to propose promising mutations. We conduct in silico directed evolution experiments on wide fitness landscapes and across a range of different pre-trained unsupervised models, including a 650M parameter protein language model. Our results demonstrate an ability to efficiently discover variants with high evolutionary likelihood as well as estimated activity multiple mutations away from a wild type protein, suggesting our sampler provides a practical and effective new paradigm for machine-learning-based protein engineering.

Submitted 6 April, 2023; v1 submitted 19 December, 2022; originally announced December 2022.

Comments: 31 pages, 8 figures. To appear in the Machine Learning: Science & Technology (ML:S&T) journal. Code is available at https://github.com/pemami4911/ppde. A short version of this work appeared at the NeurIPS 2022 Machine Learning in Structural Biology Workshop.

arXiv:1807.10363 [pdf, other]

physics.comp-ph cs.LG stat.ML

doi 10.1063/1.5099132

Message-passing neural networks for high-throughput polymer screening

Authors: Peter C. St. John, Caleb Phillips, Travis W. Kemper, A. Nolan Wilson, Michael F. Crowley, Mark R. Nimlos, Ross E. Larsen

Abstract: Machine learning methods have shown promise in predicting molecular properties, and given sufficient training data machine learning approaches can enable rapid high-throughput virtual screening of large libraries of compounds. Graph-based neural network architectures have emerged in recent years as the most successful approach for predictions based on molecular structure, and have consistently achieved the best performance on benchmark quantum chemical datasets. However, these models have typically required optimized 3D structural information for the molecule to achieve the highest accuracy. These 3D geometries are costly to compute for high levels of theory, limiting the applicability and practicality of machine learning methods in high-throughput screening applications. In this study, we present a new database of candidate molecules for organic photovoltaic applications, comprising approximately 91,000 unique chemical structures. Compared to existing datasets, this dataset contains substantially larger molecules (up to 200 atoms) as well as extrapolated properties for long polymer chains. We show that message-passing neural networks trained with and without 3D structural information for these molecules achieve similar accuracy, comparable to state-of-the-art methods on existing benchmark datasets. These results therefore emphasize that for larger molecules with practical applications, near-optimal prediction results can be obtained without using optimized 3D geometry as an input. We further show that learned molecular representations can be leveraged to reduce the training data required to transfer predictions to a new DFT functional.

Submitted 5 April, 2019; v1 submitted 26 July, 2018; originally announced July 2018.

Comments: 7 pages, 3 figures
