
PoET: A generative model of protein families as sequences-of-sequences

Authors: Timothy F. Truong Jr, Tristan Bepler

Abstract: Generative protein language models are a natural way to design new proteins with desired functions. However, current models are either difficult to direct to produce a protein from a specific family of interest, or must be trained on a large multiple sequence alignment (MSA) from the specific family of interest, making them unable to benefit from transfer learning across families. To address this, we propose $\textbf{P}$r$\textbf{o}$tein $\textbf{E}$volutionary $\textbf{T}$ransformer (PoET), an autoregressive generative model of whole protein families that learns to generate sets of related proteins as sequences-of-sequences across tens of millions of natural protein sequence clusters. PoET can be used as a retrieval-augmented language model to generate and score arbitrary modifications conditioned on any protein family of interest, and can extrapolate from short context lengths to generalize well even for small families. This is enabled by a unique Transformer layer; we model tokens sequentially within sequences while attending between sequences order invariantly, allowing PoET to scale to context lengths beyond those used during training. In extensive experiments on deep mutational scanning datasets, we show that PoET outperforms existing protein language models and evolutionary sequence models for variant function prediction across proteins of all MSA depths. We also demonstrate PoET's ability to controllably generate new protein sequences.

Submitted 1 November, 2023; v1 submitted 9 June, 2023; originally announced June 2023.

Journal ref: Advances in Neural Information Processing Systems (Vol. 36), 2023

arXiv:2210.02881 [pdf, other]

Antibody Representation Learning for Drug Discovery

Authors: Lin Li, Esther Gupta, John Spaeth, Leslie Shing, Tristan Bepler, Rajmonda Sulo Caceres

Abstract: Therapeutic antibody development has become an increasingly popular approach for drug development. To date, antibody therapeutics are largely developed using large scale experimental screens of antibody libraries containing hundreds of millions of antibody sequences. The high cost and difficulty of developing therapeutic antibodies create a pressing need for computational methods to predict antibody properties and create bespoke designs. However, the relationship between antibody sequence and activity is a complex physical process and traditional iterative design approaches rely on large scale assays and random mutagenesis. Deep learning methods have emerged as a promising way to learn antibody property predictors, but predicting antibody properties and target-specific activities depends critically on the choice of antibody representations and data linking sequences to properties is often limited. Existing works have not yet investigated the value, limitations and opportunities of these methods in application to antibody-based drug discovery. In this paper, we present results on a novel SARS-CoV-2 antibody binding dataset and an additional benchmark dataset. We compare three classes of models: conventional statistical sequence models, supervised learning on each dataset independently, and fine-tuning an antibody specific pre-trained language model. Experimental results suggest that self-supervised pretraining of feature representation consistently offers significant improvement over previous approaches. We also investigate the impact of data size on the model performance, and discuss challenges and opportunities that the machine learning community can address to advance in silico engineering and design of therapeutic antibodies.

Submitted 5 October, 2022; originally announced October 2022.

arXiv:2204.01168 [pdf, other]

Few Shot Protein Generation

Authors: Soumya Ram, Tristan Bepler

Abstract: We present the MSA-to-protein transformer, a generative model of protein sequences conditioned on protein families represented by multiple sequence alignments (MSAs). Unlike existing approaches to learning generative models of protein families, the MSA-to-protein transformer conditions sequence generation directly on a learned encoding of the multiple sequence alignment, circumventing the need for fitting dedicated family models. By training on a large set of well-curated multiple sequence alignments in Pfam, our MSA-to-protein transformer generalizes well to protein families not observed during training and outperforms conventional family modeling approaches, especially when MSAs are small. Our generative approach accurately models epistasis and indels and allows for exact inference and efficient sampling unlike other approaches. We demonstrate the protein sequence modeling capabilities of our MSA-to-protein transformer and compare it with alternative sequence modeling approaches in comprehensive benchmark experiments.

Submitted 3 April, 2022; originally announced April 2022.

arXiv:2112.01534 [pdf]

Learning to automate cryo-electron microscopy data collection with Ptolemy

Authors: Paul T. Kim, Alex J. Noble, Anchi Cheng, Tristan Bepler

Abstract: Over the past decade, cryogenic electron microscopy (cryo-EM) has emerged as a primary method for determining near-native, near-atomic resolution 3D structures of biological macromolecules. In order to meet increasing demand for cryo-EM, automated methods to improve throughput and efficiency while lowering costs are needed. Currently, all high-magnification cryo-EM data collection softwares require human input and manual tuning of parameters. Expert operators must navigate low- and medium-magnification images to find good high-magnification collection locations. Automating this is non-trivial: the images suffer from low signal-to-noise ratio and are affected by a range of experimental parameters that can differ for each collection session. Here, we use various computer vision algorithms, including mixture models, convolutional neural networks, and U-Nets to develop the first pipeline to automate low- and medium-magnification targeting. Learned models in this pipeline are trained on a large internal dataset of images from real world cryo-EM data collection sessions, labeled with locations that were selected by operators. Using these models, we show that we can effectively detect and classify regions of interest in low- and medium-magnification images, and can generalize to unseen sessions, as well as to images captured using different microscopes from external facilities. We expect our open-source pipeline, Ptolemy, will be both immediately useful as a tool for automation of cryo-EM data collection, and serve as a foundation for future advanced methods for efficient and automated cryo-EM microscopy.

Submitted 14 January, 2022; v1 submitted 1 December, 2021; originally announced December 2021.

Comments: Main: 12 pages, 11 figures. Appendix: 2 pages, 1 figure

ACM Class: I.4.9; J.3

arXiv:1909.11663 [pdf, other]

Explicitly disentangling image content from translation and rotation with spatial-VAE

Authors: Tristan Bepler, Ellen D. Zhong, Kotaro Kelley, Edward Brignole, Bonnie Berger

Abstract: Given an image dataset, we are often interested in finding data generative factors that encode semantic content independently from pose variables such as rotation and translation. However, current disentanglement approaches do not impose any specific structure on the learned latent representations. We propose a method for explicitly disentangling image rotation and translation from other unstructured latent factors in a variational autoencoder (VAE) framework. By formulating the generative model as a function of the spatial coordinate, we make the reconstruction error differentiable with respect to latent translation and rotation parameters. This formulation allows us to train a neural network to perform approximate inference on these latent variables while explicitly constraining them to only represent rotation and translation. We demonstrate that this framework, termed spatial-VAE, effectively learns latent representations that disentangle image rotation and translation from content and improves reconstruction over standard VAEs on several benchmark datasets, including applications to modeling continuous 2-D views of proteins from single particle electron microscopy and galaxies in astronomical images.

Submitted 25 September, 2019; originally announced September 2019.

Comments: 11 pages, 6 figures, to appear in the 33rd Conference on Neural Information Processing Systems (NeurIPS 2019)

arXiv:1909.05215 [pdf, other]

Reconstructing continuous distributions of 3D protein structure from cryo-EM images

Authors: Ellen D. Zhong, Tristan Bepler, Joseph H. Davis, Bonnie Berger

Abstract: Cryo-electron microscopy (cryo-EM) is a powerful technique for determining the structure of proteins and other macromolecular complexes at near-atomic resolution. In single particle cryo-EM, the central problem is to reconstruct the three-dimensional structure of a macromolecule from $10^{4-7}$ noisy and randomly oriented two-dimensional projections. However, the imaged protein complexes may exhibit structural variability, which complicates reconstruction and is typically addressed using discrete clustering approaches that fail to capture the full range of protein dynamics. Here, we introduce a novel method for cryo-EM reconstruction that extends naturally to modeling continuous generative factors of structural heterogeneity. This method encodes structures in Fourier space using coordinate-based deep neural networks, and trains these networks from unlabeled 2D cryo-EM images by combining exact inference over image orientation with variational inference for structural heterogeneity. We demonstrate that the proposed method, termed cryoDRGN, can perform ab initio reconstruction of 3D protein complexes from simulated and real 2D cryo-EM image data. To our knowledge, cryoDRGN is the first neural network-based approach for cryo-EM reconstruction and the first end-to-end method for directly reconstructing continuous ensembles of protein structures from cryo-EM images.

Submitted 14 February, 2020; v1 submitted 11 September, 2019; originally announced September 2019.

Journal ref: International Conference on Learning Representations (ICLR), 2020

arXiv:1902.08661 [pdf, other]

Learning protein sequence embeddings using information from structure

Authors: Tristan Bepler, Bonnie Berger

Abstract: Inferring the structural properties of a protein from its amino acid sequence is a challenging yet important problem in biology. Structures are not known for the vast majority of protein sequences, but structure is critical for understanding function. Existing approaches for detecting structural similarity between proteins from sequence are unable to recognize and exploit structural patterns when sequences have diverged too far, limiting our ability to transfer knowledge between structurally related proteins. We newly approach this problem through the lens of representation learning. We introduce a framework that maps any protein sequence to a sequence of vector embeddings (one per amino acid position) that encode structural information. We train bidirectional long short-term memory (LSTM) models on protein sequences with a two-part feedback mechanism that incorporates information from (i) global structural similarity between proteins and (ii) pairwise residue contact maps for individual proteins. To enable learning from structural similarity information, we define a novel similarity measure between arbitrary-length sequences of vector embeddings based on a soft symmetric alignment (SSA) between them. Our method is able to learn useful position-specific embeddings despite lacking direct observations of position-level correspondence between sequences. We show empirically that our multi-task framework outperforms other sequence-based methods and even a top-performing structure-based alignment method when predicting structural similarity, our goal. Finally, we demonstrate that our learned embeddings can be transferred to other protein sequence problems, improving the state-of-the-art in transmembrane domain prediction.

Submitted 16 October, 2019; v1 submitted 22 February, 2019; originally announced February 2019.

Comments: 17 pages, 3 figures, 8 tables, proceedings of ICLR 2019

Journal ref: International Conference on Learning Representations, 2019

arXiv:1803.08207 [pdf]

doi 10.1038/s41592-019-0575-8

Positive-unlabeled convolutional neural networks for particle picking in cryo-electron micrographs

Authors: Tristan Bepler, Andrew Morin, Julia Brasch, Lawrence Shapiro, Alex J. Noble, Bonnie Berger

Abstract: Cryo-electron microscopy (cryoEM) is an increasingly popular method for protein structure determination. However, identifying a sufficient number of particles for analysis (often >100,000) can take months of manual effort. Current computational approaches are limited by high false positive rates and require significant ad-hoc post-processing, especially for unusually shaped particles. To address this shortcoming, we develop Topaz, an efficient and accurate particle picking pipeline using neural networks trained with few labeled particles by newly leveraging the remaining unlabeled particles through the framework of positive-unlabeled (PU) learning. Remarkably, despite using minimal labeled particles, Topaz allows us to improve reconstruction resolution by up to 0.15 Å over published particles on three public cryoEM datasets without any post-processing. Furthermore, we show that our novel generalized-expectation criteria approach to PU learning outperforms existing general PU learning approaches when applied to particle detection, especially for challenging datasets of non-globular proteins. We expect Topaz to be an essential component of cryoEM analysis.

Submitted 8 October, 2018; v1 submitted 21 March, 2018; originally announced March 2018.

Comments: 43 pages, 5 main figures, 6 supplemental figures

Journal ref: Nature Methods (2019)

Showing 1–8 of 8 results for author: Bepler, T