-
Evaluating representation learning on the protein structure universe
Authors:
Arian R. Jamasb,
Alex Morehead,
Chaitanya K. Joshi,
Zuobai Zhang,
Kieran Didi,
Simon V. Mathis,
Charles Harris,
Jian Tang,
Jianlin Cheng,
Pietro Lio,
Tom L. Blundell
Abstract:
We introduce ProteinWorkshop, a comprehensive benchmark suite for representation learning on protein structures with Geometric Graph Neural Networks. We consider large-scale pre-training and downstream tasks on both experimental and predicted structures to enable the systematic evaluation of the quality of the learned structural representation and their usefulness in capturing functional relationships for downstream tasks. We find that: (1) large-scale pretraining on AlphaFold structures and auxiliary tasks consistently improve the performance of both rotation-invariant and equivariant GNNs, and (2) more expressive equivariant GNNs benefit from pretraining to a greater extent compared to invariant models. We aim to establish a common ground for the machine learning and computational biology communities to rigorously compare and advance protein structure representation learning. Our open-source codebase reduces the barrier to entry for working with large protein structure datasets by providing: (1) storage-efficient dataloaders for large-scale structural databases including AlphaFoldDB and ESM Atlas, as well as (2) utilities for constructing new tasks from the entire PDB. ProteinWorkshop is available at: github.com/a-r-j/ProteinWorkshop.
Submitted 19 June, 2024;
originally announced June 2024.
-
Benchmarking Generated Poses: How Rational is Structure-based Drug Design with Generative Models?
Authors:
Charles Harris,
Kieran Didi,
Arian R. Jamasb,
Chaitanya K. Joshi,
Simon V. Mathis,
Pietro Lio,
Tom Blundell
Abstract:
Deep generative models for structure-based drug design (SBDD), where molecule generation is conditioned on a 3D protein pocket, have received considerable interest in recent years. These methods offer the promise of higher-quality molecule generation by explicitly modelling the 3D interaction between a potential drug and a protein receptor. However, previous work has primarily focused on the quality of the generated molecules themselves, with limited evaluation of the 3D molecule poses that these methods produce, with most work simply discarding the generated pose and only reporting a "corrected" pose after redocking with traditional methods. Little is known about whether generated molecules satisfy known physical constraints for binding and the extent to which redocking alters the generated interactions. We introduce PoseCheck, an extensive analysis of multiple state-of-the-art methods and find that generated molecules have significantly more physical violations and fewer key interactions compared to baselines, calling into question the implicit assumption that providing rich 3D structure information improves molecule complementarity. We make recommendations for future research tackling identified failure modes and hope our benchmark can serve as a springboard for future SBDD generative modelling work to have a real-world impact.
Submitted 14 August, 2023;
originally announced August 2023.
-
Structure-based Drug Design with Equivariant Diffusion Models
Authors:
Arne Schneuing,
Yuanqi Du,
Charles Harris,
Arian Jamasb,
Ilia Igashov,
Weitao Du,
Tom Blundell,
Pietro Liò,
Carla Gomes,
Max Welling,
Michael Bronstein,
Bruno Correia
Abstract:
Structure-based drug design (SBDD) aims to design small-molecule ligands that bind with high affinity and specificity to pre-determined protein targets. In this paper, we formulate SBDD as a 3D-conditional generation problem and present DiffSBDD, an SE(3)-equivariant 3D-conditional diffusion model that generates novel ligands conditioned on protein pockets. Comprehensive in silico experiments demonstrate the efficiency and effectiveness of DiffSBDD in generating novel and diverse drug-like ligands with competitive docking scores. We further explore the flexibility of the diffusion framework for a broader range of tasks in drug design campaigns, such as off-the-shelf property optimization and partial molecular design with inpainting.
Submitted 30 June, 2023; v1 submitted 24 October, 2022;
originally announced October 2022.
-
Utilising Graph Machine Learning within Drug Discovery and Development
Authors:
Thomas Gaudelet,
Ben Day,
Arian R. Jamasb,
Jyothish Soman,
Cristian Regep,
Gertrude Liu,
Jeremy B. R. Hayter,
Richard Vickers,
Charles Roberts,
Jian Tang,
David Roblin,
Tom L. Blundell,
Michael M. Bronstein,
Jake P. Taylor-King
Abstract:
Graph Machine Learning (GML) is receiving growing interest within the pharmaceutical and biotechnology industries for its ability to model biomolecular structures, the functional relationships between them, and integrate multi-omic datasets - amongst other data types. Herein, we present a multidisciplinary academic-industrial review of the topic within the context of drug discovery and development. After introducing key terms and modelling approaches, we move chronologically through the drug development pipeline to identify and summarise work incorporating: target identification, design of small molecules and biologics, and drug repurposing. Whilst the field is still emerging, key milestones, including repurposed drugs entering in vivo studies, suggest graph machine learning will become a modelling framework of choice within biomedical machine learning.
Submitted 10 February, 2021; v1 submitted 9 December, 2020;
originally announced December 2020.
-
RNA sampling and crystallographic refinement using Rappertk
Authors:
Swanand Gore,
Tom Blundell
Abstract:
Background. Dramatic increases in RNA structural data have made it possible to recognize its conformational preferences much better than a decade ago. This has created an opportunity to use discrete restraint-based conformational sampling for modelling RNA and automating its crystallographic refinement. Results. All-atom sampling of entire RNA chains, termini and loops is achieved using the Richardson RNA backbone rotamer library and an unbiased distribution for glycosidic dihedral angle. Sampling behaviour of Rappertk on a diverse dataset of RNA chains under varying spatial restraints is benchmarked. The iterative composite crystallographic refinement protocol developed here is demonstrated to outperform CNS-only refinement on parts of tRNA(Asp) structure. Conclusion. This work opens exciting possibilities for further work in RNA modelling and crystallography.
Submitted 21 October, 2007;
originally announced October 2007.
-
Crystallographic modelling of protein loops and their heterogeneity with Rappertk
Authors:
Swanand Gore,
Tom Blundell
Abstract:
Background. All-atom crystallographic refinement of proteins is a laborious manually driven procedure, as a result of which, alternative and multiconformer interpretations are not routinely investigated.
Results. We describe efficient loop sampling procedures in Rappertk and demonstrate that single loops in proteins can be automatically and accurately modelled with few positional restraints. Loops constructed with a composite CNS/Rappertk protocol consistently have better Rfree than those with CNS alone. This approach is extended to a more realistic scenario where there are often large positional uncertainties in loops along with small imperfections in the secondary structural framework. Both ensemble and collection methods are used to estimate the structural heterogeneity of loop regions.
Conclusion. Apart from benchmarking Rappertk for the all-atom protein refinement task, this work also demonstrates its utility in both aspects of loop modelling - building a single conformer and estimating structural heterogeneity the loops can exhibit.
Submitted 21 October, 2007;
originally announced October 2007.
-
Identification of specificity determining residues in enzymes using environment specific substitution tables
Authors:
Swanand Gore,
Tom Blundell
Abstract:
Environment specific substitution tables have been used effectively for distinguishing structural and functional constraints on proteins and thereby identify their active sites (Chelliah et al. (2004)). This work explores whether a similar approach can be used to identify specificity determining residues (SDRs) responsible for cofactor dependence, substrate specificity or subtle catalytic variations. We combine structure-sequence information and functional annotation from various data sources to create structural alignments for homologous enzymes and functional partitions therein. We develop a scoring procedure to predict SDRs and assess their accuracy using information from bound specific ligands and published literature.
Submitted 15 October, 2007;
originally announced October 2007.
-
Comparative analysis of protein structure using multiscale additive functionals
Authors:
Marconi Soares Barbosa,
Rinaldo Wander Montalvao,
Tom Blundell,
Luciano da Fontoura Costa
Abstract:
This work reports a new methodology aimed at describing characteristics of protein structural shapes, and suggests a framework in which to resolve or classify automatically such structures into known families. This new approach to protein structure characterization is based on elements of integral geometry using biologically relevant measurements of shape and considering them on a multi-scale representation, which aligns the proposed methodology with the recently reported "tube picture" of a protein structure as a minimal representation model. The method has been applied with good results to a subset of protein structures known to be especially challenging to revert into families, confirming the potential of the proposed method for accurate structure classification.
Submitted 14 January, 2007;
originally announced January 2007.