-
Scoreformer: A Surrogate Model For Large-Scale Prediction of Docking Scores
Authors:
Álvaro Ciudad,
Adrián Morales-Pastor,
Laura Malo,
Isaac Filella-Mercè,
Victor Guallar,
Alexis Molina
Abstract:
In this study, we present ScoreFormer, a novel graph transformer model designed to accurately predict molecular docking scores, thereby optimizing high-throughput virtual screening (HTVS) in drug discovery. The architecture integrates Principal Neighborhood Aggregation (PNA) and Learnable Random Walk Positional Encodings (LRWPE), enhancing the model's ability to understand complex molecular structures and their relationship with their respective docking scores. This approach significantly surpasses traditional HTVS methods and recent Graph Neural Network (GNN) models in both recovery and efficiency due to a wider coverage of the chemical space and enhanced performance. Our results demonstrate that ScoreFormer achieves competitive performance in docking score prediction and offers a substantial 1.65-fold reduction in inference time compared to existing models. We evaluated ScoreFormer across multiple datasets under various conditions, confirming its robustness and reliability in identifying potential drug candidates rapidly.
Submitted 25 June, 2024; v1 submitted 13 June, 2024;
originally announced June 2024.
-
Are Protein Language Models Compute Optimal?
Authors:
Yaiza Serrano,
Álvaro Ciudad,
Alexis Molina
Abstract:
While protein language models (pLMs) have transformed biological research, the scaling laws governing their improvement remain underexplored. By adapting methodologies from NLP scaling laws, we investigated the optimal ratio between model parameters and training tokens within a fixed compute budget. Our study reveals that pLM sizes scale sublinearly with compute budget, showing diminishing returns in performance as model size increases, and we identify a performance plateau in training loss comparable to the one found in relevant works in the field. Our findings suggest that widely-used pLMs might not be compute-optimal, indicating that larger models could achieve convergence more efficiently. Training a 35M model on a reduced token set, we attained perplexity results comparable to larger models like ESM-2 (15B) and xTrimoPGLM (100B) with a single dataset pass. This work paves the way towards more compute-efficient pLMs, democratizing their training and practical application in computational biology.
Submitted 26 June, 2024; v1 submitted 11 June, 2024;
originally announced June 2024.
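To make the parameter/token trade-off in the abstract above concrete, here is a minimal sketch. It is not the authors' method: it assumes the approximation C ≈ 6·N·D commonly used in NLP scaling-law work for dense transformers (C in FLOPs, N parameters, D training tokens), and the power-law coefficient and exponent are hypothetical placeholders, not values fitted in the paper.

```python
import math

# Assumption: training compute of a dense transformer is roughly
# C = 6 * N * D FLOPs (N parameters, D training tokens).
FLOPS_PER_PARAM_TOKEN = 6.0


def tokens_for_budget(compute_flops, n_params):
    """Tokens that can be processed with a fixed budget and model size."""
    return compute_flops / (FLOPS_PER_PARAM_TOKEN * n_params)


def optimal_params(compute_flops, coeff, exponent):
    """Power-law form N_opt = coeff * C**exponent (hypothetical coefficients)."""
    return coeff * compute_flops ** exponent


def growth_factor(exponent, compute_multiplier):
    """Factor by which N_opt grows when compute grows by compute_multiplier."""
    return compute_multiplier ** exponent


if __name__ == "__main__":
    budget = 6e18  # FLOPs, illustrative only
    print(tokens_for_budget(budget, 1e9))
    # With an exponent below 1, model size grows sublinearly with compute.
    print(growth_factor(0.5, 4.0))
```

An exponent below 1 in `optimal_params` is what "sublinear scaling" means here: multiplying the compute budget by some factor multiplies the optimal model size by a smaller factor.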
-
GeoDirDock: Guiding Docking Along Geodesic Paths
Authors:
Raúl Miñán,
Javier Gallardo,
Álvaro Ciudad,
Alexis Molina
Abstract:
This work introduces GeoDirDock (GDD), a novel approach to molecular docking that enhances the accuracy and physical plausibility of ligand docking predictions. GDD guides the denoising process of a diffusion model along geodesic paths within multiple spaces representing translational, rotational, and torsional degrees of freedom. Our method leverages expert knowledge to direct the generative modeling process, specifically targeting desired protein-ligand interaction regions. We demonstrate that GDD significantly outperforms existing blind docking methods in terms of RMSD accuracy and physicochemical pose realism. Our results indicate that incorporating domain expertise into the diffusion process leads to more biologically relevant docking predictions. Additionally, we explore the potential of GDD for lead optimization in drug discovery through angle transfer in maximal common substructure (MCS) docking, showcasing its capability to predict ligand orientations for chemically similar compounds accurately.
Submitted 9 April, 2024;
originally announced April 2024.
-
Optimizing Drug Design by Merging Generative AI With Active Learning Frameworks
Authors:
Isaac Filella-Merce,
Alexis Molina,
Marek Orzechowski,
Lucía Díaz,
Yang Ming Zhu,
Julia Vilalta Mor,
Laura Malo,
Ajay S Yekkirala,
Soumya Ray,
Victor Guallar
Abstract:
Traditional drug discovery programs are being transformed by the advent of machine learning methods. Among these, Generative AI methods (GM) have gained attention due to their ability to design new molecules and enhance specific properties of existing ones. However, current GM methods have limitations, such as low affinity towards the target, unknown ADME/PK properties, or the lack of synthetic tractability. To improve the applicability domain of GM methods, we have developed a workflow based on a variational autoencoder coupled with active learning steps. The designed GM workflow iteratively learns from molecular metrics, including drug-likeness, synthesizability, similarity, and docking scores. In addition, we included a hierarchical set of criteria based on advanced molecular modeling simulations during a final selection step. We tested our GM workflow on two model systems, CDK2 and KRAS. In both cases, our model generated chemically viable molecules with a high predicted affinity toward the targets. In particular, the proportion of high-affinity molecules inferred by our GM workflow was significantly greater than that in the training data. Notably, we also uncovered novel scaffolds significantly dissimilar to those known for each target. These results highlight the potential of our GM workflow to explore novel chemical space for specific targets, thereby opening up new possibilities for drug discovery endeavors.
Submitted 4 May, 2023;
originally announced May 2023.
-
Analysis and modeling of low frequency local field oscillations in a hippocampus circuit under osmotic challenge: the possible role of arginine vasopressin circuit for hippocampal function
Authors:
Hernan Barrio Zhang,
Mariana Marquez-Machorro,
Vito S. Hernandez,
Andres Molina,
Limei Zhang,
Tzipe Govezensky,
Rafael A. Barrio
Abstract:
Electrophysiological time series were taken simultaneously in two locations in the hippocampus of a rat brain previously described as receiving innervation from the osmosensitive vasopressinergic neurons of the hypothalamus. A hyperosmotic saline solution injection was administered during the experiment. We analyze the recorded time series using different methods. We detect a modification of the delta and theta oscillations just after the perturbation caused by the injection. We compare the quality and information that each of the methods provides, and we analyze the characteristics of the perturbation based on the hypothesis that the strength of the functional connections between the vasopressinergic hypothalamic magnocellular neurons and their target in the hippocampus is modified by the perturbation. We build a model of the hypothetical neural connections and numerically calculate the time series produced by the system when simulating the perturbation caused by the saline injection. The theoretical results resemble the experimental findings concerning the frequency and amplitude alterations of the delta and theta bands.
Submitted 8 November, 2020;
originally announced November 2020.
-
The reproductive number of Zika in municipalities of Antioquia, Colombia: stratifying the potential transmission of an ongoing epidemic
Authors:
Juan Ospina,
Doracelly Hincapie-Palacio,
Jesús Ochoa,
Adriana Molina,
Guillermo Rúa,
Dubán Pájaro,
Marcela Arrubla,
Rita Almanza,
Marlio Paredes,
Anuj Mubayi
Abstract:
Introduction: The Zika epidemic in the Americas was declared a public health emergency of international concern after its rapid spread in the region. Stratification of the potential transmission of the disease is needed to direct surveillance and disease control efforts. The goal of this research is to compare the basic reproductive number of Zika in different municipalities, from an SIR model with implicit vector dynamics, based on the daily case reporting data of Antioquia, Colombia, the second most affected country after Brazil. Methods: A simple SIR model with implicit vector dynamics was derived and used. The approximate solution of the model, expressed in terms of individuals recovered in each time unit, allowed us to estimate the model parameters, including the basic reproduction number (Ro) and its 95% confidence interval. These parameters were estimated by fitting the solution of the model to the daily reported cumulative case data from the regional surveillance system. Results: Ro was estimated for 20 municipalities, all located at less than 2200 meters above sea level. From January to April 2016, between 17 and 347 cases were reported from these municipalities. Of these, 15 municipalities had a high potential for transmission (Ro>1) and 5 had less potential (Ro<1), although in 3 of these transmission was later possible because the upper limit of the confidence interval of Ro was greater than one. Conclusion: Surveillance and control of Zika should be directed primarily to municipalities with Ro>1. Furthermore, strategies that strengthen the detection and management of cases in the remaining municipalities should benefit the whole Antioquia region.
Submitted 19 September, 2016;
originally announced September 2016.
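The Ro>1 versus Ro<1 threshold used in the abstract above can be illustrated with a textbook SIR model. This is a minimal sketch, not the authors' model: it omits the implicit vector dynamics and the fitting to cumulative case data, uses simple Euler integration on population fractions, and all parameter values (`beta`, `gamma`, initial conditions) are hypothetical.

```python
def basic_reproduction_number(beta, gamma):
    """Ro for the textbook SIR model: transmission rate over recovery rate."""
    return beta / gamma


def simulate_sir(beta, gamma, s0, i0, days, dt=0.1):
    """Euler integration of SIR on population fractions.

    Returns the final (s, i, r) and the peak infected fraction.
    """
    s, i, r = s0, i0, 1.0 - s0 - i0
    peak_i = i
    for _ in range(int(days / dt)):
        new_infections = beta * s * i * dt
        new_recoveries = gamma * i * dt
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        peak_i = max(peak_i, i)
    return s, i, r, peak_i


if __name__ == "__main__":
    # Hypothetical rates per day; only the ratio beta/gamma sets Ro.
    for beta, gamma in [(0.5, 0.25), (0.2, 0.25)]:
        ro = basic_reproduction_number(beta, gamma)
        _, _, _, peak = simulate_sir(beta, gamma, 0.999, 0.001, 200)
        print(f"Ro={ro:.2f} peak infected fraction={peak:.4f}")
```

When Ro is above one the infected fraction rises above its initial value before declining. When Ro is below one it only decays, which is why municipalities with Ro>1 are the priority for control.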